influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Backoff policy for Subscriptions #5462

Closed kostasb closed 5 years ago

kostasb commented 8 years ago

Consider implementing a backoff policy for when data points cannot be delivered to InfluxDB subscriptions. A high number of connection attempts may overwhelm the system if it keeps trying to send data when the subscriber is not listening (incidents have been reported).

Also, it might be worthwhile to reformat the logging output.

[subscriber] 2016/01/27 08:00:00 write udp 127.0.0.1:59505: connection refused

Two considerations:

- UDP is connectionless, so it might be better to indicate failures as "timeout" or "udp write failed".
- It would help in troubleshooting to log the subscription's name in addition to [subscriber].

kostasb commented 8 years ago

@nathanielc Any thoughts?

nathanielc commented 8 years ago

A few thoughts....

  1. They most likely spun up Kapacitor locally for testing and pointed it at the hosted instance. This would have automatically created subscriptions for the databases in InfluxDB pointing at localhost. This is the default behavior, so it's likely we will see it a lot more. We need to think about this workflow. For non-hosted users it's the easiest no-config setup.
  2. The traffic, as you stated, is UDP and connectionless, but we do get an error from the attempt to send the packet, so we can implement backoff.
  3. The logging message is accurate in my mind and comes from the Go net package, so changing it would require detecting that specific error and rewriting it, which is not ideal.
  4. Adding the name of the subscription to the log output is a good idea.

rossmcdonald commented 8 years ago

Related to issue #5559

jsternberg commented 8 years ago

When implementing the backoff policy, should InfluxDB give up after a certain number of tries? The Wikipedia article recommends giving up after 16 attempts.

nathanielc commented 8 years ago

@jsternberg Are you planning on adding this? We should probably discuss it more first. It's been a while since this was requested, and subscriptions have changed a bit since.

In general, the plan around subscriptions was to not implement retries, so that you don't overwhelm the database itself. If you want smart retry logic, use the influxdb-relay, as it's already built in.

jsternberg commented 8 years ago

I was going through things labeled support and was going to attempt to implement them. If this is no longer needed, we should close it. I have no opinion on whether this is important at the moment.

nathanielc commented 8 years ago

@kostasb Is this still an issue? Since we now have the influxdb-relay, which can handle retries, it seems like we should recommend using the relay instead of subscriptions if retries are needed.

Also, to be clear, subscriptions currently do not have any retry logic, so there is no stampeding or other side effect of attempting retries.

kostasb commented 8 years ago

@nathanielc HArelay can be used as a queuing mechanism for any scenario that needs buffering/retry logic in Line Protocol. But from support's perspective we are reluctant to suggest the use of a tool that is not being supported by the core team, for anything other than the original purpose of custom high-availability setups. E.g. we would not be able to suggest the use of InfluxRelay to any paying customers.

Implementing this logic is more of a design decision. I am not sure whether buffering/queuing is what we want in this case, rather than just a way to back off from forwarding points to Kapacitor while the latter fails to ingest them.

russorat commented 6 years ago

I ran into this issue as well when trying to move from http to https kapacitor. Influx held onto the http subscription.

So show subscriptions returned

> show subscriptions
name: _internal
retention_policy name                                           mode destinations
---------------- ----                                           ---- ------------
monitor          kapacitor-d561ce59-c5d2-4b03-a5a4-3b21a6f6e073 ANY  [http://localhost:9092]
monitor          kapacitor-f8b76969-9431-4929-99c8-70cf9314812a ANY  [https://localhost:9092]

I was seeing this in the influx logs:

2018-03-28T16:52:40.009234Z info Post http://localhost:9092/write?consistency=&db=_internal&precision=ns&rp=monitor: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x01\x00\x02\x02\x16" {"log_id": "077O85al000", "service": "subscriber"}

and this in the kapacitor logs:

ts=2018-03-28T10:06:30.007-07:00 lvl=info msg="http request" service=http host=::1 username=- start=2018-03-28T10:06:30.005690549-07:00 method=POST uri=/write?consistency=&db=_internal&precision=ns&rp=monitor protocol=HTTP/1.1 status=204 referer=- user-agent=InfluxDBClient request-id=5bc1cd54-32aa-11e8-8021-000000000000 duration=1.820401ms
ts=2018-03-28T10:06:30.007-07:00 lvl=error msg="2018/03/28 10:06:30 http: TLS handshake error from [::1]:55794: tls: oversized record received with length 21536\n" service=http service=httpd_server_errors

I had to manually remove the subscription using drop subscription "kapacitor-d561ce59-c5d2-4b03-a5a4-3b21a6f6e073" on _internal.monitor

Somehow, between InfluxDB and Kapacitor, the old http subscription should have been removed automatically.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.