influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

InfluxDB holds TCP connections open to disconnected telegraf agents #14830

Open mattwwarren opened 5 years ago

mattwwarren commented 5 years ago

Steps to reproduce:

  1. Run influxdb
  2. Run telegraf from a number of remote hosts
  3. Stop remote hosts, start, stop (repeat a few times)
  4. On the influx host, run the following command to see an increasing number of open connections to influx from telegraf agents:
     lsof -P | grep influx | awk -F':8086' '{print $2}' | awk -F':' '{print $1}' | sort | uniq -c | sort -nk 1
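
For anyone who wants to track these counts from a script instead of the shell, the pipeline in step 4 can be approximated in Python. This is only a sketch: the function name `count_remote_peers` is made up for illustration, and it assumes the input is the text output of `lsof -P`, one entry per line. Unlike the awk version, lines that do not contain `:8086` are skipped rather than counted as blank.

```python
from collections import Counter


def count_remote_peers(lsof_lines, port=8086):
    """Count connections per remote peer, roughly mirroring the awk pipeline."""
    marker = f":{port}"
    counts = Counter()
    for line in lsof_lines:
        # Same filter as `grep influx`; additionally skip lines without the port.
        if "influx" not in line or marker not in line:
            continue
        # awk -F':8086' '{print $2}': the text right after the first ":8086".
        after_port = line.split(marker)[1]
        # awk -F':' '{print $1}': drop the peer's port and the TCP state.
        peer = after_port.split(":")[0]
        counts[peer] += 1
    # sort -nk 1: ascending by connection count.
    return sorted(counts.items(), key=lambda item: item[1])
```

Passing it the lines of `lsof -P` output yields `(peer, count)` pairs in the same order as the sample output in this report, with the peer still carrying the leading `->`.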

Expected behavior: When telegraf shuts down, influx closes open connections

Actual behavior: Influx continues to hold connections open until open file handle limits are reached

Environment info:

Config: To my knowledge, we have no custom config settings. I am happy to provide any options if specific values are useful.

Sample lsof output:

 18 ->widget-i-057bf12b491e0e34b.dev
 19 ->build-i-08f663f229cb8c4a6.dev
 19 ->cortex-i-03c69377b8cd3ff64.dev
 19 ->frontdocker_manager-i-0c88604c9074d091b.dev
 19 ->livescrape-i-0887216b1044ebae1.dev
 19 ->mail2tix-i-0a4b0b7adee0b3781.dev
 19 ->route-i-0f6cdabe9a9f9985a.dev
 19 ->routev2-i-09e584b79f68c5047.dev
 19 ->sapi-i-02c61a320842039aa.dev
 19 ->session-i-072b25e4dd0e3a25d.dev
 19 ->site-i-0bdce76332b7dfec9.dev
 19 ->static-i-0f342c1dbf8671e0c.dev
 19 ->storage-i-02979f25bbb505687.dev
 19 ->txn-i-02ef77f68772dab44.dev
 19 ->uiweb-i-01d51a1aff1a817dc.dev
 19 ->wapi-i-03c932cbd6ae7fee0.dev
 20 ->cdedocker-i-034b06ba85cbcd041.dev
 20 ->frontdocker-i-06760dffd61212d0d.dev
 20 ->middocker-i-0d3cf5df7aa445fad.dev
 20 ->mongo-i-01024efdb7ae846ad.dev
 20 ->mongo-i-069cff890881443d4.dev
 20 ->seo-i-0df60a00156b5c4f0.dev
 23 ->scheduler-i-09a8440089c187bb4.ci
 24 ->scheduler-i-0007248c09b8dd029.dev
 24 ->scheduler-i-0264a79cb9e6c261f.stage
 24 ->scheduler-i-0a371f43677d77710.pilot
 31 ->scheduler-i-0dfd32be39e68f4b7.dev

Our non-prod hosts shut down at night, leaving connections open. Prod hosts do not shut down, and their connection counts stay at 1.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mattwwarren commented 4 years ago

Please do not close this issue. This is still a weekly problem for us.

wofanli commented 4 years ago

@mattwwarren We hit the same issue as you.

It looks like InfluxDB does not enable TCP keepalive, so Linux keeps the leaked TCP connections in the ESTABLISHED state.

In our case, we have telegraf running on a remote server. The network link between the telegraf server and the InfluxDB server is lossy.

When telegraf decides to close the TCP connection, the FIN packet might be lost. After that, InfluxDB waits for the HTTP body forever, and the TCP connection is leaked.

I'm wondering whether InfluxDB could support some kind of timeout mechanism for this scenario.
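
To make the two mitigations mentioned above concrete, here is a generic socket-level sketch in Python. It is not InfluxDB's code (InfluxDB is written in Go), and the helper names `enable_keepalive` and `read_body` and the default values are invented for illustration. The first helper turns on TCP keepalive so the kernel eventually notices a dead peer; the second puts a timeout on each read so a server does not wait forever for a body that never arrives.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Enable TCP keepalive on a socket (illustrative defaults, in seconds)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are platform-specific, so set only those
    # that this Python build exposes.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # time between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def read_body(conn, length, timeout=30.0):
    """Read exactly `length` bytes, or return None if the peer stalls or closes.

    The timeout applies to each recv() call, not to the body as a whole.
    """
    conn.settimeout(timeout)
    chunks = []
    remaining = length
    try:
        while remaining > 0:
            chunk = conn.recv(remaining)
            if not chunk:
                return None  # peer closed before sending the full body
            chunks.append(chunk)
            remaining -= len(chunk)
    except socket.timeout:
        return None  # peer went silent; the caller can now close the socket
    return b"".join(chunks)
```

In a server, `enable_keepalive` would be called on each accepted connection, and the connection would be closed whenever `read_body` returns None. Either mechanism alone would release the file handle in the lost-FIN case described above; the read timeout acts sooner, while keepalive also covers connections that are idle between requests.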

prashanthjbabu commented 3 years ago

I ran into the same problem; the following thread helped me: https://github.com/influxdata/influxdb/issues/9248