influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Telegraf stops publishing metrics to InfluxDB; All plugins take too long to collect #3629

Closed: jgitlin-bt closed this issue 6 years ago

jgitlin-bt commented 6 years ago

Bug report

After a seemingly random amount of time, Telegraf stops publishing metrics to InfluxDB over UDP. I have been experiencing this issue since Nov 2016 on both Telegraf 1.3.x and 1.4.4 on FreeBSD, on three separate servers. In the telegraf log, all collectors start to fail with

Error in plugin [inputs.$name]: took longer to collect than collection interval (10s)

I can't see anything unusual or interesting published from the Telegraf internal metrics.

This same issue has been reported in #3318, #2183, #2919, #2780, and #2870, but all of those issues were either abandoned by the reporter or conflated with several separate problems. I am opening a new issue for my specific problem, but if it's a duplicate (#3318 seems to be the closest) then please feel free to close it.

Relevant telegraf.conf:

telegraf.conf

System info:

Telegraf v1.4.4 (git: unknown unknown) running on FreeBSD 10.3-RELEASE-p24

Steps to reproduce:

  1. service telegraf start
  2. wait $random time period

Expected behavior:

Telegraf publishes metrics to InfluxDB server over UDP

Actual behavior:

Telegraf stops publishing metrics seemingly randomly, all input plugins start to fail with:

2018-01-02T17:16:50Z E! Error in plugin [inputs.processes]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.apache]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.internal]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.system]: took longer to collect than collection interval (10s)
2018-01-02T17:16:51Z E! Error in plugin [inputs.swap]: took longer to collect than collection interval (10s)
2018-01-02T17:16:52Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-02T17:16:58Z E! Error: statsd message queue full. We have dropped 1 messages so far. You may want to increase allowed_pending_messages in the config
2018-01-02T17:17:00Z E! Error in plugin [inputs.phpfpm]: took longer to collect than collection interval (10s)
2018-01-02T17:17:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (10s)

Additional info:

Full logs and stack trace

Earlier occurrence

Grafana snapshot of Internal Telegraf metrics

agnivade commented 6 years ago

Does the problem persist if you switch to using TCP? I have seen this a few times, but it's pretty rare for it to cause an issue. I have used TCP to send to InfluxDB, so I just wanted to check.

Also, can you turn off the logparser input plugin and check whether that brings any improvement? The logparser uses some heavy regexes which burn considerable CPU.

Also, if you are sending lots of data via statsd, you should increase allowed_pending_messages to 10000 or so.
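For reference, that setting lives in the statsd input section of telegraf.conf; a minimal sketch with illustrative values (not taken from your actual config):

[[inputs.statsd]]
  # Protocol and address the statsd listener binds to (illustrative values)
  protocol = "udp"
  service_address = ":8125"
  # Number of messages allowed to queue up before Telegraf starts dropping
  # them and logging "statsd message queue full"
  allowed_pending_messages = 10000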

jgitlin-bt commented 6 years ago

Just had another crash, same server, less than an hour after restarting Telegraf. logs

jgitlin-bt commented 6 years ago

@agnivade thanks for the suggestions! As a last resort for debugging I can try TCP, but the overhead of TCP is seriously undesirable in our setup; we require a UDP-based solution.

After the next crash, I'll disable logparser and see if that brings any improvement. (It's one of the most critical plugins for our usage, so I hope that's not the problem!)

I'll increase allowed_pending_messages; I have been adjusting that to try to debug this issue, with no noticeable effect so far.

jgitlin-bt commented 6 years ago

allowed_pending_messages was 10,000. I just set allowed_pending_messages = 50000 to see if that makes any difference.

jgitlin-bt commented 6 years ago

This may be related to metric_batch_size... I have been playing with that value in an attempt to debug this, and the issue happens less frequently with larger values (e.g., a server with metric_batch_size = 2000 has worked for more than 7 days).
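For context, the knobs I have been adjusting are agent-level settings in telegraf.conf; a rough sketch with illustrative values (not my exact config):

[agent]
  interval = "10s"
  flush_interval = "10s"
  # Maximum number of metrics sent to an output in a single write
  metric_batch_size = 2000
  # Maximum number of metrics buffered while an output is slow or unreachable;
  # typically a multiple of metric_batch_size
  metric_buffer_limit = 20000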

agnivade commented 6 years ago

TCP should not add much overhead if you do batching properly. Yes, metric_batch_size is certainly an important setting to tweak.

Keep us posted on your changes.

danielnelson commented 6 years ago

It sounds like, for some reason, the Accumulator is no longer being read from and its internal channel is full. In the past I have seen this caused by a stuck output, though I don't see that in your stack trace. I think increasing the pending messages will only delay this message from appearing once processing is stuck.

Can you try disabling the aggregator and see if it helps? Maybe I didn't actually solve #2914

jgitlin-bt commented 6 years ago

@danielnelson I suspected the aggregator might be the issue as well. I disabled it on our 3rd server and haven't seen a problem on that server since. I just lost metrics from server1 and disabled it there too. Here's the log and stack trace from that server.

Luckily I am not actually using the aggregator, so I really should have it disabled anyway. (I turned it on because it seemed useful, but when I started building dashboards I never actually needed its output.)
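For anyone following along, disabling it just means commenting out (or removing) the aggregator block in telegraf.conf, roughly like this (illustrative, not my exact settings):

# [[aggregators.minmax]]
#   # How often the aggregated min/max values are flushed
#   period = "30s"
#   # Whether to drop the original metrics after aggregation
#   drop_original = false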

sujeeth961991 commented 6 years ago

I am having this same issue and have seen multiple related reports, but haven't found anything that resolves it. I am not using any plugins other than inputs and am still facing this issue. Please keep us posted if anyone finds a solution.

jgitlin-bt commented 6 years ago

@sujeeth961991 so you're not using the aggregator plugin? If you post a stack trace when telegraf is stuck by sending it a SIGQUIT, does it look similar to mine?

I disabled the aggregator 5 days ago and haven't seen a single issue since, making me think that in my case that may be the issue.

epsylonix commented 6 years ago

It seems like we're experiencing this too. The aggregator plugin appears to be the problem, because there were no issues before I enabled it. The Telegraf process keeps running but doesn't report any metrics, and there are no errors in the logs; after a restart it started posting to InfluxDB again. I'll post a stack trace next time it happens.

sujeeth961991 commented 6 years ago

@jgitlin-bt I had already restarted my Telegraf agent. I will post a stack trace the next time this happens.

sgreszcz commented 6 years ago

I had the exact same problem today: Telegraf died on one server over the holiday shutdown and I had to restart it. What kind of debugging data should I collect next time it happens, before restarting?

danielnelson commented 6 years ago

@sgreszcz Kill Telegraf with SIGQUIT and it should print a stack trace. Are you also using an aggregator?

jgitlin-bt commented 6 years ago

Since disabling the aggregator I have had 6 days with no lockups, which is a record. Over the past 30 days I never made it more than 5 days without at least one server deadlocking. I'm going to start re-enabling other inputs I had disabled (like disk, network, etc.) and see if things still look OK. Updates to follow.

sgreszcz commented 6 years ago

@danielnelson I'm not using the aggregator, just collectd (socket listener), docker, and internal. I have the same configuration on six servers collecting the same data (but from different devices), and only one of my Telegraf instances has blown up so far with "took longer to collect than collection interval (10s)". It seems to be only the local inputs that are complaining, not the collectd forwarder using the socket_listener input.

Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.system]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.internal]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.processes]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.linux_sysctl_fs]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.diskio]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.kernel]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.docker]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.disk]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.swap]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (10s)

Loaded inputs: inputs.linux_sysctl_fs inputs.disk inputs.diskio inputs.swap inputs.system inputs.docker inputs.net inputs.socket_listener inputs.cpu inputs.kernel inputs.mem inputs.processes inputs.internal

jgitlin-bt commented 6 years ago

That's interesting, @sgreszcz, because for me every single input stops working, even the internal Telegraf one when I have it enabled. The host stops publishing metrics altogether. We may have two separate issues, I am not sure. (I don't use collectd or docker; I do use the statsd input.)

danielnelson commented 6 years ago

I wouldn't expect socket_listener to complain, since it is a "service input". This means that instead of being called each interval it is event driven: it adds new metrics when it receives data on its socket.

I suspect that, even though there is no log message, items sent to the socket_listener are no longer being delivered to an output.

sgreszcz commented 6 years ago

@jgitlin-bt @danielnelson

I should have been a bit clearer. Although socket_listener wasn't complaining in the logs, I wasn't getting any metrics of any type from that Telegraf process to my central InfluxDB server.

epsylonix commented 6 years ago

@danielnelson my Telegraf config includes a UDP socket_listener, the basicstats aggregator, and the InfluxDB output. When debug is enabled and Telegraf hangs, I see that it just stops writing logs; metrics are not published either.

I tried to find a way to reliably make it hang; this does it in a matter of seconds:

script: first start Telegraf with a UDP socket_listener, the basicstats aggregator, and output to InfluxDB (a sketch follows below), then run the script with GNU parallel: parallel --no-notice -n0 -j0 ./telegraf_spammer.sh ::: {1..4}
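The Telegraf side of the repro looks roughly like this (illustrative addresses and values; my actual config may differ slightly):

[[inputs.socket_listener]]
  # UDP listener the spammer script sends line protocol to
  service_address = "udp://:8094"

[[aggregators.basicstats]]
  period = "30s"
  drop_original = false

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"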

stack trace

By the way, for some reason, when the cardinality is decreased to 3 tags with 5 distinct values each, Telegraf works fine.

danielnelson commented 6 years ago

@epsylonix Thank you, I was able to reproduce the deadlock with this script. When using the code in #3656 the deadlock does not occur for me, so that seems promising, but I will wait for @jgitlin-bt to report back on his long-term testing.

Could you take a look at the pull request and see if it works for you as well?

jgitlin-bt commented 6 years ago

Apologies for the delay; I need to build a FreeBSD version and have been busy with other sprint work. I'll try to get that build tested early next week.

adityask2 commented 6 years ago

@danielegozzi I tested with the fix in #3656. However, the issue ("took longer to collect than collection interval") is not fixed.

jgitlin-bt commented 6 years ago

@adityask2 please post a stack trace by sending Telegraf a SIGQUIT. Note that "plugin took longer to collect than collection interval" seems to be a symptom of this issue, not the issue itself. Are you using an aggregator?

Disabling the aggregator resolved the issue for me so far (I've had more than 2 weeks with no missing datapoints).

I built the patched version from #3656 and re-enabled the aggregator; so far (less than 24 hours) so good, but I'll need more time to be 100% sure.

jgitlin-bt commented 6 years ago

@epsylonix what's in telegraf_spammer.sh? I could use that to speed up my testing. (I did start to see this issue in November after doubling the amount of StatsD data I was sending...)

epsylonix commented 6 years ago

@jgitlin-bt it's a simple bash script; the link to the gist is in my previous post.

It sends UDP datagrams in InfluxDB line protocol with some random data. When you run it with parallel, several processes are used and the traffic increases. I don't know how much traffic Telegraf is designed to handle; on my machine the script, run via parallel, generates around 28,000 requests per second. There is no throttling, so it basically uses all the available processing power to generate test data.

It seems that heavy traffic is not the only condition for this deadlock: with 3 tags each having 5 distinct values it runs fine, but one additional tag leads to a deadlock within seconds. I haven't had time to build a branch with this fix yet, but I do have a current stable build running on a test server; it doesn't process much data, yet it occasionally stops reporting metrics with no errors in the logs, very similar to the issue this script reproduces.
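To give an idea of the payload (the measurement and tag names here are made up; the gist may use different ones), each datagram is a single line-protocol point along the lines of:

spam_metric,tag1=a,tag2=b,tag3=c,tag4=d value=0.42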

jgitlin-bt commented 6 years ago

Thanks! I missed the link in your post; I see it now. Tag cardinality may also be related: when I bumped up the StatsD input data in November, the new data had hundreds of distinct tag values for a new tag.

adityask2 commented 6 years ago

@jgitlin-bt I'm attaching the stack trace and conf file: telegraf-dump.txt telegraf.txt

Can you please tell me which build you used so that I can re-test with it?

jgitlin-bt commented 6 years ago

Interesting, @adityask2, so you aren't using an aggregator. I suspect you have a different issue than I do, as for me the problem is clearly tied to [[aggregators.minmax]] (since disabling it, I have had no issues), so you should probably submit a new bug report.

Do you see this in your log file?

E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)

I am running Telegraf v1.4.4 and now also Telegraf v1.6.0~6560edfa (git: aggregator-loop-deadlock 6560edf)

adityask2 commented 6 years ago

@jgitlin-bt Yes, I'm seeing "took longer to collect than collection interval" in the log. There seem to be multiple issues reported on this; however, I will file another one.

jgitlin-bt commented 6 years ago

@adityask2 interesting, your issue has the same symptoms but unlike mine is not tied to an aggregator and not fixed by #3656 -- good luck!

@danielnelson I am pleased to report that I have had no issues using #3656 and that (or disabling the aggregator) appears to have resolved my issue. Thank you so much!