Does the problem persist if you switch to using TCP? I have seen this a few times, but it's pretty rare for it to cause an issue. I have used TCP to send to InfluxDB, so I just wanted to check.
Also, can you turn off the logparser input plugin and check whether that causes any improvement? The logparser uses some heavy regexes which burn considerable CPU.
Also, if you are sending lots of data via statsd, you should increase allowed_pending_messages to 10000 or so.
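Roughly speaking, that setting lives in the statsd input section; a minimal sketch (the service_address shown here is just the plugin's usual default, adjust it to your setup):
[[inputs.statsd]]
  ## Address and port to listen on (illustrative default)
  service_address = ":8125"
  ## Number of received messages allowed to queue up before new ones are dropped
  allowed_pending_messages = 10000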
Just had another crash, same server, less than an hour after restarting Telegraf. logs
@agnivade thanks for the suggestions! As a last resort for debugging I can try TCP, but the overhead of TCP is seriously undesirable in our setup; we require a UDP-based solution.
After the next crash, I'll disable logparser and see if that causes any improvement. (It's one of the most critical plugins for our usage, so I hope that's not the problem!)
I'll increase allowed_pending_messages, though I have been adjusting that already to try and debug this issue with no noticeable effect: allowed_pending_messages was 10,000, and I just set allowed_pending_messages = 50000 to see if that makes any difference.
This may be related to metric_batch_size... I have been playing with that value in an attempt to debug this, and the issue happens less frequently with larger values (e.g. a server with metric_batch_size = 2000 has worked for >7 days).
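For reference, the relevant knobs are in the agent and output sections of telegraf.conf; mine look roughly like this (the endpoint, database and buffer values here are illustrative, not my exact production settings):
[agent]
  interval = "10s"
  ## Maximum number of metrics sent to an output in a single write
  metric_batch_size = 2000
  ## Metrics held in memory while an output is unavailable
  metric_buffer_limit = 20000

[[outputs.influxdb]]
  ## Placeholder UDP endpoint; an http:// URL would send over TCP instead
  urls = ["udp://influxdb.example.com:8089"]
  database = "telegraf"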
TCP should not add much overhead if you do batching properly. Yes, metric_batch_size is certainly an important setting to tweak.
Keep us posted on your changes.
It sounds like for some reason the Accumulator is not being read from anymore and its internal channel is full. In the past I have seen this caused by a stuck output, though I don't see that in your stack trace. I think increasing the pending messages will only delay this message from appearing after processing is stuck.
Can you try disabling the aggregator and see if it helps? Maybe I didn't actually solve #2914
@danielnelson I suspected the aggregator might be the issue also. I disabled it on our 3rd server, and haven't seen a problem on that server since. I just lost metrics from server1, and disabled it there also. Here's the log and stack trace from that server.
Luckily I am not actually using the aggregator, so I really should have it disabled anyway. (I turned it on because it seemed useful, but when I started building dashboards I never actually needed its output.)
I am having this same issue, and have seen multiple related reports, but haven't found anything that has helped resolve it.
I haven't used any plugins other than inputs, but I am still facing this issue.
Please keep us posted if anyone has found a solution for it.
@sujeeth961991 so you're not using the aggregator plugin? If you get a stack trace by sending telegraf a SIGQUIT while it is stuck, does it look similar to mine?
I disabled the aggregator 5 days ago and haven't seen a single issue since, making me think that in my case that may be the issue.
It seems like we're experiencing this too; the aggregator plugin seems to be the problem, because there were no issues before I enabled it. The telegraf process appears to be running, but it doesn't report any metrics and there are no errors in the logs; after a restart it started posting to InfluxDB again. I'll post a stack trace the next time it happens.
@jgitlin-bt I have restarted my telegraf agent. I will post a stack trace the next time this happens.
I had the exact same problem today where telegraf died on one server over the holiday shutdown. I had to restart it. What type of debugging data should I collect next time it happens before restarting?
@sgreszcz Kill Telegraf with SIGQUIT and it should print a stack trace. Are you also using an aggregator?
Since disabling the aggregator I have had 6 days with no lockups, which is a record. Over the past 30 days I never made it more than 5 days without at least one server deadlocking. I'm going to start re-enabling other inputs I had disabled (like disk, network, etc.) and see if things still look OK. Updates to follow.
@danielnelson I'm not using the aggregator, just collectd (socket listener), docker and internal. I have the same configuration on six servers collecting the same data (but from different devices), and only one of my telegraf instances has blown out so far with "took longer to collect than collection interval (10s)". It seems to be only the local inputs that are complaining, not the collectd forwarder using the socket_listener input.
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.system]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.internal]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.processes]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.linux_sysctl_fs]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.diskio]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.kernel]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.docker]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.disk]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.swap]: took longer to collect than collection interval (10s)
Jan 09 12:16:55 cdc-alln-001 telegraf[29068]: 2018-01-09T18:16:55Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (10s)
Loaded inputs: inputs.linux_sysctl_fs inputs.disk inputs.diskio inputs.swap inputs.system inputs.docker inputs.net inputs.socket_listener inputs.cpu inputs.kernel inputs.mem inputs.processes inputs.internal
That's interesting @sgreszcz, because for me every single input stops working, even the internal telegraf one when I have that enabled. The host stops publishing metrics altogether. We may have two separate issues; I am not sure. (I don't use collectd or docker, but I do use the statsd input.)
I wouldn't expect socket_listener to complain, since it is a "service input". This means that instead of being called each interval it is event driven: it adds new metrics whenever it receives data on its socket.
I suspect that, even though there is no log message, you cannot send items to the socket_listener and have them delivered to an output.
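For reference, a typical UDP socket_listener section looks something like this (the port and data format are example values, not taken from your config):
[[inputs.socket_listener]]
  ## Service input: metrics are added whenever a datagram arrives,
  ## rather than being collected on the agent interval
  service_address = "udp://:8094"
  ## Parse incoming data as InfluxDB line protocol
  data_format = "influx"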
@jgitlin-bt @danielnelson
I should have been a bit clearer. Although socket_listener wasn't complaining in the logs, I wasn't getting any metrics of any type via that Telegraf process to my central InfluxDB server.
@danielnelson My telegraf config includes a UDP socket_listener, the basicstats aggregator and the influx output. When debug is enabled and telegraf hangs, I see that it just stops writing logs; metrics are not published either.
I tried to find a way to reliably make it hang, and this does it in a matter of seconds: script
First start telegraf with a UDP socket_listener, the basicstats aggregator and output to influx, then run the script with parallel:
parallel --no-notice -n0 -j0 ./telegraf_spammer.sh ::: {1..4}
By the way, for some reason, when I decrease the cardinality to 3 tags with 5 distinct values each, telegraf works fine.
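The aggregator and output parts of that test setup are nothing special; roughly like this, paired with a UDP socket_listener as shown earlier (the period, URL and database here are illustrative, not my exact values):
[[aggregators.basicstats]]
  ## Aggregation window (illustrative value)
  period = "30s"
  ## Keep the original metrics in addition to the aggregates
  drop_original = false

[[outputs.influxdb]]
  ## Placeholder endpoint and database
  urls = ["http://localhost:8086"]
  database = "telegraf"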
@epsylonix Thank you, I was able to reproduce the deadlock with this script. When using the code in #3656 the deadlock does not occur for me, so that seems promising, but I will wait for @jgitlin-bt to report back on his long-term testing.
Could you take a look at the pull request and see if it works for you as well?
Apologies for the delay; I need to build a FreeBSD version and have been busy with other sprint work. I'll try to get that build tested early next week.
@danielegozzi I tested with the fix in #3656. However, the issue ("took longer to collect than collection interval") is not fixed.
@adityask2 please post a stack trace by sending telegraf a SIGQUIT. Note that "plugin took longer to collect than collection interval" seems to be a symptom of this issue, not the issue itself. Are you using an aggregator?
Disabling the aggregator has resolved the issue for me so far (I've had >2 weeks with no missing datapoints).
I built the patched version from #3656 and re-enabled the aggregator; so far (less than 24 hours) so good, but I'll need more time to be 100% sure
@epsylonix what's in telegraf_spammer.sh? I could use that to speed up my testing. (I did start to see this issue in November after doubling the amount of StatsD data I was sending...)
@jgitlin-bt it's a simple bash script, the link to the gist is in my previous post.
It sends UDP datagrams using the influx line protocol with some random data. When you run it with parallel, several threads are used and the traffic increases. I don't know how much traffic telegraf is designed to handle; on my machine, run with parallel, the script generates around 28,000 requests per second. There is no throttling, so it basically uses all the processing power available to generate test data.
It seems that a lot of traffic is not the only condition for this deadlock: with 3 tags each having 5 distinct values it runs fine, but one additional tag leads to a deadlock within seconds. I haven't had time to build a branch with this fix yet, but I have a current stable build running on a test server; it doesn't process much data, but it does occasionally stop reporting metrics with no errors in the logs, very similar to the issue this script reproduces.
Thanks! I missed the link in your post; I see it now. Tag cardinality may also be related; when I bumped up the StatsD input data in November, the new data had hundreds of distinct tag values for a new tag.
@jgitlin-bt I'm attaching the stack trace and conf file: telegraf-dump.txt telegraf.txt
Can you please tell me which build you used so that I can re-test with it?
Interesting, @adityask2, so you aren't using an aggregator. I suspect you have a separate issue from mine, since for me the problem is clearly related to [[aggregators.minmax]] (when I disabled it, the issues stopped), so you should probably submit a new bug report.
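For reference, a stock [[aggregators.minmax]] block looks roughly like this (the values shown are the plugin's documented defaults, not necessarily the exact settings in my config):
[[aggregators.minmax]]
  ## Emit min/max aggregates over this window
  period = "30s"
  ## Whether to drop the original metrics after aggregating
  drop_original = false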
Do you see this in your log file?
E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
I am running Telegraf v1.4.4 and now also Telegraf v1.6.0~6560edfa (git: aggregator-loop-deadlock 6560edf).
@jgitlin-bt Yes, I'm seeing "took longer to collect than collection interval" in the log. There seem to be multiple issues reported on this; however, I will file another one.
@adityask2 interesting, your issue has the same symptoms but, unlike mine, is not tied to an aggregator and is not fixed by #3656 -- good luck!
@danielnelson I am pleased to report that I have had no issues using #3656 and that (or disabling the aggregator) appears to have resolved my issue. Thank you so much!
Bug report
After a seemingly random amount of time, Telegraf stops publishing metrics to InfluxDB over UDP. I have been experiencing this issue since Nov 2016 on both Telegraf 1.3.x and 1.4.4 on FreeBSD, on three separate servers. In the telegraf log, all collectors start to fail with "took longer to collect than collection interval (10s)".
I can't see anything unusual or interesting published from the Telegraf internal metrics.
This same issue has been reported in #3318, #2183, #2919, #2780 and #2870, but those issues have all either been abandoned by the reporter or become confused with several separate issues; I am opening a new issue for my specific problem, but if it's a duplicate (#3318 seems to be the closest) then please feel free to close it.
Relevant telegraf.conf:
telegraf.conf
System info:
Telegraf v1.4.4 (git: unknown unknown)
running on FreeBSD 10.3-RELEASE-p24

Steps to reproduce:
service telegraf start
Wait a $random time period

Expected behavior:
Telegraf publishes metrics to InfluxDB server over UDP
Actual behavior:
Telegraf stops publishing metrics seemingly randomly, and all input plugins start to fail with "took longer to collect than collection interval (10s)" errors.
Additional info:
Full logs and stack trace
Earlier occurrence
Grafana snapshot of Internal Telegraf metrics