influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Telegraf stops sending metrics - took longer to collect than collection interval #3697

Closed adityask2 closed 6 years ago

adityask2 commented 6 years ago

Bug report

Telegraf seems to be choking after a certain amount of time. I'm not using aggregators. It seems there are other issues reported along the same lines - https://github.com/influxdata/telegraf/issues/3629

Relevant telegraf.conf:

telegraf.txt

System info:

Telegraf v1.6.0~6e240567 (git: master 6e240567), running on CentOS 7

Steps to reproduce:

  1. service telegraf start
  2. wait for some time (varies from system to system) for it to reproduce.

Expected behavior:

Telegraf publishes metrics to MQTT.

Actual behavior:

Telegraf stops publishing metrics seemingly randomly; all input plugins start to fail with:

2018-01-18T16:10:00Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:00Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:00Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval

Additional info:

telegraf-dump.txt

danielnelson commented 6 years ago

This was caused by the mqtt output, I believe this issue I opened upstream is the cause: https://github.com/eclipse/paho.mqtt.golang/issues/185

I should be able to improve the use of timeouts in the output to avoid the hang.
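
For illustration, a token-level timeout with the paho client would look roughly like the sketch below; the package and helper names, QoS, and timeout handling are placeholders, not Telegraf's actual output code. Note that this only bounds the wait on the delivery token; if Publish itself blocks internally, a different fix is needed.

    package mqttout

    import (
        "fmt"
        "time"

        mqtt "github.com/eclipse/paho.mqtt.golang"
    )

    // publishWithTimeout is a hypothetical helper: publish a payload and give up
    // after `timeout` instead of waiting on the delivery token indefinitely.
    func publishWithTimeout(client mqtt.Client, topic string, payload []byte, timeout time.Duration) error {
        token := client.Publish(topic, 0, false, payload)
        // WaitTimeout returns false if delivery did not complete within timeout.
        if !token.WaitTimeout(timeout) {
            return fmt.Errorf("mqtt publish to %q timed out after %s", topic, timeout)
        }
        return token.Error()
    }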

adityask2 commented 6 years ago

Hi, I tested with the timeout fix. Telegraf was able to send data successfully for 8 days. However, after 8 days, 44 of the machines choked. Attaching the telegraf conf: telegraf.conf.txt

danielnelson commented 6 years ago

Do you think you can try to get another stack dump by killing the process with SIGQUIT?

adityask2 commented 6 years ago

I'm running the process in the background and had closed the session. I used the kill -3 command, but I'm not getting the stack dump. Is there any other way to get it?

danielnelson commented 6 years ago

I don't know of any other way. Were there any log messages when it started, or just the "took longer to collect than collection interval" errors?

adityask2 commented 6 years ago

Attaching the telegraf log: telegraf.log

Will see if I can get a telegraf dump.

adityask2 commented 6 years ago

Would taking a core dump through gcore be useful?

danielnelson commented 6 years ago

I haven't tried it, but according to this blog it should work.

adityask2 commented 6 years ago

Sharing the dump. telegraf_coredump_131.txt

adityask2 commented 6 years ago

Was able to catch the dump using kill -SIGABRT. This dumps the stack trace to the journal.

danielnelson commented 6 years ago

This stack is trickier than the last one; it doesn't look like we are blocked on a channel like before, though I do see the MQTT library is trying to reconnect.

There have been some fixes to the MQTT library, so I updated to the latest release. It is currently available in the nightly builds; do you think you could retry with this?

adityask2 commented 6 years ago

We are building a custom telegraf. It would be very helpful if you could tell us which commits to take so that we can build it. I can retry with it.

danielnelson commented 6 years ago

Here is the commit; don't forget to run gdm restore:

https://github.com/influxdata/telegraf/commit/5639d5608dd3d0c6f3c8932cc5b0a84ab34421e5

adityask2 commented 6 years ago

@danielnelson I tested it with the latest changes. It didn't even last one day. Attaching the dump and conf file: telegraf-conf.txt telegraf-dump-1.txt

Used the changes you suggested.

bigonese commented 6 years ago

I get some of the same "took longer to collect" messages. We are also sending data upstream with MQTT. But I'm also getting what looks to be some sort of communication error...

2018-02-04T20:39:34Z E! Failed to connect to output mqtt, retrying in 15s, error was 'Network Error : dial tcp: lookup mqtt.redacted.com on [::1]:53: read udp [::1]:39369->[::1]:53: read: connection refused'
2018-02-04T20:38:08Z E! Network Error : dial tcp: lookup mqtt.redacted.com on [::1]:53: read udp [::1]:34352->[::1]:53: read: connection refused

@adityask2 I'm curious if I have the same issue as you... do you see things like this in the logs?

danielnelson commented 6 years ago

@adityask2 Thank you for doing this testing. The issue is in the paho library, whose internals I'm not very familiar with, but here is my analysis:

We are blocked sending a message; there is currently no way in the MQTT library to time out this send:

goroutine 2376 [chan send, 1286 minutes]:
github.com/eclipse/paho%2emqtt%2egolang.(*client).Publish(0xc42019d880, 0xc42026c930, 0x61, 0xc420680001, 0x12b2d80, 0xc420682b20, 0x1346a20, 0xc4206bb4d0)
    /build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/client.go:515 +0x4f8
github.com/influxdata/telegraf/plugins/outputs/mqtt.(*MQTT).publish(0xc4200f0000, 0xc42026c930, 0x61, 0xc42026f800, 0x175, 0x180, 0x0, 0x0)
    /build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/influxdata/telegraf/plugins/outputs/mqtt/mqtt.go:155 +0xe9

The channel in question is read in func outgoing(c *client); I'm assuming a network interruption has caused this function to return.
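
In generic terms the failure mode looks like the sketch below (illustrative Go, not paho's actual types or channel names): once the goroutine draining the outgoing channel exits, every further send blocks forever unless it is guarded by a deadline.

    package example

    import (
        "errors"
        "time"
    )

    // Sketch of the failure mode described above (not paho's actual code): the
    // publisher pushes outgoing packets onto a channel that a separate writer
    // goroutine drains. If that writer exits after a network interruption, every
    // later send blocks forever, which matches the "chan send, 1286 minutes" trace.
    func send(out chan []byte, packet []byte) {
        out <- packet // blocks indefinitely once the reader goroutine is gone
    }

    // A guarded send with a deadline turns the permanent hang into an error.
    func sendWithTimeout(out chan []byte, packet []byte, d time.Duration) error {
        select {
        case out <- packet:
            return nil
        case <-time.After(d):
            return errors.New("mqtt: outgoing queue stalled; connection may be lost")
        }
    }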

This should call reconnect, but is stuck on the c.workers.Wait() WaitGroup, I believe because keepalive has not returned yet:

goroutine 1878 [semacquire, 1337 minutes]:
sync.runtime_Semacquire(0xc42019dba8)
    /build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/runtime/sema.go:47 +0x34
sync.(*WaitGroup).Wait(0xc42019db9c)
    /build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/sync/waitgroup.go:131 +0x7a
github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost(0xc42019d880, 0x1d88d60, 0xc4200140b0)
    /build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/client.go:435 +0x7e

Can't sendPing because we can't acquire the lock in SetRead:

goroutine 14 [semacquire, 1356 minutes]:
sync.runtime_SemacquireMutex(0xc4202b7de4)
    /build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/runtime/sema.go:62 +0x34
sync.(*Mutex).Lock(0xc4202b7de0)
    /build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/sync/mutex.go:87 +0x9d
github.com/eclipse/paho%2emqtt%2egolang.(*timer).SetRead(0xc4202b7de0, 0x2540be401)
    /build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:126 +0x2d
github.com/eclipse/paho%2emqtt%2egolang.sendPing(0xc4202b7de0, 0xc4202b7e60, 0xc42019d880)
    /build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:155 +0x37
github.com/eclipse/paho%2emqtt%2egolang.keepalive(0xc42019d880)
    /build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:104 +0x54d
created by github.com/eclipse/paho%2emqtt%2egolang.(*client).Connect.func1

The lock is taken here in Reset(), and is waiting to read from t.C:

goroutine 27 [chan receive, 1356 minutes]:
github.com/eclipse/paho%2emqtt%2egolang.(*timer).Reset(0xc4202b7de0, 0x6fc23ac00, 0x0)
    /build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:148 +0x113
github.com/eclipse/paho%2emqtt%2egolang.keepalive.func4(0xc420306730, 0xc42019d880, 0xc420013500, 0xc4202b7de0)
    /build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:94 +0x1c5

These last two functions seem to be violating this comment from time.Timer:

// To prevent a timer created with NewTimer from firing after a call to Stop,
// check the return value and drain the channel.
// For example, assuming the program has not received from t.C already:
//
//  if !t.Stop() {
//      <-t.C
//  }
//
// This cannot be done concurrent to other receives from the Timer's
// channel.
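
For reference, a minimal standalone example of the idiom that comment describes (plain time.Timer usage, not the paho code):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        t := time.NewTimer(50 * time.Millisecond)

        // The idiom quoted above: before reusing the timer, stop it and, if it had
        // already fired, drain the single value sitting in t.C. No other goroutine
        // may be receiving from t.C at the same time, which is the constraint the
        // paho keepalive code appears to violate.
        if !t.Stop() {
            <-t.C
        }
        t.Reset(200 * time.Millisecond)

        fmt.Println("timer fired:", <-t.C)
    }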

So, the keepalive function in the paho library needs to be corrected to avoid deadlocking, but it looks like we can also skip this function by setting the right options. I will prepare a build with keepalive turned off.

adityask2 commented 6 years ago

Thanks @danielnelson. Very much appreciated. I will wait for the build.

danielnelson commented 6 years ago

It also appears that the keepalive functionality has been reworked upstream but is currently unreleased, and the comment on the latest commit makes me nervous about moving to a development version: https://github.com/eclipse/paho.mqtt.golang/commit/d896be27a59591819497d0ae3bde7db17b8a26e3#diff-1ce7d97ddd43423f168d3cad79898551

danielnelson commented 6 years ago

Can you retest with https://github.com/influxdata/telegraf/pull/3779/commits/2a718dfd70a97f82d1734500d5a81a4aa935a243?

adityask2 commented 6 years ago

Ok. To consolidate - I will be testing with 3 changes: the timeout fix, the paho library changes, and the keepalive fix.

adityask2 commented 6 years ago

Keep alive interval - this is the amount of time after which a ping request is sent by the server to the client. If the connection between server and client breaks, the server wouldn't know it. Is setting it to 0 a good option? If not, we can probably increase the timeout.

danielnelson commented 6 years ago

My reading of the client code was that the keep alive interval is the time for the client to ping the server, so it would help keep the client connected and potentially detect disconnections faster. Since we are sending data regularly I think even without it we should detect disconnection on the next send.

However, we have to set it to zero if we want to disable the keepalive routine completely, and that routine looks to be the cause of the deadlocks. I don't think increasing the timeout would help. Another option we could try is the very latest revision in the repo, bab87cde322cf9aec420645d1e9801d0f216f791; it looks like they have rewritten this code, so it may work now.
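
For illustration, disabling the keepalive at the paho client-options level looks roughly like this; the package name, broker address, and connect timeout below are placeholders, not Telegraf's actual settings:

    package example

    import (
        "time"

        mqtt "github.com/eclipse/paho.mqtt.golang"
    )

    func newClient() mqtt.Client {
        opts := mqtt.NewClientOptions()
        opts.AddBroker("tcp://broker.example.com:1883") // placeholder broker address
        // A keep alive interval of zero disables the client's keepalive/ping
        // goroutine entirely, which is the routine implicated in the deadlock.
        // Disconnections are then only noticed on the next publish.
        opts.SetKeepAlive(0)
        opts.SetConnectTimeout(30 * time.Second) // illustrative value
        return mqtt.NewClient(opts)
    }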

adityask2 commented 6 years ago

Running the telegraf build. It has been running successfully for 7 days. Planning to use this build for prod.

For the logging part - is log rotation enabled for Linux and Windows? Planning to run telegraf in quiet mode (need just error messages) in prod.

danielnelson commented 6 years ago

I added the keep alive change for 1.5.3.

Regarding logging, the Linux package installs a logrotate config file to /etc/logrotate.d/telegraf, but in Windows there is no mechanism for log rotation as part of the package.

adityask2 commented 6 years ago

How can I prevent the log from filling up disk space on Windows? I'm using quiet mode.

danielnelson commented 6 years ago

I believe there are versions/clones of logrotate for Windows, but I haven't tried it. Maybe you can get some ideas from the InfluxData Community site.

Windows log rotation docs are tracked in issue https://github.com/influxdata/telegraf/issues/3393; if you come up with a good solution I would love to add it to the official documentation.