influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.63k stars 5.58k forks source link

Hanging exec plugin blocks metrics gathering #13935

Closed tubemeister closed 1 year ago

tubemeister commented 1 year ago

Relevant telegraf.conf

[[inputs.exec]]
    commands = [ "/usr/local/sbin/telegraf_test" ]
    timeout = "30s"

/usr/local/sbin/telegraf_test:
#!/bin/bash
date  >> /tmp/testlog.start;
sleep 10000;
date  >> /tmp/testlog.stop;

Logs from Telegraf

==> /var/log/telegraf/telegraf.log <==
2023-09-18T10:13:28Z D! [outputs.amqp] Connecting to "amqp://..."
2023-09-18T10:13:28Z D! [outputs.amqp] Connected to "amqp://..."
2023-09-18T10:13:28Z D! [agent] Successfully connected to outputs.amqp
2023-09-18T10:13:28Z D! [agent] Starting service inputs
2023-09-18T10:13:38Z D! [outputs.amqp] Wrote batch of 19 metrics in 544.01µs
2023-09-18T10:13:38Z D! [outputs.amqp] Buffer fullness: 0 / 10000 metrics
2023-09-18T10:13:40Z D! [inputs.exec] Previous collection has not completed; scheduled collection skipped
2023-09-18T10:13:40Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s
2023-09-18T10:13:48Z D! [outputs.amqp] Wrote batch of 28 metrics in 723.617µs
2023-09-18T10:13:48Z D! [outputs.amqp] Buffer fullness: 0 / 10000 metrics
2023-09-18T10:13:50Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s
2023-09-18T10:13:50Z D! [inputs.exec] Previous collection has not completed; scheduled collection skipped
2023-09-18T10:13:58Z D! [outputs.amqp] Wrote batch of 28 metrics in 489.838µs
2023-09-18T10:13:58Z D! [outputs.amqp] Buffer fullness: 0 / 10000 metrics
2023-09-18T10:14:00Z D! [inputs.exec] Previous collection has not completed; scheduled collection skipped
2023-09-18T10:14:00Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s
2023-09-18T10:14:05Z E! [agent] Error killing process: os: process already finished
2023-09-18T10:14:08Z D! [outputs.amqp] Wrote batch of 28 metrics in 753.607µs
2023-09-18T10:14:08Z D! [outputs.amqp] Buffer fullness: 0 / 10000 metrics
2023-09-18T10:14:10Z D! [inputs.exec] Previous collection has not completed; scheduled collection skipped
2023-09-18T10:14:10Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s
...
2023-09-18T10:35:30Z D! [inputs.exec] Previous collection has not completed; scheduled collection skipped
2023-09-18T10:35:30Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s

System info

Telegraf 1.28.1, Ubuntu 22.04 LTS

Docker

No response

Steps to reproduce

  1. Restart Telegraf with above exec plugin
  2. Tail -f /tmp/testlog.start
  3. testlog shows the exec plugin is run one time only
  4. After 30s the Telegraf timeout triggers a process kill
  5. The base process is killed but the subprocess (in this case 'sleep') isn't killed
  6. Exec process is still not executing as shown in testlog.start
  7. Telegraf needs a hard restart to resolve the situation

(In this test case the 'sleep' will eventually resolve itself without user interaction, but in production this is obviously not a 'sleep' but something else hanging indefinitely.)

Expected behavior

Expected behaviour is that the timeout kills the hanging process and reruns it on the next iteration.

Actual behavior

The timeout doesn't seem to kill the entire exec process, and said process is not rerun after the timeout.

Additional info

This might be related to #13913 but not sure so I've made a new report.

We have a ton of custom exec plugins for gathering metrics, what happened in this case is some service went out of memory and was shot down by OOMkiller, the metrics gathering process that connects to this service is now hanging indefinitely (>24h) and Telegraf does not recover from this without manual intervention. Running the plugin again would reconnect to the restarted service, but the exec plugin is skipped indefinitely.

In the test case this would obviously never work because it's a sleep but it clearly shows the test plugin is never run again even after the Telegraf log claims the process has been killed.

Checking 'ps aux |grep telegraf' before and after the timeout shows that the 'telegraf_test' process is killed but the sleep remains. The final line in the test plugin ('testlog.stop') never happens.

Other inputs are reporting data, it's just this exec plugin that isn't rerun. (Which makes it different from #13913)

powersj commented 1 year ago

2023-09-18T10:14:05Z E! [agent] Error killing process: os: process already finished

This message is the timeout kicking in, however the plugin is getting stuck at the call to c.Wait(). It looks like we do not attempt to clean up, terminate, or kill the grandchildren processes if they exist.

I have put up #13937 with a potential fix, it would be helpful if you could try running the artifacts in that issue, and see if it resolves in your other scenarios, beyond the usage of sleep. I have only done some limited testing with this and scenarios on my amd64/linux system.

Thanks

tubemeister commented 1 year ago

That was quick. :-)

I've just tried it on the test case with the sleep command, and there it works fine now, timing out every 30 seconds.

Testing it in the real scenario is a bit tricky as that involves an unscheduled production outage the reasons for which aren't entirely understood yet... ;-)