tubemeister closed this issue 1 year ago
2023-09-18T10:14:05Z E! [agent] Error killing process: os: process already finished
This message is the timeout kicking in; however, the plugin is getting stuck at the call to c.Wait(). It looks like we do not attempt to clean up, terminate, or kill any grandchild processes if they exist.
I have put up #13937 with a potential fix. It would be helpful if you could try running the artifacts attached to it and see whether it resolves the problem in your other scenarios, beyond the usage of sleep. I have only done some limited testing with this scenario on my amd64/linux system.
Thanks
That was quick. :-)
I've just tried it on the test case with the sleep command, and there it works fine now, timing out every 30 seconds.
Testing it in the real scenario is a bit tricky as that involves an unscheduled production outage the reasons for which aren't entirely understood yet... ;-)
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.28.1, Ubuntu 22.04 LTS
Docker
No response
Steps to reproduce
(In this test case the 'sleep' will eventually resolve itself without user interaction, but in production this is obviously not a 'sleep' but something else hanging indefinitely.)
Expected behavior
Expected behaviour is that the timeout kills the hanging process and reruns it on the next iteration.
Actual behavior
The timeout doesn't seem to kill the entire exec process, and said process is not rerun after the timeout.
Additional info
This might be related to #13913, but I'm not sure, so I've opened a new report.
We have a ton of custom exec plugins for gathering metrics. In this case, a service ran out of memory and was killed by the OOM killer. The metrics-gathering process that connects to this service has now been hanging indefinitely (>24h), and Telegraf does not recover from this without manual intervention. Running the plugin again would reconnect to the restarted service, but the exec plugin is skipped indefinitely.
In the test case this would obviously never work, because it's a sleep, but it clearly shows that the test plugin is never run again, even after the Telegraf log claims the process has been killed.
Checking 'ps aux |grep telegraf' before and after the timeout shows that the 'telegraf_test' process is killed but the sleep remains. The final line in the test plugin ('testlog.stop') never happens.
Other inputs are reporting data, it's just this exec plugin that isn't rerun. (Which makes it different from #13913)