Closed (ajw1980 closed this issue 5 months ago)
execd process should always be restarted
From the code, we continuously try to restart the process, except when starting the command itself returns an error. If Telegraf cannot start an input plugin, or in this case cannot start the execd command you asked for, then Telegraf will fail. This is the expected behavior in general, as it makes little sense to continue running if we cannot start a plugin that you expect to provide data.
However, we have other feature requests for per-plugin settings that would allow ignoring errors on startup, and we can do that here as well.
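To make the distinction above concrete, here is a minimal Go sketch of a restart loop that keeps restarting a child after it exits but gives up as soon as a start attempt fails. This is not Telegraf's actual code: `supervise`, `execStart`, `startFunc`, `process`, and the binary name are hypothetical names chosen for illustration, and the real plugin may be structured quite differently.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// process is the minimal view of a running child that the loop needs.
// *exec.Cmd satisfies it. (Hypothetical abstraction, not a Telegraf type.)
type process interface {
	Wait() error
}

// startFunc launches the child process (hypothetical abstraction).
type startFunc func() (process, error)

// execStart builds a startFunc around os/exec.
func execStart(name string, args ...string) startFunc {
	return func() (process, error) {
		cmd := exec.Command(name, args...)
		if err := cmd.Start(); err != nil {
			return nil, err
		}
		return cmd, nil
	}
}

// supervise restarts the child each time it exits, at most maxRuns times,
// but returns immediately if a start attempt itself fails.
// It reports how many times the child was started successfully.
func supervise(start startFunc, delay time.Duration, maxRuns int) (int, error) {
	runs := 0
	for runs < maxRuns {
		p, err := start()
		if err != nil {
			// Start failure: no retry, mirroring the behavior described above.
			return runs, fmt.Errorf("start failed, giving up: %w", err)
		}
		runs++
		_ = p.Wait() // the exit status does not matter here; we restart anyway
		time.Sleep(delay)
	}
	return runs, nil
}

func main() {
	// A binary name that should not exist, to trigger the start-failure path.
	runs, err := supervise(execStart("no-such-execd-binary"), time.Second, 3)
	fmt.Println("successful starts:", runs)
	fmt.Println("error:", err)
}
```

In this sketch a child that starts and then crashes is relaunched after the delay, while a child that cannot be launched at all ends the loop on the first attempt.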
Trying to reproduce the issue really gives me a headache. It seems like the startup only fails in cases where the OS hard-terminates the executed process. Such events are severe, like out-of-memory conditions or maybe segfaults. I don't think we should handle those cases, as the kernel rightfully terminated the process, maybe even in an uncontrolled way (as in the OOM case).
Yeah, it would most likely be memory issues on the system that cause this situation. Maybe there should be an option to just shut down Telegraf with an error state if execd fails? This would at least make it more obvious that something failed, and on systemd hosts the service would get restarted if configured to do so.
@ajw1980 please test the binary in PR #15271, available as soon as CI has finished the tests, and let me know if this fixes your issue. The new option is called stop_on_error and needs to be set to true for your use case.
I downloaded it and added the config option. It seems like this only stops the execd plugin, not Telegraf itself, right?
Exactly. As things stand now, once Telegraf has started it is impossible to stop it completely.
@ajw1980 are you good with the fix?
That option doesn't really address this issue. The execd input will already not relaunch the command if there is some sort of system (out of memory) error.
@ajw1980, unfortunately, there is no way to kill Telegraf after a failure when calling the script. With @srebhan's PR we could kill the plugin, but not all of Telegraf.
We are left with taking that PR as an additional option or closing this as won't fix unless some other solution comes up.
Relevant telegraf.conf
Logs from Telegraf
System info
telegraf 1.21.3 fedora
Docker
No response
Steps to reproduce
Create an execd input. Have the system fail in a way that processes don't start properly (out of memory)
Expected behavior
execd process should always be restarted
Actual behavior
execd process was not restarted.
Additional info
In certain instances where a system problem prevents processes from starting, an execd plugin process will not get restarted. In this case the machine ran out of memory and the execd process stopped. It seems that if the process starts and then exits, Telegraf restarts it, but if Telegraf fails to even start the process, it is never restarted again.
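For illustration only, the behavior requested here could be sketched in Go as a loop that treats a failed start the same way as an exit and simply tries again after a delay. This is an assumption about what "always restarted" would mean, not how Telegraf behaves; `keepAlive`, `startFunc`, and `errSimulatedOOM` are hypothetical names, and the launcher is simulated rather than running a real command.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSimulatedOOM stands in for the error a launch might return while
// the host is out of memory (hypothetical, for the demo only).
var errSimulatedOOM = errors.New("cannot allocate memory")

// startFunc is a hypothetical stand-in for launching the execd command.
// It returns a wait function for the running process, or an error if
// the launch itself failed.
type startFunc func() (wait func() error, err error)

// keepAlive makes exactly `attempts` launch attempts. Unlike a loop that
// gives up on start errors, it retries failed starts after the same delay
// it uses between normal restarts.
func keepAlive(start startFunc, delay time.Duration, attempts int) (started, failed int) {
	for i := 0; i < attempts; i++ {
		wait, err := start()
		if err != nil {
			failed++
			fmt.Printf("start failed (attempt %d): %v\n", i+1, err)
		} else {
			started++
			_ = wait() // block until the process exits, then restart
		}
		time.Sleep(delay)
	}
	return started, failed
}

func main() {
	calls := 0
	// Simulated launcher: the first two starts fail, later ones succeed
	// and exit immediately.
	flaky := func() (func() error, error) {
		calls++
		if calls <= 2 {
			return nil, errSimulatedOOM
		}
		return func() error { return nil }, nil
	}
	started, failed := keepAlive(flaky, 10*time.Millisecond, 4)
	fmt.Printf("started=%d failed=%d\n", started, failed)
}
```

With the simulated launcher, the first two attempts fail and the remaining two succeed, so the plugin recovers once the system does. A real implementation would probably also want a backoff or an upper bound on retries, which this sketch leaves out.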