influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

feat(inputs.execd): allow failures on cmd start #14244

Closed (ajw1980 closed this 5 months ago)

ajw1980 commented 11 months ago

Relevant telegraf.conf

[[inputs.execd]]
  command = ["/opt/adm/sbin/status.py"]
  signal = "none"
  restart_delay = "10s"
  data_format = "influx"

Logs from Telegraf

Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "Traceback (most recent call last):"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 248, in <module>"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    main()"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 240, in main"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    xrst.poll_status()"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 49, in poll_status"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    self.get_systemd_status()"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 163, in get_systemd_status"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    pipe = Popen(systemctl_cmd, stdout=PIPE, stderr=DEVNULL, universal_newlines=True)"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 709, in __init__"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session)"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 1275, in _execute_child"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session, preexec_fn)"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "OSError: [Errno 12] Cannot allocate memory"
Nov  3 09:38:43 machine telegraf[24194]: 2023-11-03T14:38:43Z E! [inputs.execd] Process /opt/adm/sbin/status.py exited: exit status 1
Nov  3 09:38:43 machine telegraf[24194]: 2023-11-03T14:38:43Z I! [inputs.execd] Restarting in 10s...
Nov  3 09:38:53 machine telegraf[24194]: 2023-11-03T14:38:53Z I! [inputs.execd] Starting process: /opt/adm/sbin/status.py []
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "Traceback (most recent call last):"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 248, in <module>"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    main()"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 240, in main"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    xrst.poll_status()"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 49, in poll_status"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    self.get_systemd_status()"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 163, in get_systemd_status"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    pipe = Popen(systemctl_cmd, stdout=PIPE, stderr=DEVNULL, universal_newlines=True)"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 709, in __init__"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session)"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 1275, in _execute_child"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session, preexec_fn)"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "OSError: [Errno 12] Cannot allocate memory"
Nov  3 09:39:43 machine telegraf[24194]: 2023-11-03T14:39:43Z E! [inputs.execd] Process /opt/adm/sbin/status.py exited: exit status 1
Nov  3 09:39:43 machine telegraf[24194]: 2023-11-03T14:39:43Z I! [inputs.execd] Restarting in 10s...
Nov  3 09:39:54 machine telegraf[24194]: 2023-11-03T14:39:53Z I! [inputs.execd] Starting process: /opt/adm/sbin/status.py []
Nov  3 09:39:54 machine telegraf[24194]: 2023-11-03T14:39:54Z E! [inputs.execd] Process quit with message: error starting process: fork/exec /opt/adm/sbin/status.py: cannot allocate memory

System info

telegraf 1.21.3 fedora

Docker

No response

Steps to reproduce

Create an execd input. Have the system fail in a way that prevents processes from starting properly (for example, by running out of memory).

Expected behavior

execd process should always be restarted

Actual behavior

execd process was not restarted.

Additional info

In certain instances where a system problem prevents processes from starting, an execd plugin process will not get restarted. In this case the machine ran out of memory and the execd process stopped. It seems that if the process starts and then exits, telegraf will restart it, but if telegraf fails to even start the process, it will never be restarted again.

powersj commented 11 months ago

"execd process should always be restarted"

From the code, we will continuously try to restart, except on errors from running the cmd start. If telegraf cannot start an input plugin, or in this case, start the execd process that you want us to, then telegraf will fail. This is the expected behavior in general, as it makes little sense to continue running if we cannot start a plugin that you expect to provide data.

However, we have other feature requests for per-plugin settings that would allow ignoring errors on startup, and we can do that here as well.

srebhan commented 6 months ago

Trying to reproduce the issue really gives me a headache. It seems like the startup only fails in cases where the OS hard-terminates the executed process. Such events are severe, like out-of-memory conditions or maybe segfaults. I don't think we should handle those cases, as the kernel rightfully terminated the process, maybe even in an uncontrolled way (as in the OOM case).

ajw1980 commented 6 months ago

Yeah, it would most likely be memory issues on the system that cause this situation. Maybe there should be an option to just shut down telegraf with an error state if execd fails? This would at least make it more obvious that something failed, and on systemd hosts the service would get restarted if configured to do so.

srebhan commented 5 months ago

@ajw1980 please test the binary in PR #15271, available as soon as CI has finished the tests, and let me know if this fixes your issue. The new option is called stop_on_error and it needs to be set to true for your use case.

ajw1980 commented 5 months ago

I downloaded it and added the config option. It seems like this only stops the execd plugin, not telegraf itself, right?

srebhan commented 5 months ago

Exactly. Once Telegraf has started, it is currently impossible to stop it completely.

srebhan commented 5 months ago

@ajw1980 are you good with the fix?

ajw1980 commented 5 months ago

That option doesn't really address this issue. The execd input will already not relaunch the command if there is some sort of system (out of memory) error.

powersj commented 5 months ago

@ajw1980,

Unfortunately, there is no way to kill Telegraf from a failure when calling the script. With @srebhan's PR we could kill the plugin, but not all of Telegraf.

We are left with taking that PR as an additional option or closing this as won't fix unless some other solution comes up.