influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

[[inputs.systemd_units]] no metric for inactive (dead) and disabled service #14763

Closed 1tft closed 8 months ago

1tft commented 9 months ago

Relevant telegraf.conf

/home/myuser/mytest.conf includes only

[[inputs.systemd_units]]
  pattern = "telegraf*"
  subcommand = "show"

Logs from Telegraf

nothing

System info

Rocky Linux 8.7 systemd v239 with latest Telegraf 1.30.0~03782eb70 build incl. PR https://github.com/influxdata/telegraf/pull/14539

Docker

No response

Steps to reproduce

inputs.systemd_units does not produce a metric for a disabled systemd service that is inactive (dead).

Steps to reproduce: sudo systemctl disable telegraf.service

sudo systemctl status telegraf.service
You should see a disabled service which is inactive (dead).
● telegraf.service - Telegraf
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://github.com/influxdata/telegraf

Run example config:
sudo telegraf --config /home/myuser/mytest.conf --test --debug
No metric is printed out. That's not good, and I think it's a bug.

When you execute sudo systemctl enable telegraf.service and then sudo telegraf --config /home/myuser/mytest.conf --test --debug, Telegraf produces the correct metric:
> systemd_units,active=inactive,host=localhost.domain,load=loaded,name=telegraf.service,sub=dead,uf_preset=disabled,uf_state=enabled active_code=2i,load_code=0i,pid=0i,restarts=0i,status_errno=0i,sub_code=1i 1707768645000000000

Expected behavior

Telegraf (inputs.systemd_units) should create a metric like:
> systemd_units,active=inactive,host=localhost.domain,load=loaded,name=telegraf.service,sub=dead,uf_preset=disabled,uf_state=disabled active_code=2i,load_code=0i,pid=0i,restarts=0i,status_errno=0i,sub_code=1i 1707768645000000000

Actual behavior

No metric is printed out for a disabled service which is inactive (dead).

Additional info

The issue also exists with subcommand = "list-units" and with earlier Telegraf versions. For a failed and disabled service (telegraf in this example), the [[inputs.systemd_units]] config does work. Example:

sudo systemctl disable telegraf.service

Make telegraf.conf invalid and restart the telegraf service (systemctl restart telegraf). (Of course, you can use any other systemd service and create the same scenario.)

sudo systemctl status telegraf.service
You should see a disabled service which is failed (Result: exit-code).
● telegraf.service - Telegraf
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code).
     Docs: https://github.com/influxdata/telegraf
  Process: 1562 ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory=/etc/telegraf/telegraf.d/
 Main PID: 1562 (code=exited, status=1/FAILURE)

sudo telegraf --config /home/myuser/mytest.conf --test --debug
> systemd_units,active=failed,host=localhost.domain,load=loaded,name=telegraf.service,sub=failed,uf_preset=disabled,uf_state=disabled active_code=3i,load_code=0i,pid=0i,restarts=5i,status_errno=0i,sub_code=12i 1707768645000000000

1tft commented 9 months ago

@FlashSystems can you confirm this issue in Nuremberg too?

powersj commented 9 months ago

subcommand = "view"

This is not a valid subcommand. It is list-units or show.

subcommand = "list-units"

From the man page about list-units:

Note that this operation only displays runtime status, i.e. information about the current invocation of the unit (if it is running) or the most recent invocation (if it is not running anymore, and has not been released from memory).

You may not see this with list-units, but you should with show

pattern = "telegraf*"

fwiw I would have expected you to put telegraf without the wild-card, as that appears to fail given the name is exactly telegraf, so that may be an actual issue.

This is with pattern = "telegraf" and subcommand = "show"

2024-02-12T21:10:25Z I! Loading config: config.toml
2024-02-12T21:10:25Z I! Starting Telegraf 1.30.0-bb27c696 brought to you by InfluxData the makers of InfluxDB
2024-02-12T21:10:25Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 61 outputs, 6 secret-stores
2024-02-12T21:10:25Z I! Loaded inputs: systemd_units
2024-02-12T21:10:25Z I! Loaded aggregators: 
2024-02-12T21:10:25Z I! Loaded processors: 
2024-02-12T21:10:25Z I! Loaded secretstores: 
2024-02-12T21:10:25Z I! Loaded outputs: file
2024-02-12T21:10:25Z I! Tags enabled: host=j1
2024-02-12T21:10:25Z D! [agent] Initializing plugins
2024-02-12T21:10:25Z D! [agent] Connecting outputs
2024-02-12T21:10:25Z D! [agent] Attempting connection to [outputs.file]
2024-02-12T21:10:25Z D! [agent] Successfully connected to outputs.file
2024-02-12T21:10:25Z D! [agent] Starting service inputs
2024-02-12T21:10:25Z E! [inputs.systemd_units] Error in plugin: error 'strconv.Atoi: parsing "infinity": invalid syntax' parsing field 'MemoryAvailable'. Not an integer value
2024-02-12T21:10:25Z D! [agent] Stopping service inputs
2024-02-12T21:10:25Z D! [agent] Input channel closed
2024-02-12T21:10:25Z I! [agent] Hang on, flushing any cached metrics before shutdown
systemd_units,active=inactive,host=j1,load=loaded,name=telegraf.service,sub=dead,uf_preset=enabled,uf_state=disabled restarts=0i,load_code=0i,active_code=2i,sub_code=1i,pid=0i,status_errno=0i 1707772226000000000
2024-02-12T21:10:25Z D! [outputs.file] Wrote batch of 1 metrics in 28.49µs
2024-02-12T21:10:25Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2024-02-12T21:10:25Z I! [agent] Stopping running outputs
2024-02-12T21:10:25Z D! [agent] Stopped Successfully
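
The E! line in the log above shows an integer parse failing because the MemoryAvailable property carried the value "infinity". As a hedged illustration only (this is Python, not the plugin's Go code, and the function name is hypothetical), a tolerant parse could treat such values as "no number available" instead of raising an error. Returning None here is an assumption about the desired handling.

```python
from typing import Optional


def parse_systemd_int(raw: str) -> Optional[int]:
    """Parse a numeric property value; return None when it is not a number."""
    text = raw.strip()
    if text == "infinity":
        # Seen in the log above for MemoryAvailable; this sketch treats it
        # as "no finite value" rather than as an error.
        return None
    try:
        return int(text)
    except ValueError:
        # Any other non-numeric text is also reported as missing.
        return None
```

With this approach, a single unparsable property would not have to surface as a plugin error while the remaining fields are still reported.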

This is with pattern = "telegraf*" and subcommand = "show"

2024-02-12T21:18:20Z I! Loading config: config.toml
2024-02-12T21:18:20Z I! Starting Telegraf 1.30.0-bb27c696 brought to you by InfluxData the makers of InfluxDB
2024-02-12T21:18:20Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 61 outputs, 6 secret-stores
2024-02-12T21:18:20Z I! Loaded inputs: systemd_units
2024-02-12T21:18:20Z I! Loaded aggregators: 
2024-02-12T21:18:20Z I! Loaded processors: 
2024-02-12T21:18:20Z I! Loaded secretstores: 
2024-02-12T21:18:20Z I! Loaded outputs: file
2024-02-12T21:18:20Z I! Tags enabled: host=j1
2024-02-12T21:18:20Z D! [agent] Initializing plugins
2024-02-12T21:18:20Z D! [agent] Connecting outputs
2024-02-12T21:18:20Z D! [agent] Attempting connection to [outputs.file]
2024-02-12T21:18:20Z D! [agent] Successfully connected to outputs.file
2024-02-12T21:18:20Z D! [agent] Starting service inputs
2024-02-12T21:18:20Z D! [agent] Stopping service inputs
2024-02-12T21:18:20Z D! [agent] Input channel closed
2024-02-12T21:18:20Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-02-12T21:18:20Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2024-02-12T21:18:20Z I! [agent] Stopping running outputs
2024-02-12T21:18:20Z D! [agent] Stopped Successfully

Finally, with pattern = "telegraf" and subcommand = "list-units", which I do not expect to work:

2024-02-12T21:10:25Z E! [telegraf] Error running agent: input plugins recorded 1 errors
root@j1:~# vim config.toml 
root@j1:~# ./telegraf --config config.toml --once
2024-02-12T21:10:48Z I! Loading config: config.toml
2024-02-12T21:10:48Z I! Starting Telegraf 1.30.0-bb27c696 brought to you by InfluxData the makers of InfluxDB
2024-02-12T21:10:48Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 61 outputs, 6 secret-stores
2024-02-12T21:10:48Z I! Loaded inputs: systemd_units
2024-02-12T21:10:48Z I! Loaded aggregators: 
2024-02-12T21:10:48Z I! Loaded processors: 
2024-02-12T21:10:48Z I! Loaded secretstores: 
2024-02-12T21:10:48Z I! Loaded outputs: file
2024-02-12T21:10:48Z I! Tags enabled: host=j1
2024-02-12T21:10:48Z D! [agent] Initializing plugins
2024-02-12T21:10:48Z D! [agent] Connecting outputs
2024-02-12T21:10:48Z D! [agent] Attempting connection to [outputs.file]
2024-02-12T21:10:48Z D! [agent] Successfully connected to outputs.file
2024-02-12T21:10:48Z D! [agent] Starting service inputs
2024-02-12T21:10:48Z D! [agent] Stopping service inputs
2024-02-12T21:10:48Z D! [agent] Input channel closed
2024-02-12T21:10:48Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-02-12T21:10:48Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2024-02-12T21:10:48Z I! [agent] Stopping running outputs
2024-02-12T21:10:48Z D! [agent] Stopped Successfully

1tft commented 9 months ago

Sorry, I mixed up "show" and "view". Of course I used "show"; using "view" as the subcommand does not work at all (Telegraf correctly reports an error). I have fixed it in my initial post now.

I can confirm your test results. The difference between using the pattern with and without a wildcard is very interesting. With your suggested config without the wildcard, it now works for me as well:

[[inputs.systemd_units]]
  pattern = "telegraf"
  subcommand = "show"

For an inactive (dead) and disabled service:
> systemd_units,active=inactive,host=localhost.localdomain,load=loaded,name=telegraf.service,sub=dead,uf_preset=disabled,uf_state=disabled active_code=2i,load_code=0i,pid=0i,restarts=0i,status_errno=0i,sub_code=1i 1707808926000000000

But now you (and I) wonder why pattern = "telegraf*" does not work for an inactive (dead) and disabled service, yet works for the same service when it is failed and disabled.
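
When comparing runs like the ones in this thread, it can help to pull the tags and fields out of the printed metric instead of reading the line by eye. The following is a minimal sketch of such a helper; the function names are hypothetical, and it assumes there are no escaped spaces, commas, or quotes, which holds for the systemd_units lines shown here but not for line protocol in general.

```python
from typing import Dict, Tuple, Union

Value = Union[int, float, str]


def _parse_value(raw: str) -> Value:
    """Convert one field value: `2i` -> int, `1.5` -> float, else string."""
    if raw.endswith("i") and raw[:-1].lstrip("-").isdigit():
        return int(raw[:-1])
    try:
        return float(raw)
    except ValueError:
        return raw.strip('"')


def parse_metric(line: str) -> Tuple[str, Dict[str, str], Dict[str, Value], int]:
    """Split a simple metric line into name, tags, fields, and timestamp."""
    text = line.strip()
    if text.startswith(">"):
        # `telegraf --test` prefixes each printed metric with "> ".
        text = text[1:].strip()
    head, field_part, timestamp = text.split(" ")
    name, *tag_items = head.split(",")
    tags = dict(item.split("=", 1) for item in tag_items)
    fields = {}
    for item in field_part.split(","):
        key, raw = item.split("=", 1)
        fields[key] = _parse_value(raw)
    return name, tags, fields, int(timestamp)
```

Applied to the metric above, the tags would show active=inactive, sub=dead, and uf_state=disabled, which is the combination this issue is about.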

[[inputs.systemd_units]]
  pattern = "telegraf*"
  subcommand = "show"

systemctl status telegraf
● telegraf.service - Telegraf
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2024-02-13 08:25:56 CET; 2s ago
     Docs: https://github.com/influxdata/telegraf
  Process: 1409 ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS (code=exited, status=1/FAILURE)
 Main PID: 1409 (code=exited, status=1/FAILURE)

sudo telegraf --config /home/myuser/mytest.conf --test --debug
2024-02-13T07:26:09Z I! Loading config: /home/myuser/mytest.conf
2024-02-13T07:26:09Z I! Starting Telegraf 1.30.0-3782eb70 brought to you by InfluxData the makers of InfluxDB
2024-02-13T07:26:09Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 61 outputs, 6 secret-stores
2024-02-13T07:26:09Z I! Loaded inputs: systemd_units
2024-02-13T07:26:09Z I! Loaded aggregators:
2024-02-13T07:26:09Z I! Loaded processors:
2024-02-13T07:26:09Z I! Loaded secretstores:
2024-02-13T07:26:09Z W! Outputs are not used in testing mode!
2024-02-13T07:26:09Z I! Tags enabled: host=localhost.localdomain
2024-02-13T07:26:09Z D! [agent] Initializing plugins
2024-02-13T07:26:09Z D! [agent] Starting service inputs
2024-02-13T07:26:09Z D! [agent] Stopping service inputs
2024-02-13T07:26:09Z D! [agent] Input channel closed
2024-02-13T07:26:09Z D! [agent] Stopped Successfully
> systemd_units,active=failed,host=localhost.localdomain,load=loaded,name=telegraf.service,sub=failed,uf_preset=disabled,uf_state=disabled active_code=3i,load_code=0i,pid=0i,restarts=5i,status_errno=0i,sub_code=12i 1707809169000000000

FlashSystems commented 9 months ago

I did a quick test and can reproduce your results with systemd 225. If a unit is in the state loaded, disabled, inactive (dead), it is only shown if its name is given explicitly on the systemctl command line.

This is independent of telegraf but seems to be a quirk of systemd:

# systemctl status test.service
○ test.service - Test service
     Loaded: loaded (/etc/systemd/system/test.service; disabled; preset: disabled)
     Active: inactive (dead)

These are the command lines used by the systemd_units plugin and the corresponding systemctl output:

# systemctl show --all --type=service --property Id 'test'|grep "test.service"
Id=test.service

# systemctl show --all --type=service --property Id 'test*'|grep "test.service"

# systemctl show --all --type=service --property Id '*'|grep "test.service"

I found no mention of this behaviour in the systemd man-page or the issue tracker.
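
The comparison above can be scripted so that it is easy to repeat on another host. The sketch below reuses the exact systemctl arguments from the commands shown and checks whether a unit is returned for an explicit name and for a glob. The helper names are hypothetical, and the sketch only works on a host running systemd. As far as I can tell, systemctl(1) describes glob patterns as matching only units currently in memory, which would be consistent with the empty results for 'test*' and '*'; treat that reading as an assumption to verify against the man page for your systemd version.

```python
import subprocess
from typing import Dict, List


def build_show_command(pattern: str) -> List[str]:
    """Return the argv from the comment above for one unit name or glob."""
    return ["systemctl", "show", "--all", "--type=service",
            "--property", "Id", pattern]


def parse_ids(output: str) -> List[str]:
    """Extract unit names from `Id=<name>` lines of `systemctl show` output."""
    ids = []
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "Id" and value:
            ids.append(value)
    return ids


def query_unit_ids(pattern: str) -> List[str]:
    """Run systemctl for one pattern; requires a host running systemd."""
    result = subprocess.run(build_show_command(pattern),
                            capture_output=True, text=True, check=False)
    return parse_ids(result.stdout)


def compare(unit: str, glob: str) -> Dict[str, bool]:
    """Report whether `unit` is returned for its exact name and for `glob`."""
    full_name = unit if "." in unit else unit + ".service"
    return {
        "exact": full_name in query_unit_ids(unit),
        "glob": full_name in query_unit_ids(glob),
    }
```

On a host showing the behaviour described here, `compare("test", "test*")` would be expected to report the unit for the exact name but not for the glob.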

1tft commented 9 months ago

@FlashSystems thank you for this important information.

I can also confirm with systemd v239 that systemctl status testservice* does not produce results for an inactive (dead) and disabled service, BUT it does produce results for a failed and disabled service.

We should at least mention this strange behaviour in the inputs.systemd_units README.md if we don't find a workaround.

FlashSystems commented 9 months ago

@powersj: Should this be mentioned within the readme? Systemd is a fast moving target and the behaviour can change at any time. The readme already states the command that is used for gathering information. Anybody should be able to do the tests I did. Maybe I should add a "troubleshooting" section that suggests trying out the bare systemctl command to check if it's a systemd or a telegraf problem. What do you think?

PS: I mistyped the systemd version. I tested 255 (the current version).

powersj commented 9 months ago

@FlashSystems thanks for tracking the root cause down. I would like to add something to the README, as the pattern option makes it seem that you can put any pattern in there and it should work. A troubleshooting section with this gotcha, plus how users can use the CLI to verify things, would be a wonderful addition.

Thanks!

knollet commented 9 months ago

@powersj: Should this be mentioned within the readme? Systemd is a fast moving target and the behaviour can change at any

Actually, it is not. There is documented API stability for it^1. Perhaps querying should, in general, not be done by calling systemctl, but by dbus, which is

  1. promised to be stable,
  2. possibly better equipped to return structured data (rather than requesting a CSV with uncertain quoting that is then parsed quick-and-dirty), and
  3. possibly better configurable regarding permissions.

FlashSystems commented 9 months ago

@knollet: I agree with you. Using the DBus API would be a very nice solution. Especially because this plugin is only doing simple pattern matching for filtering. This would also get rid of spawning an additional process for each collection cycle.
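
To illustrate the idea of doing the pattern matching in the collector rather than in systemctl, here is a minimal sketch with plain dictionaries standing in for D-Bus replies. The systemd manager interface does offer ListUnits and ListUnitFiles methods, but the data shapes, the function name, and the choice to report file-only units as "inactive" are all assumptions of this sketch, not a description of how the plugin works.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List


def select_units(loaded: Dict[str, str], unit_files: Dict[str, str],
                 patterns: Iterable[str]) -> List[dict]:
    """Merge loaded units with on-disk unit files, then filter by glob.

    loaded:     unit name -> active state (stand-in for a ListUnits reply)
    unit_files: unit name -> enablement state (stand-in for ListUnitFiles)
    """
    pattern_list = list(patterns)
    selected = []
    for name in sorted(set(loaded) | set(unit_files)):
        if not any(fnmatchcase(name, p) for p in pattern_list):
            continue
        selected.append({
            "name": name,
            # Units known only from their unit file are not in memory;
            # reporting them as "inactive" is an assumption of this sketch.
            "active": loaded.get(name, "inactive"),
            "uf_state": unit_files.get(name, ""),
        })
    return selected
```

Because the glob is applied to the union of both sources, a disabled and never-started unit such as telegraf.service would still match "telegraf*". Note that this sketch does not append ".service" to bare names the way systemctl does.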

But for now I'll extend the documentation as @powersj suggested as soon as my time permits.

Regarding the stability promise: My experience from using systemd for years in big setups is that the stability promise only covers the documented interfaces. There are enough bugs in systemd that you have to apply a workaround from time to time. And fixing bugs in a way that changes behaviour is not covered by the stability promise.

srebhan commented 9 months ago

Already on it...

srebhan commented 9 months ago

Could everybody please test the binary in #14814 and provide feedback? Please also check multi-instance units (the ones with an @ in the name)!

Please also verify that starting/stopping and enabling/disabling units, as well as creating new units while Telegraf is running, work as expected!

TheNemesis584 commented 5 months ago

I can confirm that systemd_units doesn't send any metrics if a service is stopped and it is located in /lib/systemd, while a service from /etc/systemd/system works fine.

srebhan commented 5 months ago

@TheNemesis584 if you do see a problem, please open a new issue with your (redacted) configuration!