kaistierl closed this issue 3 months ago
> Nothing special.
Given the hostname in your logs, are you running in Vagrant? Can you reproduce this on a bare-metal system rather than in a Vagrant machine?
> 1.29.5
Please try v1.31.1
> Just try to monitor a systemd based service with telegraf like in the config given above.
This looks good to me:
[agent]
debug = true
omit_hostname = true
[[inputs.procstat]]
systemd_unit = "incus.service"
[[inputs.procstat]]
systemd_unit = "chronyd.service"
[[inputs.procstat]]
systemd_unit = "containerd.service"
[[outputs.file]]
$ ./telegraf --config config.toml --once
2024-07-31T16:24:46Z I! Loading config: config.toml
2024-07-31T16:24:46Z I! Starting Telegraf 1.32.0-094eff6a brought to you by InfluxData the makers of InfluxDB
2024-07-31T16:24:46Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 6 secret-stores
2024-07-31T16:24:46Z I! Loaded inputs: procstat (3x)
2024-07-31T16:24:46Z I! Loaded aggregators:
2024-07-31T16:24:46Z I! Loaded processors:
2024-07-31T16:24:46Z I! Loaded secretstores:
2024-07-31T16:24:46Z I! Loaded outputs: file
2024-07-31T16:24:46Z I! Tags enabled:
2024-07-31T16:24:46Z D! [agent] Initializing plugins
2024-07-31T16:24:46Z D! [agent] Connecting outputs
2024-07-31T16:24:46Z D! [agent] Attempting connection to [outputs.file]
2024-07-31T16:24:46Z D! [agent] Successfully connected to outputs.file
2024-07-31T16:24:46Z D! [agent] Starting service inputs
2024-07-31T16:24:46Z D! [agent] Stopping service inputs
2024-07-31T16:24:46Z D! [agent] Input channel closed
2024-07-31T16:24:46Z I! [agent] Hang on, flushing any cached metrics before shutdown
procstat,process_name=containerd,systemd_unit=containerd.service voluntary_context_switches=108i,involuntary_context_switches=1i,minor_faults=2812i,child_minor_faults=6947i,num_threads=20i,major_faults=38i,child_major_faults=4i,cpu_time_system=1.61,pid=905i,ppid=1i,status="sleep",created_at=1722429983000000000i,cpu_time_user=2.1,memory_vms=2748436480i,memory_usage=0.07799071818590164,cpu_time_iowait=0,cpu_usage=0,memory_rss=52510720i,cmdline="/usr/bin/containerd",user="root" 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=containerd.service pid_count=1i,running=1i,result_code=0i 1722443087000000000
procstat,process_name=incusd,systemd_unit=incus.service memory_vms=7548514304i,voluntary_context_switches=99i,cpu_time_user=2.13,cpu_time_system=0.65,memory_rss=159252480i,involuntary_context_switches=2i,child_major_faults=30i,pid=1484i,cpu_time_iowait=0,cmdline="/usr/bin/incusd --group=incus-admin --logfile=/var/log/incus/incusd.log",ppid=1i,status="sleep",num_threads=30i,minor_faults=29518i,major_faults=10i,child_minor_faults=22705i,created_at=1722429988000000000i,cpu_usage=0,memory_usage=0.236527219414711,user="root" 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=incus.service pid_count=1i,running=1i,result_code=0i 1722443087000000000
procstat,process_name=chronyd,systemd_unit=chronyd.service cmdline="/usr/bin/chronyd",voluntary_context_switches=1652i,minor_faults=187i,cpu_time_system=0.1,cpu_usage=0,memory_rss=4112384i,involuntary_context_switches=3i,cpu_time_iowait=0,memory_vms=87539712i,memory_usage=0.006107853259891272,status="sleep",num_threads=1i,major_faults=3i,pid=857i,user="chrony",child_minor_faults=0i,child_major_faults=0i,created_at=1722429982000000000i,cpu_time_user=0.01,ppid=1i 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=chronyd.service running=1i,result_code=0i,pid_count=1i 1722443087000000000
2024-07-31T16:24:46Z D! [outputs.file] Wrote batch of 6 metrics in 74.47µs
2024-07-31T16:24:46Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2024-07-31T16:24:46Z I! [agent] Stopping running outputs
2024-07-31T16:24:46Z D! [agent] Stopped Successfully
Specifically:
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=containerd.service pid_count=1i,running=1i,result_code=0i 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=incus.service pid_count=1i,running=1i,result_code=0i 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=chronyd.service running=1i,result_code=0i,pid_count=1i 1722443087000000000
All have 1 running, 1 PID, and a result code of 0, which is what I would expect.
I currently have no bare-metal Debian 12 machine at hand, but I found out something interesting by trying it out like you did: I added [[outputs.file]] to the config and then, as root, invoked telegraf by hand using /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d --debug -once. On my terminal I could then see output lines of the procstat_lookup plugin with running=1, like this one:
procstat_lookup,host=vagrant-tos-core,pid_finder=pgrep,result=success,systemd_unit=apache2.service running=1i,result_code=0i,pid_count=1i 1722494248000000000
So generally, it seems to work. Interestingly, when I leave the config like this and then start telegraf via the systemd unit (I'm using the official Debian package), it looks like this in the journal log, with running=0:
Aug 01 08:40:34 vagrant-tos-core telegraf[2056500]: procstat_lookup,host=vagrant-tos-core,pid_finder=pgrep,result=success,systemd_unit=apache2.service pid_count=1i,running=0i,result_code=0i 1722494430000000000
It turns out that when I remove User=telegraf from the systemd unit, it works again. So it must be some issue that only happens when telegraf is run with its own user account. I'll try to investigate this further and see if I can find any more hints as to what exactly causes this...
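A drop-in created with systemctl edit telegraf is one way to try this without touching the packaged unit file (a sketch, for debugging only; the override path below is just what systemctl edit generates):

# /etc/systemd/system/telegraf.service.d/override.conf
# Debugging only: run the service as root instead of the packaged User=telegraf
[Service]
User=root

followed by a systemctl restart telegraf.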
It must have something to do with my system configuration. I spun up a fresh Debian 12 instance and could not reproduce the issue there. It might be some hardening that interferes here. I'll continue investigating and let you know as soon as I've found the root cause.
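Since procstat_lookup reports pid_finder=pgrep, one quick check while investigating is to compare what root and the telegraf service user can actually see (a sketch, using apache2 as the example unit from above):

# as root: should print the apache2 PID(s) and command lines
sudo pgrep -a apache2
# as the telegraf user: if this prints nothing for a process owned by another user,
# the service account simply cannot see that process in /proc
sudo -u telegraf pgrep -a apache2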
I got to the bottom of it! On my hardened machine, the proc filesystem is mounted with the hidepid=2 option, which hides other users' processes in /proc from unprivileged users. This obviously breaks the procstat plugin when telegraf runs under its own account.
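For anyone hitting the same thing: hidepid is a mount option of /proc, so an fstab entry enabling this hardening looks roughly like this (illustrative, the exact options may differ):

# illustrative fstab entry for the hardening in question
proc  /proc  proc  defaults,hidepid=2  0  0

Instead of dropping the hardening entirely, procfs also supports a gid= mount option that exempts a single group (e.g. hidepid=2,gid=<GID of the telegraf group>), but I haven't tried that here.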
Got it solved by removing this setting, thanks for your support!
Awesome, thanks for following up and letting us know the root cause!
Relevant telegraf.conf
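A minimal sketch of the relevant part, assuming the apache2.service example used in the comments (outputs omitted):

[[inputs.procstat]]
  ## monitor a systemd unit; the "running" field is the one in question
  systemd_unit = "apache2.service"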
Logs from Telegraf
System info
Telegraf 1.29.5, Debian 12, systemd 252 (252.26-1~deb12u2)
Docker
No response
Steps to reproduce
Nothing special. Just try to monitor a systemd based service with telegraf like in the config given above.
Expected behavior
The "running" field should be set to 1 for systemd units that are running
Actual behavior
The "running" field is always 0. Interestingly "pid_count" is correct - for instance for my apache2 service, it is 1 when the service is running and 0 if it was stopped.
Additional info
No response