influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

procstat_lookup: "running" field always 0 on debian 12 #15698

Closed: kaistierl closed this issue 3 months ago

kaistierl commented 3 months ago

Relevant telegraf.conf

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://influxdb:8650"]
  database = "telegraf_myhost"

[[inputs.procstat]]
  systemd_unit = "elasticsearch.service"
[[inputs.procstat]]
  systemd_unit = "kibana.service"
[[inputs.procstat]]
  systemd_unit = "mariadb.service"
[[inputs.procstat]]
  systemd_unit = "apache2.service"

Logs from Telegraf

Jul 31 17:55:38 vagrant-tos-core systemd[1]: Starting telegraf.service - Telegraf...
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loading config: /etc/telegraf/telegraf.conf
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loading config: /etc/telegraf/telegraf.d/basic-plugins.conf
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loading config: /etc/telegraf/telegraf.d/elasticsearch-monitoring.conf
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loading config: /etc/telegraf/telegraf.d/mariadb-monitoring.conf
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loading config: /etc/telegraf/telegraf.d/rabbitmq-monitoring.conf
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loading config: /etc/telegraf/telegraf.d/service-monitoring.conf
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Starting Telegraf 1.29.5 brought to you by InfluxData the makers of InfluxDB
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loaded inputs: chrony cpu disk diskio elasticsearch kernel mem mysql net netstat processes procstat (43x) rabbitmq swap system
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loaded aggregators:
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loaded processors:
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loaded secretstores:
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Loaded outputs: influxdb
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! Tags enabled: host=vagrant-tos-core
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"vagrant-tos-core", Flush Interval:10s
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z D! [agent] Initializing plugins
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z W! DeprecationWarning: Value "false" for option "ignore_protocol_stats" of plugin "inputs.net" deprecated since version 1.27.3 and will be removed in 1.36.0: use the 'inputs.nstat'>
Jul 31 17:55:41 vagrant-tos-core systemd[1]: Started telegraf.service - Telegraf.
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z D! [agent] Connecting outputs
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z D! [agent] Attempting connection to [outputs.influxdb]
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z D! [agent] Successfully connected to outputs.influxdb
Jul 31 17:55:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:41Z D! [agent] Starting service inputs
Jul 31 17:55:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0d7845b3d15f"): permission denied
Jul 31 17:55:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/d797917a7fb0"): permission denied
Jul 31 17:55:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/5efd30c6c09e"): permission denied
Jul 31 17:55:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0c1dc71534d0"): permission denied
Jul 31 17:55:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/boot/efi"): permission denied
Jul 31 17:55:51 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:51Z D! [outputs.influxdb] Wrote batch of 66 metrics in 11.266508ms
Jul 31 17:55:51 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:55:51Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] Requesting "http://tos-rabbitmq-broker-mgmt:8620/api/nodes"...
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] Requesting "http://tos-rabbitmq-broker-mgmt:8620/api/queues"...
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] Requesting "http://tos-rabbitmq-broker-mgmt:8620/api/exchanges"...
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] Requesting "http://tos-rabbitmq-broker-mgmt:8620/api/federation-links"...
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] Requesting "http://tos-rabbitmq-broker-mgmt:8620/api/overview"...
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] HTTP status code: 200 OK
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0d7845b3d15f"): permission denied
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/d797917a7fb0"): permission denied
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/5efd30c6c09e"): permission denied
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0c1dc71534d0"): permission denied
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/boot/efi"): permission denied
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] HTTP status code: 404 Not Found
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z E! [inputs.rabbitmq] Error in plugin: getting "/api/federation-links" failed: 404 Not Found
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] HTTP status code: 200 OK
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] HTTP status code: 200 OK
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] Requesting "http://tos-rabbitmq-broker-mgmt:8620/api/nodes/rabbit@vagrant-tos-core/memory"...
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] HTTP status code: 200 OK
Jul 31 17:56:00 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:00Z D! [inputs.rabbitmq] HTTP status code: 200 OK
Jul 31 17:56:02 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:02Z D! [outputs.influxdb] Wrote batch of 927 metrics in 576.090728ms
Jul 31 17:56:02 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:02Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
Jul 31 17:56:10 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0d7845b3d15f"): permission denied
Jul 31 17:56:10 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/d797917a7fb0"): permission denied
Jul 31 17:56:10 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/5efd30c6c09e"): permission denied
Jul 31 17:56:10 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0c1dc71534d0"): permission denied
Jul 31 17:56:10 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/boot/efi"): permission denied
Jul 31 17:56:11 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:11Z D! [outputs.influxdb] Wrote batch of 75 metrics in 13.680321ms
Jul 31 17:56:11 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:11Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
Jul 31 17:56:20 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:20Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0d7845b3d15f"): permission denied
Jul 31 17:56:20 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:20Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/d797917a7fb0"): permission denied
Jul 31 17:56:20 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:20Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/5efd30c6c09e"): permission denied
Jul 31 17:56:20 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:20Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0c1dc71534d0"): permission denied
Jul 31 17:56:20 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:20Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/boot/efi"): permission denied
Jul 31 17:56:21 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:21Z D! [outputs.influxdb] Wrote batch of 75 metrics in 14.445491ms
Jul 31 17:56:21 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:21Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
Jul 31 17:56:30 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:30Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0d7845b3d15f"): permission denied
Jul 31 17:56:30 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:30Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/d797917a7fb0"): permission denied
Jul 31 17:56:30 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:30Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/5efd30c6c09e"): permission denied
Jul 31 17:56:30 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:30Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0c1dc71534d0"): permission denied
Jul 31 17:56:30 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:30Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/boot/efi"): permission denied
Jul 31 17:56:31 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:31Z D! [outputs.influxdb] Wrote batch of 75 metrics in 14.176389ms
Jul 31 17:56:31 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:31Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
Jul 31 17:56:40 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:40Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0d7845b3d15f"): permission denied
Jul 31 17:56:40 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:40Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/d797917a7fb0"): permission denied
Jul 31 17:56:40 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:40Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/5efd30c6c09e"): permission denied
Jul 31 17:56:40 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:40Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0c1dc71534d0"): permission denied
Jul 31 17:56:40 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:40Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/boot/efi"): permission denied
Jul 31 17:56:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:41Z D! [outputs.influxdb] Wrote batch of 75 metrics in 10.838343ms
Jul 31 17:56:41 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:41Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
Jul 31 17:56:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0d7845b3d15f"): permission denied
Jul 31 17:56:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/d797917a7fb0"): permission denied
Jul 31 17:56:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/5efd30c6c09e"): permission denied
Jul 31 17:56:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/0c1dc71534d0"): permission denied
Jul 31 17:56:50 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:50Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/boot/efi"): permission denied
Jul 31 17:56:51 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:51Z D! [outputs.influxdb] Wrote batch of 75 metrics in 10.111057ms
Jul 31 17:56:51 vagrant-tos-core telegraf[1729445]: 2024-07-31T15:56:51Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics

System info

Telegraf 1.29.5, Debian 12, systemd 252 (252.26-1~deb12u2)

Docker

No response

Steps to reproduce

Nothing special. Just monitor a systemd-based service with Telegraf, as in the config given above.

Expected behavior

The "running" field should be set to 1 for systemd units that are running

Actual behavior

The "running" field is always 0. Interestingly "pid_count" is correct - for instance for my apache2 service, it is 1 when the service is running and 0 if it was stopped.

Additional info

No response

powersj commented 3 months ago

Nothing special.

Given the hostname in your logs, are you running in Vagrant? Can you reproduce this on a bare-metal system rather than in a Vagrant machine?

1.29.5

Please try v1.31.1

Just try to monitor a systemd based service with telegraf like in the config given above.

This looks good to me:

[agent]
  debug = true
  omit_hostname = true

[[inputs.procstat]]
  systemd_unit = "incus.service"

[[inputs.procstat]]
  systemd_unit = "chronyd.service"

[[inputs.procstat]]
  systemd_unit = "containerd.service"

[[outputs.file]]

$ ./telegraf --config config.toml --once
2024-07-31T16:24:46Z I! Loading config: config.toml
2024-07-31T16:24:46Z I! Starting Telegraf 1.32.0-094eff6a brought to you by InfluxData the makers of InfluxDB
2024-07-31T16:24:46Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 6 secret-stores
2024-07-31T16:24:46Z I! Loaded inputs: procstat (3x)
2024-07-31T16:24:46Z I! Loaded aggregators:
2024-07-31T16:24:46Z I! Loaded processors:
2024-07-31T16:24:46Z I! Loaded secretstores:
2024-07-31T16:24:46Z I! Loaded outputs: file
2024-07-31T16:24:46Z I! Tags enabled:
2024-07-31T16:24:46Z D! [agent] Initializing plugins
2024-07-31T16:24:46Z D! [agent] Connecting outputs
2024-07-31T16:24:46Z D! [agent] Attempting connection to [outputs.file]
2024-07-31T16:24:46Z D! [agent] Successfully connected to outputs.file
2024-07-31T16:24:46Z D! [agent] Starting service inputs
2024-07-31T16:24:46Z D! [agent] Stopping service inputs
2024-07-31T16:24:46Z D! [agent] Input channel closed
2024-07-31T16:24:46Z I! [agent] Hang on, flushing any cached metrics before shutdown
procstat,process_name=containerd,systemd_unit=containerd.service voluntary_context_switches=108i,involuntary_context_switches=1i,minor_faults=2812i,child_minor_faults=6947i,num_threads=20i,major_faults=38i,child_major_faults=4i,cpu_time_system=1.61,pid=905i,ppid=1i,status="sleep",created_at=1722429983000000000i,cpu_time_user=2.1,memory_vms=2748436480i,memory_usage=0.07799071818590164,cpu_time_iowait=0,cpu_usage=0,memory_rss=52510720i,cmdline="/usr/bin/containerd",user="root" 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=containerd.service pid_count=1i,running=1i,result_code=0i 1722443087000000000
procstat,process_name=incusd,systemd_unit=incus.service memory_vms=7548514304i,voluntary_context_switches=99i,cpu_time_user=2.13,cpu_time_system=0.65,memory_rss=159252480i,involuntary_context_switches=2i,child_major_faults=30i,pid=1484i,cpu_time_iowait=0,cmdline="/usr/bin/incusd --group=incus-admin --logfile=/var/log/incus/incusd.log",ppid=1i,status="sleep",num_threads=30i,minor_faults=29518i,major_faults=10i,child_minor_faults=22705i,created_at=1722429988000000000i,cpu_usage=0,memory_usage=0.236527219414711,user="root" 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=incus.service pid_count=1i,running=1i,result_code=0i 1722443087000000000
procstat,process_name=chronyd,systemd_unit=chronyd.service cmdline="/usr/bin/chronyd",voluntary_context_switches=1652i,minor_faults=187i,cpu_time_system=0.1,cpu_usage=0,memory_rss=4112384i,involuntary_context_switches=3i,cpu_time_iowait=0,memory_vms=87539712i,memory_usage=0.006107853259891272,status="sleep",num_threads=1i,major_faults=3i,pid=857i,user="chrony",child_minor_faults=0i,child_major_faults=0i,created_at=1722429982000000000i,cpu_time_user=0.01,ppid=1i 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=chronyd.service running=1i,result_code=0i,pid_count=1i 1722443087000000000
2024-07-31T16:24:46Z D! [outputs.file] Wrote batch of 6 metrics in 74.47µs
2024-07-31T16:24:46Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2024-07-31T16:24:46Z I! [agent] Stopping running outputs
2024-07-31T16:24:46Z D! [agent] Stopped Successfully

Specifically:

procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=containerd.service pid_count=1i,running=1i,result_code=0i 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=incus.service pid_count=1i,running=1i,result_code=0i 1722443087000000000
procstat_lookup,pid_finder=pgrep,result=success,systemd_unit=chronyd.service running=1i,result_code=0i,pid_count=1i 1722443087000000000

All have running=1, a PID count of 1, and a result code of 0, which is what I would expect.

kaistierl commented 3 months ago

I currently have no bare-metal Debian 12 machine at hand, but I found out something interesting by trying it out the way you did: I added [[outputs.file]] to the config and then, as root, invoked Telegraf by hand using /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d --debug -once. On my terminal I could then see output lines from the procstat_lookup plugin with running=1, like this one:

procstat_lookup,host=vagrant-tos-core,pid_finder=pgrep,result=success,systemd_unit=apache2.service running=1i,result_code=0i,pid_count=1i 1722494248000000000

So generally, it seems to work. Interestingly, when I leave the config as-is and instead start Telegraf via the systemd unit (I'm using the official Debian package), the journal log shows running=0:

Aug 01 08:40:34 vagrant-tos-core telegraf[2056500]: procstat_lookup,host=vagrant-tos-core,pid_finder=pgrep,result=success,systemd_unit=apache2.service pid_count=1i,running=0i,result_code=0i 1722494430000000000

It turns out that when I remove User=telegraf from the systemd unit, it works again. So it must be some issue that only occurs when Telegraf runs under its own user account. I'll investigate further and see if I can find any more hints about what causes this in detail...
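One way to investigate a per-user difference like this is to compare how many PIDs each account can actually see in /proc, since the plugin's pgrep finder ultimately relies on /proc being enumerable. A minimal sketch (run it as root and as the service user; the "telegraf" account name is the one the Debian package creates and is an assumption here):

```shell
# Count the numeric entries in /proc, i.e. the PIDs visible to the
# current user. If /proc is mounted with a hardening option such as
# hidepid, an unprivileged account sees far fewer PIDs than root does.
visible_pids=$(ls /proc | grep -c '^[0-9][0-9]*$')
echo "PIDs visible to $(id -un): $visible_pids"
```

Running this once directly and once via something like `sudo -u telegraf sh -c '...'` should show whether the telegraf account is being denied process visibility.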

kaistierl commented 3 months ago

It must have something to do with my system configuration. I spun up a fresh Debian 12 instance and could not reproduce the issue there. It might be some hardening that interferes here. I'll continue investigating and let you know as soon as I find the root cause.

kaistierl commented 3 months ago

I got to the bottom of it! On my hardened machine, the proc filesystem had the hidepid=2 option set, which makes it impossible for unprivileged users to enumerate other users' processes. This obviously breaks the procstat plugin.

I solved it by removing this setting. Thanks for your support!

powersj commented 3 months ago

Awesome, thanks for following up and letting us know the root cause!