influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.59k stars 5.56k forks source link

[inputs.diskio] Telegraf spams logs with "Error gathering disk info: no such file or directory" messages when used with `device_tags` #15666

Closed piotrp closed 2 months ago

piotrp commented 2 months ago

Relevant telegraf.conf

[[inputs.diskio]]
  device_tags = ["DUMMY"]

Logs from Telegraf

root@f81de059f08e:/# telegraf --config test.conf --test --debug
2024-07-25T13:54:06Z I! Loading config: test.conf
2024-07-25T13:54:06Z I! Starting Telegraf 1.31.2 brought to you by InfluxData the makers of InfluxDB
2024-07-25T13:54:06Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-07-25T13:54:06Z I! Loaded inputs: diskio
2024-07-25T13:54:06Z I! Loaded aggregators:
2024-07-25T13:54:06Z I! Loaded processors:
2024-07-25T13:54:06Z I! Loaded secretstores:
2024-07-25T13:54:06Z W! Outputs are not used in testing mode!
2024-07-25T13:54:06Z I! Tags enabled: host=f81de059f08e
2024-07-25T13:54:06Z D! [agent] Initializing plugins
2024-07-25T13:54:06Z D! [agent] Starting service inputs
2024-07-25T13:54:06Z W! [inputs.diskio] Error gathering disk info: no such file or directory
2024-07-25T13:54:06Z W! [inputs.diskio] Error gathering disk info: no such file or directory
2024-07-25T13:54:06Z W! [inputs.diskio] Error gathering disk info: no such file or directory
2024-07-25T13:54:06Z W! [inputs.diskio] Error gathering disk info: no such file or directory
2024-07-25T13:54:06Z W! [inputs.diskio] Error gathering disk info: no such file or directory
2024-07-25T13:54:06Z W! [inputs.diskio] Error gathering disk info: no such file or directory
> diskio,host=f81de059f08e,name=sda io_time=80u,iops_in_progress=0u,merged_reads=374u,merged_writes=0u,read_bytes=10712064u,read_time=56u,reads=215u,weighted_io_time=56u,write_bytes=0u,write_time=0u,writes=0u 1721915646000000000
> diskio,host=f81de059f08e,name=sdb io_time=60u,iops_in_progress=0u,merged_reads=0u,merged_writes=37u,read_bytes=1212416u,read_time=11u,reads=51u,weighted_io_time=38u,write_bytes=196608u,write_time=22u,writes=13u 1721915646000000000
2024-07-25T13:54:06Z D! [agent] Stopping service inputs
2024-07-25T13:54:06Z D! [agent] Input channel closed
> diskio,host=f81de059f08e,name=sdc io_time=370u,iops_in_progress=0u,merged_reads=47u,merged_writes=46u,read_bytes=24339456u,read_time=300u,reads=468u,weighted_io_time=663u,write_bytes=339968u,write_time=345u,writes=37u 1721915646000000000
2024-07-25T13:54:06Z D! [agent] Stopped Successfully
> diskio,host=f81de059f08e,name=sdd io_time=6050u,iops_in_progress=0u,merged_reads=15078u,merged_writes=22143u,read_bytes=352351232u,read_time=2482u,reads=8317u,weighted_io_time=75138u,write_bytes=724312064u,write_time=70557u,writes=5304u 1721915646000000000
> diskio,host=f81de059f08e,name=loop0 io_time=20u,iops_in_progress=0u,merged_reads=0u,merged_writes=0u,read_bytes=1103872u,read_time=14u,reads=50u,weighted_io_time=14u,write_bytes=0u,write_time=0u,writes=0u 1721915646000000000
> diskio,host=f81de059f08e,name=loop1 io_time=1600u,iops_in_progress=0u,merged_reads=0u,merged_writes=0u,read_bytes=232229888u,read_time=1109u,reads=2857u,weighted_io_time=1109u,write_bytes=0u,write_time=0u,writes=0u 1721915646000000000

System info

Telegraf 1.31.2, in container.

Docker

No response

Steps to reproduce

Use provided config in unprivileged Docker container (or a container created with machinectl / systemd-nspawn). device_tags must be present in configuration.

Expected behavior

No error logged or a better error message that actually indicates what is missing.

I tried to deploy the same file to real environment and a test container, and noticed that logs in my container contain multiple warnings. Considering that device_tags works on a best-effort basis this warning probably shouldn't be logged.

Actual behavior

Multiple warnings logged on each collection interval:

W! [inputs.diskio] Error gathering disk info: no such file or directory

Additional info

Traces for loop0 device:

root@f81de059f08e:/# strace -f telegraf --config test.conf --test --debug |& grep loop0
[pid   457] read(4, "grep\0loop0\0", 1024) = 11
[pid   453] newfstatat(AT_FDCWD, "/dev/loop0", 0xc000198940, 0) = -1 ENOENT (No such file or directory)
[pid   453] newfstatat(AT_FDCWD, "/sys/block/loop0/dm/name", 0xc000f1a2a8, 0) = -1 ENOENT (No such file or directory)
[pid   453] newfstatat(AT_FDCWD, "/dev/loop0", 0xc0001989b0, 0) = -1 ENOENT (No such file or directory)
[pid   453] newfstatat(AT_FDCWD, "/dev/loop0",  <unfinished ...>
[pid   453] openat(AT_FDCWD, "/sys/block/loop0/wwid", O_RDONLY|O_CLOEXEC <unfinished ...>
> diskio,host=f81de059f08e,name=loop0 io_time=20u,iops_in_progress=0u,merged_reads=0u,merged_writes=0u,read_bytes=1103872u,read_time=14u,reads=50u,weighted_io_time=14u,write_bytes=0u,write_time=0u,writes=0u 1721916516000000000

All ENOENT messages reported by OS:

root@f81de059f08e:/# strace -f telegraf --config test.conf --test --debug |& grep ENOENT
[pid   410] newfstatat(AT_FDCWD, "/run/user/0/bus", 0xc001e78ac8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/run/user/0/dbus-session", 0xc001e78b98, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/local/sbin/dbus-launch", 0xc001e78d38, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/local/bin/dbus-launch", 0xc001e78e08, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/sbin/dbus-launch", 0xc001e78ed8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/bin/dbus-launch", 0xc001e78fa8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/sbin/dbus-launch", 0xc001e79078, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/bin/dbus-launch", 0xc001e79148, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/run/user/0/bus", 0xc001e79218, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/run/user/0/dbus-session", 0xc001e792e8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/local/sbin/dbus-launch", 0xc001e793b8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/local/bin/dbus-launch", 0xc001e79488, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/sbin/dbus-launch", 0xc001e79558, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/bin/dbus-launch", 0xc001e79628, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/sbin/dbus-launch", 0xc001e796f8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/bin/dbus-launch", 0xc001e797c8, 0) = -1 ENOENT (No such file or directory)
[pid   410] connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 11) = -1 ENOENT (No such file or directory)
[pid   410] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/syslog"}, 18) = -1 ENOENT (No such file or directory)
[pid   410] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/log"}, 15) = -1 ENOENT (No such file or directory)
[pid   410] connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 11) = -1 ENOENT (No such file or directory)
[pid   410] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/syslog"}, 18) = -1 ENOENT (No such file or directory)
[pid   410] connect(3, {sa_family=AF_UNIX, sun_path="/var/run/log"}, 15) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/proc/xen", 0xc0016cb3b8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/proc/bc/0", 0xc0016cb6f8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/proc/vz", 0xc0016cb7c8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/local/sbin/pidof", 0xc00211e2a8, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/local/bin/pidof", 0xc00211e378, 0) = -1 ENOENT (No such file or directory)
[pid   410] newfstatat(AT_FDCWD, "/usr/sbin/pidof", 0xc00211e448, 0) = -1 ENOENT (No such file or directory)
[pid   422] <... access resumed>)       = -1 ENOENT (No such file or directory)
[pid   422] read(4, "grep\0ENOENT\0", 1024) = 12
[pid   410] <... newfstatat resumed>0xc00227e2a8, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/loop0", 0xc001b00940, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/sys/block/loop0/dm/name", 0xc001aec1d8, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/loop0", 0xc001b009b0, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/loop1", 0xc001b00940, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/sys/block/loop1/dm/name", 0xc001aec378, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/loop1", 0xc001b009b0, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/sda", 0xc001b00940, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/sys/block/sda/dm/name", 0xc001aec448, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/sda", 0xc001b009b0, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/sdb", 0xc001b00940, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/sys/block/sdb/dm/name", 0xc001aec518, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/sdb", 0xc001b009b0, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/sdc", 0xc001b00940, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/sys/block/sdc/dm/name", 0xc001aec5e8, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b009b0, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/sdd", 0xc001b00940, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/sys/block/sdd/dm/name", 0xc001aec6b8, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/sdd", 0xc001b009b0, 0) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/loop0", 0xc001b00ab8, 0) = -1 ENOENT (No such file or directory)
[pid   420] openat(AT_FDCWD, "/sys/block/loop0/wwid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/loop0", 0xc001b00ae0, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b00ab8, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... openat resumed>)       = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b00ae0, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b00ab8, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... openat resumed>)       = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b00ae0, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b00ab8, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... openat resumed>)       = -1 ENOENT (No such file or directory)
[pid   420] newfstatat(AT_FDCWD, "/dev/sdb", 0xc001b00ae0, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b00ab8, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... openat resumed>)       = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b00ae0, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b00ab8, 0) = -1 ENOENT (No such file or directory)
[pid   420] <... openat resumed>)       = -1 ENOENT (No such file or directory)
[pid   420] <... newfstatat resumed>0xc001b00ae0, 0) = -1 ENOENT (No such file or directory)
powersj commented 2 months ago

You will note that you are getting ENOENT multiple times per device.

The plugin has two places where that error message can be printed, once before diskTags are collected and again after. In both cases, the exact same function call, d.diskInfo(devName) is called.

Because no named templates are defined in your config, the first warning is never printed and we immediately return the deviceName and links without any further processing. Then because you defined a device tag, we run the same call and immediately check the error and print the warning. My question is why we aren't checking the error the first time around as well.

I think we should, and that would actually mean more warnings in your logs, as now we would warn twice per device: once trying to get the disk name and again getting the disk tags.

powersj commented 2 months ago

I've put up https://github.com/influxdata/telegraf/pull/15667 which will only print the messages once and include some additional details about what file was being read. Artifacts that you can use will get added to the PR in 20-30mins by the tiger-bot. Please give those a shot and let me know.

piotrp commented 2 months ago

Works great, I get unique warnings for each device, with helpful details, and it isn't repeated on each metric collection.

2024-07-25T15:31:45Z I! Starting Telegraf 1.32.0-5ab72b53 brought to you by InfluxData the makers of InfluxDB
2024-07-25T15:31:45Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 6 secret-stores
2024-07-25T15:31:45Z I! Loaded inputs: cpu disk diskio elasticsearch internal jolokia2_agent (2x) kernel linux_sysctl_fs logstash mem net nfsclient nstat processes swap system
2024-07-25T15:31:45Z I! Loaded aggregators:
2024-07-25T15:31:45Z I! Loaded processors: starlark (2x) strings
2024-07-25T15:31:45Z I! Loaded secretstores:
2024-07-25T15:31:45Z I! Loaded outputs: influxdb
2024-07-25T15:31:45Z I! Tags enabled: env=tools host=tools
2024-07-25T15:31:45Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"tools", Flush Interval:10s
2024-07-25T15:31:50Z W! [inputs.diskio] Unable to gather disk name for "loop0": error reading /dev/loop0: no such file or directory
2024-07-25T15:31:50Z W! [inputs.diskio] Unable to gather disk tags for "loop0": error reading /dev/loop0: no such file or directory
2024-07-25T15:31:50Z W! [inputs.diskio] Unable to gather disk name for "sda": error reading /dev/sda: no such file or directory
2024-07-25T15:31:50Z W! [inputs.diskio] Unable to gather disk tags for "sda": error reading /dev/sda: no such file or directory
2024-07-25T15:31:50Z W! [inputs.diskio] Unable to gather disk name for "sda1": error reading /dev/sda1: no such file or directory
2024-07-25T15:31:50Z W! [inputs.diskio] Unable to gather disk tags for "sda1": error reading /dev/sda1: no such file or directory
2024-07-25T15:31:50Z W! [inputs.diskio] Unable to gather disk name for "sda2": error reading /dev/sda2: no such file or directory
2024-07-25T15:31:50Z W! [inputs.diskio] Unable to gather disk tags for "sda2": error reading /dev/sda2: no such file or directory