influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

disk plugin reports incorrect per-mount mode for read only disk #6633

Open benschweizer opened 4 years ago

benschweizer commented 4 years ago

Relevant telegraf.conf:

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default, telegraf gather stats for all mountpoints.
  ## Setting mountpoints will restrict the stats to the specified mountpoints.
  # mount_points = ["/"]

  ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
  ## present on /run, /var/run, /dev/shm or /dev).
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]

System info:

20:45 root@ares:/tmp# cat /proc/mounts | grep ro,
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
/dev/mapper/vg1-backup /backup btrfs ro,relatime,compress=lzo,space_cache,subvolid=5,subvol=/ 0 0
/dev/loop7 /snap/core/7917 squashfs ro,nodev,relatime 0 0
/dev/loop11 /snap/lxd/12224 squashfs ro,nodev,relatime 0 0
/dev/loop12 /snap/lxd/12317 squashfs ro,nodev,relatime 0 0
/dev/loop13 /snap/core/8039 squashfs ro,nodev,relatime 0 0
20:45 root@ares:/tmp# /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d --test | grep ro,
2019-11-07T19:45:37Z I! Starting Telegraf 1.12.1
> disk,device=loop7,fstype=squashfs,host=ares.magnumchaos.org,mode=ro,path=/snap/core/7917 free=0i,inodes_free=0i,inodes_total=12829i,inodes_used=12829i,total=93454336i,used=93454336i,used_percent=100 1573155938000000000
> disk,device=loop11,fstype=squashfs,host=ares.magnumchaos.org,mode=ro,path=/snap/lxd/12224 free=0i,inodes_free=0i,inodes_total=1318i,inodes_used=1318i,total=57409536i,used=57409536i,used_percent=100 1573155938000000000
> disk,device=loop12,fstype=squashfs,host=ares.magnumchaos.org,mode=ro,path=/snap/lxd/12317 free=0i,inodes_free=0i,inodes_total=1325i,inodes_used=1325i,total=57540608i,used=57540608i,used_percent=100 1573155938000000000
> disk,device=loop13,fstype=squashfs,host=ares.magnumchaos.org,mode=ro,path=/snap/core/8039 free=0i,inodes_free=0i,inodes_total=12842i,inodes_used=12842i,total=93454336i,used=93454336i,used_percent=100 1573155938000000000

Steps to reproduce:

  1. break a filesystem so that the kernel remounts it read-only
  2. check that /proc/mounts reports it as mounted read-only (ro)
  3. check the telegraf output, which still reports the same filesystem as read-write (mode=rw), which is wrong

Expected behavior:

the disk plugin should report the true value from /proc/mounts

Actual behavior:

the disk plugin relies on https://github.com/shirou/gopsutil/blob/master/disk/disk_linux.go, which probably reads /etc/mount or some other incorrect data source

Additional info:

danielnelson commented 4 years ago

I believe the source of this data is /proc/self/mountinfo, can you add the contents of it?

benschweizer commented 4 years ago

# cat mountinfo | grep backup
117 25 0:41 / /backup rw,relatime shared:65 - btrfs /dev/mapper/vg1-backup ro,compress=lzo,space_cache,subvolid=5,subvol=/
119 25 0:42 / /nobackup rw,relatime shared:69 - btrfs /dev/mapper/vg0-nobackup rw,compress=lzo,space_cache,subvolid=5,subvol=/

I see a difference in field 11 but not in field 6, but the kernel docs are not very elaborate there: https://www.kernel.org/doc/Documentation/filesystems/proc.txt

danielnelson commented 4 years ago

36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
(1)(2)(3)   (4)   (5)      (6)      (7)   (8) (9)   (10)         (11)

(6) mount options:  per mount options
(11) super options:  per super block options

So it seems the difference is between per-mount options and superblock options. man proc refers to man 2 mount for information on superblock options:

From Linux 2.4 onward, some of the above flags are settable on a per-mount basis, while others apply to the superblock of the mounted filesystem, meaning that all mounts of the same filesystem share those flags. (Previously, all of the flags were per-superblock.)

Since Linux 2.6.16, MS_RDONLY can be set or cleared on a per-mount-point basis as well as on the underlying filesystem superblock. The mounted filesystem will be writable only if neither the filesystem nor the mountpoint are flagged as read-only.

So it seems to me that we should report mode=rw only if both the mount and the superblock are rw.

We may want this implemented in gopsutil, I'll open an issue there to discuss how best to handle this.
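
For illustration, the rule that a mount is writable only if neither its per-mount options (field 6) nor its superblock options (field 11) contain ro could be sketched in Go roughly as follows. This is a minimal sketch, not the code used by gopsutil or Telegraf; the names mountEntry, parseMountinfoLine, hasOption, and effectiveMode are hypothetical, and the sample lines are the two mountinfo entries posted earlier in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// mountEntry is a hypothetical type holding the parts of one
// /proc/self/mountinfo line that matter for the ro/rw decision.
type mountEntry struct {
	MountPoint string
	MountOpts  []string // field (6): per-mount options
	SuperOpts  []string // field (11): per-superblock options
}

// parseMountinfoLine splits a mountinfo line at the " - " separator that
// ends the variable-length optional fields, then picks out fields 5, 6, 11.
func parseMountinfoLine(line string) (mountEntry, error) {
	halves := strings.SplitN(line, " - ", 2)
	if len(halves) != 2 {
		return mountEntry{}, fmt.Errorf("missing separator in %q", line)
	}
	pre := strings.Fields(halves[0])
	post := strings.Fields(halves[1])
	if len(pre) < 6 || len(post) < 3 {
		return mountEntry{}, fmt.Errorf("too few fields in %q", line)
	}
	return mountEntry{
		MountPoint: pre[4],
		MountOpts:  strings.Split(pre[5], ","),
		SuperOpts:  strings.Split(post[2], ","),
	}, nil
}

// hasOption reports whether name appears as a whole option in opts.
func hasOption(opts []string, name string) bool {
	for _, o := range opts {
		if o == name {
			return true
		}
	}
	return false
}

// effectiveMode returns "rw" only if neither option list contains "ro".
func effectiveMode(e mountEntry) string {
	if hasOption(e.MountOpts, "ro") || hasOption(e.SuperOpts, "ro") {
		return "ro"
	}
	return "rw"
}

func main() {
	lines := []string{
		"117 25 0:41 / /backup rw,relatime shared:65 - btrfs /dev/mapper/vg1-backup ro,compress=lzo,space_cache,subvolid=5,subvol=/",
		"119 25 0:42 / /nobackup rw,relatime shared:69 - btrfs /dev/mapper/vg0-nobackup rw,compress=lzo,space_cache,subvolid=5,subvol=/",
	}
	for _, l := range lines {
		e, err := parseMountinfoLine(l)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s mode=%s\n", e.MountPoint, effectiveMode(e))
	}
}
```

With the two sample lines, this prints /backup mode=ro and /nobackup mode=rw, which matches what /proc/mounts shows for /backup.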

lassizci commented 4 years ago

Just upgraded to a recent version and I am no longer getting correct metrics for filesystems that were remounted read-only due to disk issues. Gopsutil upstream doesn't seem terribly responsive regarding this.

rdxmb commented 4 years ago

Maybe the title of this issue can be changed, because this problem is not specific to btrfs.

danielnelson commented 4 years ago

Can someone create a pull request for gopsutil exposing the superblock options? I'm not sure if they will accept the change but it would be helpful to provide them something concrete for review.

rdxmb commented 3 years ago

Any updates on this? I use this to monitor read-only remounts after mount problems, which does not work at the moment.

rdxmb commented 3 years ago

If somebody needs a workaround: The following check just counts the number of read-only mounts:

[[inputs.exec]]
  timeout = "5s"
  data_format = "influx"
  commands = ["bash -c '/bin/echo my_checks ro_mounts=$(mount | grep -c ro,)'"]
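
One caveat with the check above: grep -c ro, counts every line containing the substring "ro,", so an option such as errors=remount-ro followed by another option would also be counted. A stricter variant could compare whole options instead. The Go program below is only a sketch of that idea, assuming it is compiled and called from inputs.exec in place of the bash one-liner; the function name countReadOnly is hypothetical, and the output keeps the same my_checks ro_mounts=N line as the original check.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// countReadOnly counts entries in /proc/mounts-style text whose option
// list (the fourth whitespace-separated field) contains "ro" as a whole
// option, so that "errors=remount-ro" is not mistaken for a read-only mount.
func countReadOnly(mounts string) int {
	n := 0
	for _, line := range strings.Split(mounts, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue // blank or malformed line
		}
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "ro" {
				n++
				break
			}
		}
	}
	return n
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Same measurement and field name as the bash workaround.
	fmt.Printf("my_checks ro_mounts=%d\n", countReadOnly(string(data)))
}
```

Like the original one-liner, this counts every read-only mount, including squashfs snaps, so it is mainly useful for alerting when the count changes.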

powersj commented 2 years ago

I have pinged upstream on the above PR, in hopes of getting a clear answer on whether they will take it or not.