influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.52k stars 5.56k forks source link

Some disk usage reported incorrectly (Docker) #10915

Closed danepowell closed 10 months ago

danepowell commented 2 years ago

Relevant telegraf.conf

[agent]
  flush_interval = "15s"
  interval = "15s"
  hostname="daenerys"

[[inputs.cpu]]
  totalcpu = true
  report_active = true

[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]

  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs", "nsfs"]

[[inputs.temp]]

[[outputs.influxdb_v2]]
  organization = "daenerys"
  bucket = "tig"
  urls = [ "http://influxdb:8086" ]
  token = "$INFLUX_TOKEN"

Logs from Telegraf

telegraf --input-filter=disk --test --debug
2022-03-29T20:38:24Z I! Using config file: /etc/telegraf/telegraf.conf
2022-03-29T20:38:24Z I! Starting Telegraf 1.22.0
2022-03-29T20:38:24Z I! Loaded inputs: disk
2022-03-29T20:38:24Z I! Loaded aggregators:
2022-03-29T20:38:24Z I! Loaded processors:
2022-03-29T20:38:24Z W! Outputs are not used in testing mode!
2022-03-29T20:38:24Z I! Tags enabled: host=daenerys
2022-03-29T20:38:24Z D! [agent] Initializing plugins
2022-03-29T20:38:24Z D! [agent] Starting service inputs
2022-03-29T20:38:24Z D! [agent] Stopping service inputs
2022-03-29T20:38:24Z D! [agent] Input channel closed
2022-03-29T20:38:24Z D! [agent] Stopped Successfully
> disk,device=mapper/ubuntu--vg-ubuntu--lv,fstype=ext4,host=daenerys,mode=rw,path=/ free=0i,inodes_free=0i,inodes_total=10849i,inodes_used=10849i,total=58327040i,used=58327040i,used_percent=100 1648586305000000000
> disk,device=nvme0n1p2,fstype=ext4,host=daenerys,mode=rw,path=/boot free=0i,inodes_free=0i,inodes_total=10849i,inodes_used=10849i,total=58327040i,used=58327040i,used_percent=100 1648586305000000000
> disk,device=nvme0n1p1,fstype=vfat,host=daenerys,mode=rw,path=/boot/efi free=0i,inodes_free=0i,inodes_total=10849i,inodes_used=10849i,total=58327040i,used=58327040i,used_percent=100 1648586305000000000
> disk,device=sda1,fstype=ext4,host=daenerys,mode=rw,path=/home/dane/media free=1475247816704i,inodes_free=244183638i,inodes_total=244187136i,inodes_used=3498i,total=3936862621696i,used=2261561303040i,used_percent=60.5211888156314 1648586305000000000

System info

Telegraf 1.22.0 Docker 20.10.12 Ubuntu 20.04.4

Docker

 telegraf:
    image: telegraf:latest
    container_name: telegraf
    volumes:
    - ${CONFIGFOLDER}/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf
    - /:/hostfs:ro
    networks:
    - tig
    env_file: .telegraf.env
    environment:
      HOST_ETC: /hostfs/etc
      HOST_PROC: /hostfs/proc
      HOST_MOUNT_PREFIX: /hostfs
    restart: ${RESTARTPOLICY}

Steps to reproduce

I have Telegraf running via Docker on an Ubuntu host. The host is running on an NVMe drive mounted at / and also has a SATA drive mounted at /home/dane/media.

Run telegraf --input-filter=disk --test --debug in the container.

Expected behavior

See correct disk usage for both disks.

Actual behavior

Usage on the SATA drive is reported correctly, but the root drive is incorrectly shown as 100% full.

Additional info

No response

powersj commented 2 years ago

Hi,

Can you provide the output of lsblk and df -h to show us what that drive really should show?

Thanks

danepowell commented 2 years ago

Sure, thanks. I'll remove irrelevant volumes (loop / snap, etc)

$ lsblk
NAME                      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                         8:0    0   3.7T  0 disk
└─sda1                      8:1    0   3.7T  0 part /home/dane/media
nvme0n1                   259:0    0   477G  0 disk
├─nvme0n1p1               259:1    0   512M  0 part /boot/efi
├─nvme0n1p2               259:2    0     1G  0 part /boot
└─nvme0n1p3               259:3    0 475.4G  0 part
  └─ubuntu--vg-ubuntu--lv 253:0    0 475.4G  0 lvm  /
$ df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv  467G   54G  390G  13% /
/dev/nvme0n1p2                     976M  208M  702M  23% /boot
/dev/nvme0n1p1                     511M  5.3M  506M   2% /boot/efi
/dev/sda1                          3.6T  2.1T  1.4T  61% /home/dane/media
powersj commented 2 years ago

Great thanks!

If you run Telegraf directly on your host, do you get the same result showing a full /, /boot, and /boot/efi or is it only full in the container, or both?

What command are you using to launch the container? When you run the container, are you mounting any of the host filesystems into the container beyond what is there by default?

Thanks again

danepowell commented 2 years ago

Running from the host, usage is calculated correctly. It looks like it also parses the root device name slightly differently (dm-0 instead of mapper/ubuntu--vg-ubuntu--lv.

$ TELEGRAF_CONFIG_PATH=./telegraf.conf ./usr/bin/telegraf --input-filter=disk --test --debug
2022-03-30T14:49:45Z I! Using config file: ./telegraf.conf
2022-03-30T14:49:45Z I! Starting Telegraf 1.22.0
2022-03-30T14:49:45Z I! Loaded inputs: disk
2022-03-30T14:49:45Z I! Loaded aggregators:
2022-03-30T14:49:45Z I! Loaded processors:
2022-03-30T14:49:45Z W! Outputs are not used in testing mode!
2022-03-30T14:49:45Z I! Tags enabled: host=daenerys
2022-03-30T14:49:45Z D! [agent] Initializing plugins
2022-03-30T14:49:45Z D! [agent] Starting service inputs
2022-03-30T14:49:45Z D! [agent] Stopping service inputs
2022-03-30T14:49:45Z D! [agent] Input channel closed
2022-03-30T14:49:45Z D! [agent] Stopped Successfully
> disk,device=dm-0,fstype=ext4,host=daenerys,mode=rw,path=/ free=419502268416i,inodes_free=30477627i,inodes_total=31162368i,inodes_used=684741i,total=501405966336i,used=56362274816i,used_percent=11.844184572608826 1648651785000000000
> disk,device=nvme0n1p2,fstype=ext4,host=daenerys,mode=rw,path=/boot free=735690752i,inodes_free=65228i,inodes_total=65536i,inodes_used=308i,total=1023303680i,used=217149440i,used_percent=22.789701969246906 1648651785000000000
> disk,device=nvme0n1p1,fstype=vfat,host=daenerys,mode=rw,path=/boot/efi free=530321408i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=535805952i,used=5484544i,used_percent=1.0236063969666391 1648651785000000000
> disk,device=sda1,fstype=ext4,host=daenerys,mode=rw,path=/home/dane/media free=1477654212608i,inodes_free=244183638i,inodes_total=244187136i,inodes_used=3498i,total=3936862621696i,used=2259154907136i,used_percent=60.45679173708422 1648651785000000000

I'm running everything via docker-compose up -d, so what you see in the Docker compose file in the OP is exactly what's being mounted (just the root FS, /).

Inspecting /proc/mounts on the container vs host, it's way more verbose inside the container. There's all sorts of extra mounts I wouldn't expect to see if Docker was just dumbly mounting into /hostfs:

$ docker exec -it telegraf cat /hostfs/proc/mounts
<snip, what is this junk?>
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/lib/modules ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/lib/firmware ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/usr/src ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/var/log ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/media ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/mnt ext4 rw,relatime 0 0
powersj commented 2 years ago

Running from the host, usage is calculated correctly.

Great, that is good to know that it works outside the container.

It looks like it also parses the root device name slightly differently (dm-0 instead of mapper/ubuntu--vg-ubuntu--lv.

This is odd. It looks like gopsutil reads this from /sys:

$ cat /sys/block/dm-0/dm/name 
ubuntu--vg-ubuntu--lv

I'm running everything via docker-compose up -d, so something is going on when reviewing the disks in the container.

Thanks for confirming that. I modified it slightly (i.e. networks, restart, and env_file) and ran it with my Ubuntu VM that is also using LVM and got the results that I think I expected:

telegraf    | 2022-03-30T16:37:46Z I! Starting Telegraf 1.22.0
<snip>
telegraf    | disk,device=mapper/ubuntu--vg-ubuntu--lv,fstype=ext4,host=52fdbb17d4cf,mode=rw,path=/ used_percent=51.94518664157201,inodes_total=770048i,inodes_free=676203i,inodes_used=93845i,total=12347138048i,free=5622292480i,used=6077456384i 1648658270000000000
telegraf    | disk,device=vda2,fstype=ext4,host=52fdbb17d4cf,mode=rw,path=/boot total=1551745024i,free=1339252736i,used=115187712i,used_percent=7.91972694092732,inodes_total=98304i,inodes_free=97992i,inodes_used=312i 1648658270000000000

Which does appear to align with the output from df -h:

/dev/mapper/ubuntu--vg-ubuntu--lv   12G  5.9G  5.1G  54% /
/dev/vda2                          1.5G  110M  1.3G   8% /boot

Inspecting /proc/mounts on the container vs host, it's way more verbose inside the container. There's all sorts of extra mounts I wouldn't expect to see if Docker was just dumbly mounting into /hostfs:

gopsutil will attempt to read from /proc/1/mountinfo and fallback to /proc/self/mountinfo. Here is mine for example:

# cat /proc/1/mounts  | grep mapper
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs ext4 ro,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /etc/resolv.conf ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /etc/hostname ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /etc/hosts ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /etc/telegraf/telegraf.conf ext4 rw,relatime 0 0

This makes a bit more sense than seeing /var/log and /usr/src. Is one of your environment variables mounting other things in your docker container? Any other odd mounts?

I am at a bit of a loss, but my next thing would be try with just docker, away from the compose in case some additional variable is messing with things, and run:

docker run -d --name=telegraf \
    -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro \
    -v /:/hostfs:ro \
    -e HOST_ETC=/hostfs/etc \
    -e HOST_PROC=/hostfs/proc \
    -e HOST_SYS=/hostfs/sys \
    -e HOST_VAR=/hostfs/var \
    -e HOST_RUN=/hostfs/run \
    -e HOST_MOUNT_PREFIX=/hostfs \
    telegraf
danepowell commented 2 years ago
docker run -d --name=telegraf2 \
    -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro \
    -v /:/hostfs:ro \
    -e HOST_ETC=/hostfs/etc \
    -e HOST_PROC=/hostfs/proc \
    -e HOST_SYS=/hostfs/sys \
    -e HOST_VAR=/hostfs/var \
    -e HOST_RUN=/hostfs/run \
    -e HOST_MOUNT_PREFIX=/hostfs \
    telegraf

Leads to the exact same results as using Docker Composer. Weird, right?

And then there's this:

$ docker exec -it telegraf2 cat /proc/1/mounts  | grep mapper
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/etc ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/etc/#fstab# ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/etc/.pwd.lock ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/etc/NetworkManager ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/etc/PackageKit ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/etc/UPower ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/etc/X11 ext4 rw,relatime 0 0
/dev/mapper/ubuntu--vg-ubuntu--lv /hostfs/etc/adduser.conf ext4 rw,relatime 0 0
<snip>

Am I reading this right that it's essentially mounting every single inode individually? As someone not overly familiar with the guts of Docker volume mounts, that seems wrong.

I'm running Docker 20.10.12 btw, since it seems we're getting into "it works on my machine, why not yours?" territory 😄

powersj commented 2 years ago

Leads to the exact same results as using Docker Composer. Weird, right?

Interesting, I'm a little happy that it reproduced the issue as it means it is not some weird compose/services file setting causing this and instead possible some setting or configuration with Docker.

Am I reading this right that it's essentially mounting every single inode individually?

Wow, that does seem quite unexpected. Can you gather docker info?

danepowell commented 2 years ago
$ docker info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 22
  Running: 21
  Paused: 0
  Stopped: 1
 Images: 34
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc version:
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-105-generic
 Operating System: Ubuntu Core 18
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.35GiB
 Name: daenerys
 ID: FUGJ:HZRT:PLGS:SGWD:XRKN:EVPE:U3ZH:XGQS:CLJM:W5HH:WGTC:W4LX
 Docker Root Dir: /var/snap/docker/common/var-lib-docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
powersj commented 2 years ago

Operating System: Ubuntu Core 18

Aaaaaaah you are running docker as a snap. Once I switch to that I can reproduce this! Let me look into this further now. Just FYI I am out for the next week, so it might take a bit of time to get back to you.

Thanks for the back and forth, hopefully, we can figure out a solution from here.

danepowell commented 2 years ago

Telegraf as a snap

Docker as a snap, yep. This is not the first time that snaps have caused me trouble, maybe using it for Docker was a bad idea 🤕

powersj commented 2 years ago

Telegraf as a snap

Docker as a snap, yep. This is not the first time that snaps have caused me trouble, maybe using it for Docker was a bad idea face_with_head_bandage

Ah yes, I meant docker :)

danepowell commented 2 years ago

Thanks for the help triaging this. I've since uninstalled the Docker Snap and switched to the Apt version and this is no longer an issue.

aratik711 commented 2 years ago

@powersj can you help me understand why docker snap causes an issue ?

powersj commented 2 years ago

@danepowell sorry for the delay, I do want to get back and figure out what is going on here but have been in and out. I am glad to hear the non-snap route works for you.

@aratik711 I have not been able to dig into this yet, but my hunch was that since the docker snap is limited to files under $HOME, some permissions issue or other issue causes problems with bind mounts. Which results in all sorts of entries in /proc/1/mounts.

powersj commented 10 months ago

Hi,

Looking back and older issues. This was an issue with Docker when running as a snap. The disk input plugin will collect data from /proc/self/mounts or the mounted version of that. It appears that when using the snap additional items are also included in that file. As a result, the disk input pluign will also report on everything found in there.

User's who come across this could use ignore_fs to remove certain filesystem items or tagpass to only allow specific device/path/label/etc. from pass through.

I am going to close this as ultimately this is ultimately not an issue with Telegraf, but an affect of using the docker as a snap.