prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

node_filesystem_{size,avail}_bytes report wrong values #1505

Closed x652001 closed 4 years ago

x652001 commented 5 years ago

Host operating system: output of uname -a

Linux us-cdn 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux CentOS Linux release 7.7.1908 (Core)

node_exporter version: output of node_exporter --version

node_exporter, version 0.18.1 (branch: HEAD, revision: 3db77732e925c08f675d7404a8c46466b2ece83e) build user: root@b50852a1acba build date: 20190604-16:41:18 go version: go1.12.5

node_exporter command line flags

/usr/local/bin/node_exporter --collector.systemd --collector.textfile --collector.textfile.directory=/var/lib/node_exporter --web.listen-address=0.0.0.0:9100

Are you running node_exporter in Docker?

No

What did you do that produced an error?

curl localhost:9100/metrics | grep node_filesystem_avail_bytes

# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/mapper/centos-home",fstype="xfs",mountpoint="/home"} 1.795022848e+09
node_filesystem_avail_bytes{device="/dev/mapper/centos-root",fstype="xfs",mountpoint="/"} 5.0804396032e+10
node_filesystem_avail_bytes{device="/dev/sda1",fstype="xfs",mountpoint="/boot"} 4.19434496e+08
node_filesystem_avail_bytes{device="rootfs",fstype="rootfs",mountpoint="/"} 5.0804396032e+10
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 1.795022848e+09`

df -h

Filesystem                Size  Used Avail Use% Mounted on 
/dev/mapper/centos-root   50G  2.7G   48G    6% /
devtmpfs                 1.9G     0  1.9G    0% /dev
tmpfs                    1.9G     0  1.9G    0% /dev/shm
tmpfs                    1.9G  185M  1.7G   10% /run
tmpfs                    1.9G     0  1.9G    0% /sys/fs/cgroup
/dev/mapper/centos-home   46G   31G   16G   67% /home
/dev/sda1                497M   97M  401M   20% /boot
tmpfs                    380M     0  380M    0% /run/user/0

The values for {mountpoint="/home"} and {mountpoint="/run"} are identical, but they differ from what df -h reports.

What did you expect to see?

The value of node_filesystem_avail_bytes{device="/dev/mapper/centos-home",fstype="xfs",mountpoint="/home"} should be the same as the value reported by df -h.

What did you see instead?

node_exporter reports a value for /home that matches the /run tmpfs rather than the correct value shown by df -h.
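
One way to cross-check the two by hand (not part of the original report; /home is only an example mount point) is to have df print the same fields in raw bytes:

df -B1 --output=source,fstype,size,avail,target /home
curl -s localhost:9100/metrics | grep -E 'node_filesystem_(size|avail)_bytes.*mountpoint="/home"'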

SuperQ commented 5 years ago

Are you sure you're running 3.10.0-123? This is the original CentOS 7 kernel from 2014.

This feels like a kernel bug.

discordianfish commented 4 years ago

Assuming this is a kernel bug and closing since there was no response

justinfenn commented 4 years ago

I'm seeing this as well with a current kernel and node-exporter v0.18.1. It's also happening with /home on my system, and it seems to be getting a value from a tmpfs as in the original report.

$ uname -a
Linux server 5.3.7-arch1-1-ARCH #1 SMP PREEMPT Fri Oct 18 00:17:03 UTC 2019 x86_64 GNU/Linux

Data from node-exporter:

node_filesystem_avail_bytes{device="run",fstype="tmpfs",mountpoint="/run"} 8.388939776e+09
node_filesystem_device_error{device="run",fstype="tmpfs",mountpoint="/run"} 0
node_filesystem_files{device="run",fstype="tmpfs",mountpoint="/run"} 2.048422e+06
node_filesystem_files_free{device="run",fstype="tmpfs",mountpoint="/run"} 2.047586e+06
node_filesystem_free_bytes{device="run",fstype="tmpfs",mountpoint="/run"} 8.388939776e+09
node_filesystem_readonly{device="run",fstype="tmpfs",mountpoint="/run"} 0
node_filesystem_size_bytes{device="run",fstype="tmpfs",mountpoint="/run"} 8.390336512e+09
node_filesystem_avail_bytes{device="/dev/nvme0n1p3",fstype="ext4",mountpoint="/home"} 8.388939776e+09
node_filesystem_device_error{device="/dev/nvme0n1p3",fstype="ext4",mountpoint="/home"} 0
node_filesystem_files{device="/dev/nvme0n1p3",fstype="ext4",mountpoint="/home"} 2.048422e+06
node_filesystem_files_free{device="/dev/nvme0n1p3",fstype="ext4",mountpoint="/home"} 2.047586e+06
node_filesystem_free_bytes{device="/dev/nvme0n1p3",fstype="ext4",mountpoint="/home"} 8.388939776e+09
node_filesystem_readonly{device="/dev/nvme0n1p3",fstype="ext4",mountpoint="/home"} 0
node_filesystem_size_bytes{device="/dev/nvme0n1p3",fstype="ext4",mountpoint="/home"} 8.390336512e+09

Output from df -h:

run                      7.9G  1.4M  7.9G   1% /run
/dev/nvme0n1p3           147G   98M  140G   1% /home

justinfenn commented 4 years ago

Just checked with master and it seems to be working correctly. I didn't try to track down the fix, but it looks like it will probably be good with the next release.

discordianfish commented 4 years ago

@justinfenn Thanks for confirming!

justinfenn commented 4 years ago

Sorry to bump an old issue, but I think I know what happened, and maybe it will be useful to someone else who encounters this issue. I was using node_exporter from the Arch package, and that was setting ProtectHome=yes in the unit file. This bug report sounds like basically the same issue as this one, and it was recently fixed.

In my case, when I ran from the master branch to test, I just started node_exporter directly and didn't run it as a service, so I avoided the issue and saw the correct sizes. I just assumed that there had been some code change that fixed the issue, but it was actually a configuration problem the whole time.
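
A quick way to check which hardening options the unit you are actually running carries (a sketch; the unit name node_exporter.service is an assumption and may differ between distros):

systemctl cat node_exporter.service | grep -i protect
systemctl show node_exporter.service -p ProtectHome -p ProtectSystem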

mikegerber commented 11 months ago

Update: I didn't check whether I'm running the latest release. I'm not, so I will update and check again.

I'm seeing this issue with node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959), also running with ProtectHome=yes (and ProtectSystem=full). It seems to get the value for e.g. /home from ... the tmpfs.

I'll investigate further and have a look at the code.

mikegerber commented 11 months ago

It's fixed by ProtectHome=read-only; updating the Ansible Galaxy role takes care of that, if you use it. I didn't check whether node_exporter 1.6.1 fixed a possible regression.
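
If you are not using the Ansible role, a drop-in override should achieve the same thing (a sketch; the unit name is an assumption):

sudo systemctl edit node_exporter.service
# in the editor that opens, add:
#   [Service]
#   ProtectHome=read-only
sudo systemctl restart node_exporter.service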

Why I would still call this a bug if it still exists: with ProtectHome=yes, node_exporter can't access /home. That is expected. It should, however, not report a value in that case. To understand the problem better I ran df inside a ProtectHome=yes service: it does not list /home at all in plain df -h, which is correct; but when asked directly for the mount, e.g. df -h /home, it reports the same incorrect value instead of an error (which might be how it is intended/set up).

Anyway just reporting back in case someone else encounters this.
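
If you would rather get no metric at all than a misleading one for a mount point the service cannot see, the filesystem collector can also be told to skip it (a sketch; the flag name applies to node_exporter >= 1.3, and the regex is only an example):

/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude='^/home($|/)'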

SuperQ commented 11 months ago

@mikegerber It would be interesting to see the difference of why df -h is not listing filesystems with ProtectHome=yes.

mikegerber commented 11 months ago

> @mikegerber It would be interesting to see the difference of why df -h is not listing filesystems with ProtectHome=yes.

It does list filesystems, just not the inaccessible /home. I think this is the best behavior in this configuration.

I ran some commands under sh -x in a test service; the journal output below should illustrate it well:

❯ sudo journalctl -u test-df-protecthome.service | cat
Nov 03 13:53:14 leguin systemd[1]: Starting test-df-protecthome.service - Test df vs. ProtectHome and ProtectSystem...
Nov 03 13:53:14 leguin sh[1089098]: + /usr/bin/df -h
Nov 03 13:53:14 leguin sh[1089098]: Filesystem                                    Size  Used Avail Use% Mounted on
Nov 03 13:53:14 leguin sh[1089098]: /dev/mapper/vg_leguin-root                     45G   37G  5.2G  88% /
Nov 03 13:53:14 leguin sh[1089098]: tmpfs                                         4.0M     0  4.0M   0% /sys/fs/cgroup
Nov 03 13:53:14 leguin sh[1089098]: efivarfs                                      154K   39K  111K  26% /sys/firmware/efi/efivars
Nov 03 13:53:14 leguin sh[1089098]: devtmpfs                                      4.0M     0  4.0M   0% /dev
Nov 03 13:53:14 leguin sh[1089098]: tmpfs                                         7.8G     0  7.8G   0% /dev/shm
Nov 03 13:53:14 leguin sh[1089098]: tmpfs                                         3.1G  2.0M  3.1G   1% /run
Nov 03 13:53:14 leguin sh[1089098]: tmpfs                                         7.8G  3.7M  7.8G   1% /tmp
Nov 03 13:53:14 leguin sh[1089098]: /dev/nvme0n1p3                                474M  264M  182M  60% /boot
Nov 03 13:53:14 leguin sh[1089098]: /dev/nvme0n1p1                                256M   20M  237M   8% /boot/efi
Nov 03 13:53:14 leguin sh[1089098]: /dev/mapper/vg_leguin-halde--tmp               50G  9.2G   40G  19% /halde-tmp
Nov 03 13:53:14 leguin sh[1089098]: /dev/mapper/vg_leguin-srv_backup--archiv       49G  1.5G   45G   4% /srv/backup-archiv
Nov 03 13:53:14 leguin sh[1089098]: /dev/mapper/vg_leguin-var_lib_docker           59G   11G   46G  20% /var/lib/docker
Nov 03 13:53:14 leguin sh[1089098]: /dev/mapper/vg_leguin-var_lib_flatpak          20G  5.8G   13G  32% /var/lib/flatpak
Nov 03 13:53:14 leguin sh[1089098]: /dev/mapper/vg_leguin-var_lib_libvirt_images   99G   44G   51G  47% /var/lib/libvirt/images
Nov 03 13:53:14 leguin sh[1089099]: + ls -d /home
Nov 03 13:53:14 leguin sh[1089100]: /home
Nov 03 13:53:14 leguin sh[1089099]: + ls /home
Nov 03 13:53:14 leguin sh[1089101]: ls: cannot open directory '/home': Permission denied
Nov 03 13:53:14 leguin sh[1089099]: + df -h /home
Nov 03 13:53:14 leguin sh[1089099]: Filesystem      Size  Used Avail Use% Mounted on
Nov 03 13:53:14 leguin sh[1089099]: tmpfs           3.1G  2.0M  3.1G   1% /home
Nov 03 13:53:14 leguin systemd[1]: test-df-protecthome.service: Deactivated successfully.
Nov 03 13:53:14 leguin systemd[1]: Finished test-df-protecthome.service - Test df vs. ProtectHome and ProtectSystem.

(df here is /usr/bin/df; I wasn't very consistent about the path in this test, but I checked, so as not to confuse things.)
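
The unit file itself isn't shown above, but the same check can be reproduced ad hoc with systemd-run, which applies the same sandboxing properties to a one-off command (a sketch; everything beyond the properties named in this thread is illustrative):

sudo systemd-run --wait --pipe -p ProtectHome=yes -p ProtectSystem=full \
  sh -xc 'findmnt -T /home; /usr/bin/df -h /home'

findmnt -T /home should show which filesystem actually backs /home inside the service's mount namespace, which is presumably where the tmpfs numbers come from.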

mikegerber commented 11 months ago

I'll check with node_exporter 1.6.1 again next week (I have a vacation day today 😃), although my immediate problem is solved by the Ansible Galaxy role now configuring ProtectHome=read-only (which is correct in my case; I specifically needed correct values for /home).

mikegerber commented 11 months ago

With node_exporter 1.6.1 running with ProtectHome=yes:

 % curl -s http://localhost:9100/metrics | egrep '^node_exporter_build_info|^node_filesystem_avail_bytes.*home'
node_exporter_build_info{branch="HEAD",goarch="amd64",goos="linux",goversion="go1.20.6",revision="4a1b77600c1873a8233f3ffb55afcedbb63b8d84",tags="netgo osusergo static_build",version="1.6.1"} 1
node_filesystem_avail_bytes{device="/dev/mapper/san-data0",fstype="ext4",mountpoint="/home"} 2.64077406208e+11

That value of ~264 GB is the same value the /run tmpfs on the system reports.

With ProtectHome=read-only the value is correct as expected (~204 GB):

% curl -s http://localhost:9100/metrics | egrep '^node_exporter_build_info|^node_filesystem_avail_bytes.*home'
node_exporter_build_info{branch="HEAD",goarch="amd64",goos="linux",goversion="go1.20.6",revision="4a1b77600c1873a8233f3ffb55afcedbb63b8d84",tags="netgo osusergo static_build",version="1.6.1"} 1
node_filesystem_avail_bytes{device="/dev/mapper/san-data0",fstype="ext4",mountpoint="/home"} 2.0448495616e+11

(Note: the df output two comments above is from a different system (Fedora 37), while the node_exporter output is from the elderly CentOS system on which I first encountered the problem. I don't think it matters; the behavior is the same.)