node_filesystem_{size,avail}_bytes report wrong values for ZFS filesystems

Baughn commented 5 years ago

Host operating system: output of `uname -a`

Linux backup-target.atelieraphelion.com 5.0.0-29-generic #31~18.04.1-Ubuntu SMP Thu Sep 12 18:29:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of `node_exporter --version`

/nix/store/svm3ypaq5dyznfxr7lhqvk6ymyy0cs9n-node_exporter-0.17.0-bin/bin/node_exporter --version
node_exporter, version (branch: , revision: )
build user:
build date:
go version: go1.12.7

node_exporter command line flags

/nix/store/svm3ypaq5dyznfxr7lhqvk6ymyy0cs9n-node_exporter-0.17.0-bin/bin/node_exporter --web.listen-address 0.0.0.0:9100

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

Created a disk-space alert using the expression `node_filesystem_avail_bytes{fstype=~"ext4|zfs|xfs"} / node_filesystem_size_bytes < 0.1`.

What did you expect to see?

The alert should fire when available space falls below 10%, since _avail_bytes would then be less than 10% of _size_bytes.

What did you see instead?

The alert never fires for ZFS filesystems, because _avail_bytes and _size_bytes are always equal.
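
To make the failure concrete: the alert compares avail/size against 0.1, so when both metrics carry the same value the ratio is 1 and the condition can never hold. Below is a minimal Go sketch of that arithmetic. The `belowThreshold` helper and the byte counts are made up for illustration; the real evaluation happens in Prometheus, not in Go.

```go
package main

import "fmt"

// belowThreshold mirrors the alert condition avail/size < threshold.
func belowThreshold(avail, size, threshold float64) bool {
	return avail/size < threshold
}

func main() {
	const size = 4e12 // made-up 4 TB filesystem

	// avail equals size, as reported for ZFS: the ratio is 1.
	fmt.Println(belowThreshold(size, size, 0.1))

	// 5% of the space left: the ratio is 0.05.
	fmt.Println(belowThreshold(0.05*size, size, 0.1))
}
```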

knweiss commented 4 years ago

I think this is basically a duplicate of the closed (FreeBSD) issue #1287.

peterjeremy commented 1 year ago

Having just been bitten by this bug, I found that the problem is a bit more subtle than the initial description implies: node_filesystem_avail_bytes and node_filesystem_size_bytes are calculated correctly by the node_exporter code, but the calculation uses stale (cached) values from the kernel.

In more detail, the unix.Getfsstat call specifies MNT_NOWAIT and getfsstat(2) states:

> Normally mode should be specified as MNT_WAIT. If mode is set to MNT_NOWAIT, getfsstat() will return the information it has available without requesting an update from each file system.

Having studied changes in node_filesystem_avail_bytes over time and rummaged around in the FreeBSD kernel sources, I believe the cached data is basically never updated under normal operation. This means that, unless something else (like df) calls getfsstat with MNT_WAIT or calls statfs(2), the reported data reflects the state from when the filesystem was created or mounted, which makes it useless for Prometheus alerting.

As for fixing the issue:

The most obvious "fix" is to use MNT_WAIT instead of MNT_NOWAIT, but this runs the risk of blocking indefinitely if (e.g.) an NFS server becomes non-responsive.
A reasonable workaround is probably to stick with using MNT_NOWAIT but explicitly call statfs(2) on each non-NFS (or other "unsafe") filesystem.
I have raised https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=273094 suggesting that the current behaviour is a POLA violation.
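
The workaround in the second point could look roughly like the Go sketch below. This is not node_exporter code: the `mount` type, the `unsafeTypes` set, and the `refresh` and `statfsBytes` helpers are hypothetical names used for illustration, and the list of mounts is assumed to come from an earlier getfsstat call with MNT_NOWAIT. The sketch uses `syscall.Statfs` from the standard library to ask the kernel for current numbers on every filesystem that is not marked unsafe.

```go
package main

import (
	"fmt"
	"syscall"
)

// mount is a hypothetical record for one mounted filesystem, holding the
// possibly stale numbers taken from the cached (MNT_NOWAIT) listing.
type mount struct {
	Path   string
	FSType string
	Size   uint64 // total bytes
	Avail  uint64 // bytes available to unprivileged users
}

// unsafeTypes lists filesystem types on which statfs(2) may block.
// The contents are an assumption; a real list would need more care.
var unsafeTypes = map[string]bool{"nfs": true}

// statfsBytes calls statfs(2) on path and returns total and available bytes.
// Field types differ between platforms, hence the explicit conversions.
func statfsBytes(path string) (size, avail uint64, err error) {
	var st syscall.Statfs_t
	if err = syscall.Statfs(path, &st); err != nil {
		return 0, 0, err
	}
	bsize := uint64(st.Bsize)
	return uint64(st.Blocks) * bsize, uint64(st.Bavail) * bsize, nil
}

// refresh returns a copy of mounts in which every entry whose type is not in
// unsafeTypes has been updated via stat. Entries that are skipped, or for
// which stat fails, keep their cached values.
func refresh(mounts []mount, stat func(string) (uint64, uint64, error)) []mount {
	out := make([]mount, len(mounts))
	copy(out, mounts)
	for i := range out {
		if unsafeTypes[out[i].FSType] {
			continue
		}
		size, avail, err := stat(out[i].Path)
		if err != nil {
			continue
		}
		out[i].Size, out[i].Avail = size, avail
	}
	return out
}

func main() {
	// Made-up cached entry; the type label is only a placeholder here.
	cached := []mount{{Path: "/", FSType: "zfs", Size: 100, Avail: 100}}
	for _, m := range refresh(cached, statfsBytes) {
		fmt.Printf("%s (%s): %d of %d bytes available\n", m.Path, m.FSType, m.Avail, m.Size)
	}
}
```

Passing the stat function as a parameter keeps the skip logic testable without real mounts. Whether a per-filesystem statfs call is cheap enough on every scrape is a separate question this sketch does not answer.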

discordianfish commented 1 year ago

> The most obvious "fix" is to use MNT_WAIT instead of MNT_NOWAIT, but this runs the risk of blocking indefinitely if (e.g.) an NFS server becomes non-responsive.

Feature parity with Linux :). I'd say we go with this and consider using the stale mount handling implemented for Linux in #997.

discordianfish commented 1 year ago

Now I'm confused though. @Baughn seems to run into this on Linux, right? Or is there a similar bug in both?
