prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Stuck NFS mount unexpected behavior (+ Proposal) #1353

Open mtknapp opened 5 years ago

mtknapp commented 5 years ago

Host operating system: output of uname -a

Linux hostname 4.14.97 #1 SMP Fri Feb 1 14:23:07 EST 2019 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

0.17.0

node_exporter command line flags

--no-collector.arp \
    --no-collector.cpu \
    --no-collector.diskstats \
    --no-collector.edac \
    --no-collector.filefd \
    --collector.filesystem \
    --no-collector.hwmon \
    --no-collector.interrupts \
    --no-collector.loadavg \
    --no-collector.mdadm \
    --no-collector.meminfo \
    --no-collector.mountstats \
    --no-collector.netdev \
    --no-collector.netstat \
    --no-collector.sockstat \
    --no-collector.stat \
    --no-collector.systemd \
    --no-collector.tcpstat \
    --no-collector.textfile \
    --no-collector.uname \
    --no-collector.vmstat \
    --no-collector.zfs \
    --no-collector.bcache \
    --no-collector.conntrack \
    --no-collector.infiniband \
    --no-collector.ipvs \
    --no-collector.wifi \
    --no-collector.xfs \
    --no-collector.nfs \
    --no-collector.nfsd \
    --collector.textfile.directory /var/lib/node_exporter/textfile_collector \
    --collector.systemd.unit-blacklist=".*\\.(device|mount|swap|scope|slice)$" \
    --collector.filesystem.ignored-mount-points="^/(sys|proc|dev)($|/)" \
    --collector.diskstats.ignored-devices="^(sr|ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p|sr)\\d+$" \
    --collector.filesystem.ignored-fs-types="^beegfs_nodev|beegfs|binfmt_misc|cgroup|devpts|fusectl|mqueue|proc|pstore|(auto|debug|devtmp|hugetlb|rpc_pipe|sys|tmp|trace)fs$" \
    --collector.vmstat.fields="^(oom_kill|pgpg|pswp|pg.*fault|pgsteal|pgscan|hugetlb).*" \
    --log.level="debug"

Are you running node_exporter in Docker?

Nope

What did you do that produced an error?

When an NFS mount goes stale, the stat call usually doesn't return until the mount becomes responsive again. We had an NFS server issue that caused several mounts to go stale, but this time the stat call returned an Input/Output error after ~3 minutes. This meant that every ~3 minutes it would return from the call, mark the mount as unstuck, try to query it again on the next scrape, and hang again. Given that we have a 1-minute scrape interval, roughly one in three scrapes was failing or timing out. This lasted for a while but eventually stopped. Unfortunately I haven't been able to reproduce the situation yet, so it's hard to test. I will update if I find a good way to recreate it.
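
For context, below is a minimal sketch of the stuck-mount handling pattern as I understand it (simplified, with approximate names and an illustrative timeout; it is not the actual node_exporter source). The statfs call races against a watcher goroutine that marks the mount as stuck on timeout, and the mount is removed from the stuck set as soon as the call returns, whether or not it returned an error.

```go
// Simplified sketch of the stuck-mount handling pattern as I understand it
// (names and timeout are approximate, not the actual node_exporter source).
package main

import (
	"fmt"
	"sync"
	"syscall"
	"time"
)

var (
	stuckMounts    = map[string]struct{}{}
	stuckMountsMtx sync.Mutex
	mountTimeout   = 30 * time.Second // illustrative value, not the collector's real default
)

// stuckMountWatcher marks the mount point as stuck if the statfs call has
// not returned before the timeout expires.
func stuckMountWatcher(mountPoint string, done chan struct{}) {
	select {
	case <-done:
		// statfs returned in time; nothing to do.
	case <-time.After(mountTimeout):
		stuckMountsMtx.Lock()
		select {
		case <-done:
			// Returned just after the timeout fired; don't mark it stuck.
		default:
			stuckMounts[mountPoint] = struct{}{}
		}
		stuckMountsMtx.Unlock()
	}
}

// statMount shows the behavior described above: the mount is removed from
// the stuck set whenever Statfs returns, even when it returns an error such
// as EIO from a stale NFS handle.
func statMount(mountPoint string) {
	done := make(chan struct{})
	go stuckMountWatcher(mountPoint, done)

	buf := new(syscall.Statfs_t)
	err := syscall.Statfs(mountPoint, buf)

	stuckMountsMtx.Lock()
	close(done)
	delete(stuckMounts, mountPoint) // unstuck regardless of err
	stuckMountsMtx.Unlock()

	if err != nil {
		fmt.Printf("statfs %s: %v\n", mountPoint, err)
	}
}

func main() {
	statMount("/")
}
```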

What did you expect to see?

The current implementation doesn't take the returned error into account when marking the mount as unstuck; it always does so as soon as the stat call returns. I propose that the mount should only be considered unstuck if the call returns without an error; at this point it is already reporting a device error anyway. I played around with an implementation where, if the call returns an error, a new goroutine is started that repeatedly checks the mount until a call succeeds; only then is the mount marked as "unstuck" and monitoring resumed. If this seems like a reasonable thing to do, I will open it as a PR.
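
To make the proposal concrete, here is a rough sketch of the change I have in mind, continuing the sketch above (same package, imports, and helper names; illustrative only, not the exact patch I would submit): the stuck flag is cleared only on a clean return, and a mount that returns an error is handed to a background poller that retries until statfs succeeds.

```go
// statMountProposed clears the stuck flag only when Statfs returns without
// an error; otherwise the mount stays (or becomes) stuck and a background
// poller decides when monitoring may resume.
func statMountProposed(mountPoint string) {
	done := make(chan struct{})
	go stuckMountWatcher(mountPoint, done)

	buf := new(syscall.Statfs_t)
	err := syscall.Statfs(mountPoint, buf)

	stuckMountsMtx.Lock()
	close(done)
	if err == nil {
		// Only a successful call unsticks the mount.
		delete(stuckMounts, mountPoint)
	} else if _, already := stuckMounts[mountPoint]; !already {
		// The call returned, but with an error (e.g. EIO from a stale NFS
		// handle): keep it out of the scrape path and poll until it recovers.
		stuckMounts[mountPoint] = struct{}{}
		go pollUntilHealthy(mountPoint)
	}
	stuckMountsMtx.Unlock()
}

// pollUntilHealthy retries statfs until it succeeds, then removes the mount
// from the stuck set so the collector resumes monitoring it.
func pollUntilHealthy(mountPoint string) {
	for {
		time.Sleep(mountTimeout)
		buf := new(syscall.Statfs_t)
		if err := syscall.Statfs(mountPoint, buf); err == nil {
			stuckMountsMtx.Lock()
			delete(stuckMounts, mountPoint)
			stuckMountsMtx.Unlock()
			return
		}
	}
}
```

The poller itself can still block if the mount becomes fully unresponsive again, but since the mount is already marked as stuck at that point, scrapes would no longer be affected.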

If anyone has suggestions on how to reproduce the error, I'm all ears. So far I've tried shutting down the server, both gracefully and hard; setting up iptables rules to drop incoming/outgoing packets on both the server and the client; and stopping portmap. No luck so far :(

owensuls commented 4 years ago

We are also seeing this type of problem very frequently.

discordianfish commented 4 years ago

I'm actually not sure we should change anything here. The problem is that the NFS mount point is stuck and the node-exporter provides enough metrics to monitor for that. @SuperQ wdyt?