openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Lack of inotify watches causing checksum errors #13837

Open Typhonragewind opened 2 years ago

Typhonragewind commented 2 years ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu VM in Proxmox host (6.4-15) |
| Distribution Version | 20.04.04 LTS |
| Kernel Version | 5.4.0-125-generic |
| Architecture | x64 |
| OpenZFS Version | 2.1.5-1~20.04.york0 |

System info:
Motherboard: Supermicro X11SCH-F
CPU: Intel® Xeon® E-2136
RAM: 32GB Crucial ECC Unbuffered 2666MHz
HBA card: AOC-S2308L-L8i (IT mode)
HDDs: WD RED 4TB (x6) in RaidZ2 configuration


Describe the problem you're observing

When processes with heavy inotify usage exhaust the system's inotify watches, ZFS reports repairable checksum errors on the disks in the pool during normal usage. Raising the system's inotify watch limit stops this behaviour. See https://www.reddit.com/r/homelab/comments/x4y5o2/zfs_checksums_a_tale_of_arcane_errors_and_how_to/ for the troubleshooting steps taken.

Describe how to reproduce the problem

Set the system's maximum number of inotify watches to a small value and have some process exhaust them. Operations on files residing in the ZFS pool will then likely generate the errors.
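To confirm that watches really are close to exhaustion while reproducing, something like the following Python sketch may help. It is not part of the original report. It assumes a Linux system where the per-user limit is exposed at /proc/sys/fs/inotify/max_user_watches and where each inotify file descriptor lists its watches as `inotify wd:` lines in /proc/<pid>/fdinfo. The function names and the `proc_root` parameter are hypothetical, chosen for illustration.

```python
import os


def read_inotify_limit(proc_root="/proc"):
    """Return the fs.inotify.max_user_watches value as an int."""
    path = os.path.join(proc_root, "sys", "fs", "inotify", "max_user_watches")
    with open(path) as f:
        return int(f.read().strip())


def count_inotify_watches(proc_root="/proc"):
    """Count 'inotify wd:' lines in every readable <pid>/fdinfo/<fd> file.

    Processes whose fdinfo cannot be read (permissions, or the process
    exited in the meantime) are skipped, so an unprivileged run may
    undercount.
    """
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        fdinfo_dir = os.path.join(proc_root, entry, "fdinfo")
        try:
            fds = os.listdir(fdinfo_dir)
        except OSError:
            continue
        for fd in fds:
            try:
                with open(os.path.join(fdinfo_dir, fd)) as f:
                    total += sum(
                        1 for line in f if line.startswith("inotify wd:")
                    )
            except OSError:
                continue
    return total
```

Calling `read_inotify_limit()` and `count_inotify_watches()` as root and comparing the two numbers gives a rough picture. Note that the limit is enforced per user, while this count is aggregated over all processes, so the comparison is only an approximation.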

Include any warning/errors/backtraces from the system logs


Sep 1 09:48:26 docker-prod systemd[1]: zfs-mount.service: Failed to add control inotify watch descriptor for control group /system.slice/zfs-mount.service: No space left on device

Sep 1 09:48:26 docker-prod systemd[1]: zfs-zed.service: Failed to add control inotify watch descriptor for control group /system.slice/zfs-zed.service: No space left on device

Sep 1 09:48:26 docker-prod systemd[1]: zfs-load-module.service: Failed to add control inotify watch descriptor for control group /system.slice/zfs-load-module.service: No space left on device
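The "No space left on device" text in these messages is the ENOSPC error that inotify returns when the watch limit is reached, not a full disk. As a rough aid for spotting affected units in a larger log, here is a minimal Python sketch that extracts the unit names from lines shaped like the ones above. The regular expression is based only on these three samples, and the function name is hypothetical.

```python
import re

# Matches systemd messages of the form seen in the log excerpt above.
INOTIFY_FAILURE = re.compile(
    r"(?P<unit>\S+\.service): Failed to add control inotify watch descriptor "
    r"for control group (?P<cgroup>\S+): No space left on device"
)


def units_hit_by_inotify_exhaustion(log_lines):
    """Return unit names that logged an inotify watch failure, in first-seen order."""
    units = []
    for line in log_lines:
        match = INOTIFY_FAILURE.search(line)
        if match and match.group("unit") not in units:
            units.append(match.group("unit"))
    return units
```

Passing the lines of a syslog or journal export to `units_hit_by_inotify_exhaustion` returns each affected unit once, which makes it easier to see whether the ZFS services were the only ones hit.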

Rudd-O commented 2 years ago

This is scary.

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.