zfs rollback locks kernel mount table for several minutes

stuartthebruce commented 1 year ago

System information

Type	Version/Name
Distribution Name	Rocky Linux
Distribution Version	9.2
Kernel Version	5.14.0-284.11.1.el9_2
Architecture	x86_64
OpenZFS Version	2.1.12

Describe the problem you're observing

zfs rollback blocks access to /proc/self/mountstats for several minutes, which prevents other standard programs from running. Most annoyingly, it blocks new ssh sessions needed to log in and debug, as well as some NFS server services, which in turn block NFS clients.
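For anyone trying to confirm the same symptom, a minimal sketch of a probe that times a read of /proc/self/mountstats and gives up after a timeout might look like the following. The function name `probe_read` and the 5 second default are illustrative assumptions, not part of any ZFS tooling. If the blocked read is uninterruptible, the helper thread may stay stuck until the lock is released, so treat this as a diagnostic aid only.

```python
import threading
import time


def probe_read(path="/proc/self/mountstats", timeout=5.0):
    """Return the seconds taken to read `path`, or None if the read
    did not finish within `timeout` seconds.

    `probe_read` is a hypothetical helper for illustration only.
    """
    result = {}

    def reader():
        start = time.monotonic()
        try:
            with open(path, "rb") as f:
                f.read()
            result["elapsed"] = time.monotonic() - start
        except OSError as exc:
            # Hand the error back to the caller instead of losing it in the thread.
            result["error"] = exc

    # Daemon thread so a read that is still blocked does not keep the
    # interpreter waiting on a normal thread join at shutdown.
    t = threading.Thread(target=reader, daemon=True)
    t.start()
    t.join(timeout)

    if t.is_alive():
        return None  # still blocked after `timeout` seconds
    if "error" in result:
        raise result["error"]
    return result["elapsed"]
```

Calling `probe_read()` in a loop while the rollback is running, and logging each `None` result with a timestamp, would give a rough measure of how long the mount table stays inaccessible.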

Describe how to reproduce the problem

Run syncoid to transfer a large dataset and mistakenly leave the filesystem mounted so that subsequent updates will trigger a zfs rollback.

Include any warning/errors/backtraces from the system logs

Small test cases do not cause a noticeable problem, but incremental updates to large (multi-TB) filesystems show the problem if a rollback is needed. After manually killing the zfs rollback process, it takes several minutes for the process to exit and release the lock on the mount table. An example stack trace from a hung process waiting for the lock is available in #13858. While waiting for the zfs rollback process to finish exiting, /bin/top shows zfs rollback accumulating large amounts of CPU time, and arc_prune and arc_evict are often busy as well (but this is a production system, so they might be pruning and evicting other ARC data). However, the zpool is otherwise healthy, and other zfs send/recv processes and existing NFSD kernel threads are unaffected.

Is this likely a bug, or a design feature for rolling back large datasets?

Note that running zfs destroy on the dataset being rolled back to does not shorten the time it takes the zfs rollback process to exit, even though the destroy itself runs promptly.

stuartthebruce commented 1 year ago

I have confirmed that setting canmount = off on the target system prevents any problems running syncoid on large filesystems. Presumably for a mounted filesystem zfs rollback is holding the kernel mount table lock for a long time (several minutes) while waiting for ARC cleanup before umount finishes. If so, how about performing an initial ARC flush before acquiring the lock? Or otherwise significantly reducing the lock time by a few orders of magnitude?
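As a rough sketch of the workaround described above, the property change could be scripted as follows. The dataset name `tank/backup/data` and the helper names are hypothetical. The only commands used are `zfs set canmount=off` and `zfs get -H -o value canmount`. The helper defaults to a dry run that just returns the command line, so nothing is changed unless `dry_run=False` is passed on a host with ZFS installed.

```python
import subprocess


def canmount_off_cmd(dataset):
    """Build the argv that sets canmount=off on `dataset`."""
    return ["zfs", "set", "canmount=off", dataset]


def canmount_get_cmd(dataset):
    """Build the argv that prints only the current canmount value."""
    return ["zfs", "get", "-H", "-o", "value", "canmount", dataset]


def disable_mount(dataset, dry_run=True):
    """Set canmount=off on the syncoid target dataset.

    With dry_run=True (the default) no command is executed and the
    command line is returned as a string. Otherwise the property is
    set and the value reported by `zfs get` is returned.
    """
    cmd = canmount_off_cmd(dataset)
    if dry_run:
        return " ".join(cmd)
    subprocess.run(cmd, check=True)
    out = subprocess.run(
        canmount_get_cmd(dataset),
        check=True,
        capture_output=True,
        text=True,
    )
    return out.stdout.strip()


if __name__ == "__main__":
    # Hypothetical dataset name; replace with the real replication target.
    print(disable_mount("tank/backup/data"))
```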

stuartthebruce commented 1 year ago

FYI, I suspect the reason that there was significant arc_prune and arc_evict activity when the sanoid target filesystems were mounted is that the target server is periodically running https://github.com/zevv/duc to index all of its files.

openzfs / zfs