Open gentoo-root opened 8 months ago
Did you not create an infinite loop?
It might seem so, but not really: I expect mergerfs to continue serving files from the original /tmp/overlay/lower, i.e. the content that was seen there before it was overlapped by the overlayfs mount. It's common for overlayfs to be mounted over its own lower directory (although it doesn't usually involve mergerfs), and it's supposed to work this way.
Compare with bindfs, which continues to work as expected in this scenario, i.e. it can still access files underneath the new overlayfs mountpoint:
# mkdir -p /tmp/overlay/{bound,lower,upper,work}
# touch /tmp/overlay/lower/test
# bindfs /tmp/overlay/lower /tmp/overlay/bound
# mount -t overlay -o lowerdir=/tmp/overlay/bound,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work none /tmp/overlay/lower
# ls /tmp/overlay/lower
test
Or with the regular bind:
# mkdir -p /tmp/overlay/{bound,lower,upper,work}
# touch /tmp/overlay/lower/test
# mount --bind /tmp/overlay/lower /tmp/overlay/bound
# mount -t overlay -o lowerdir=/tmp/overlay/bound,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work none /tmp/overlay/lower
# ls /tmp/overlay/lower
test
Both work.
That's not how mergerfs works. It works on paths. Not file descriptors. That was an explicit design decision at the very beginning.
0) It was how some other union filesystems worked, so there was an expectation of similar functionality.
1) mergerfs is expected to be used with filesystems that may be flaky. Holding an open file descriptor on a broken filesystem can be problematic.
2) If a filesystem does go sideways, you can force umount it without mergerfs having to be involved.
3) You can mount and umount filesystems from underneath mergerfs without having to orchestrate anything.
4) When mergerfs was written, not all targeted systems had the *at syscalls.
Whether all of those still matter or are worth the tradeoffs can certainly be debated, but the fact is that changing that behavior would require a complete rewrite of all interaction with the branches, and would change what is possible with the product, the runtime API, and how people interact with failure conditions. In other words, it won't be changed in mergerfs v2.x.
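To make the path-versus-handle distinction concrete, here is a small unprivileged sketch. It is only an analogy: the shell's working directory stands in for a held reference (like an open file descriptor), and renaming the directory stands in for mounting something over it. The directory and file names are made up for the illustration and none of this is mergerfs code.

```shell
#!/bin/sh
# Analogy only: a process's working directory is a held reference to a
# directory, so it keeps pointing at the original even after the path changes.
set -eu
demo=$(mktemp -d)
mkdir "$demo/lower"
touch "$demo/lower/test"

cd "$demo/lower"                 # hold a reference to the original directory

# Replace what the path resolves to (a stand-in for mounting over it).
mv "$demo/lower" "$demo/lower.orig"
mkdir "$demo/lower"
touch "$demo/lower/shadow"

by_handle=$(ls .)                # resolved through the held reference
by_path=$(ls "$demo/lower")      # resolved by walking the path again

echo "via held reference: $by_handle"   # prints: test
echo "via path:           $by_path"     # prints: shadow

cd /
rm -rf "$demo"
```

A filesystem that keeps handles on its branches behaves like the first lookup, while one that re-resolves paths on every request behaves like the second, which is why a mount placed over a branch path becomes visible to it.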
Thanks for the explanation. This behavior makes sense if it works on paths by design (although to me it's quite unusual, I see it has its usecases).
In any case, though, the steps above lead to a hangup in the kernel (in a syscall), and they can be reproduced without root permissions (using an unprivileged mount namespace). This hangup prevents normal unmounting, suspend and shutdown/reboot. Maybe it makes sense to detect the deadlock and bail out with an error to prevent this lockup?
To be honest, it's been a long while since I last used FUSE filesystems. Back in the day, you had to launch a SUID executable to mount a FUSE filesystem. That meant that unprivileged users couldn't install their own FUSE handlers, and then I'd say it would definitely make sense to fix any userspace deadlocks in mergerfs that block the kernel. Today I see Arch Linux installs /dev/fuse with mode 666, and the mergerfs/bindfs executables are not SUID. With /dev/fuse having 666 permissions, any user can basically start a malicious FUSE driver that blocks while handling some syscalls, leading to a similar result, am I right?
How do you prevent a deadlock like this? You've created an indirect loop. I'd have to crawl all mounts before every request, knowing the syntax of every filesystem or setup that this could happen with, and try to determine a loop. I couldn't make a request on the path because that would loop back around to mergerfs.
If you want to kill the fuse connection you go to /sys/fs/fuse/connections/X/ and echo something into abort. That should take out mergerfs.
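A rough sketch of how that could be scripted follows. It assumes that the X in the path equals the minor device number shown in the third field of /proc/self/mountinfo for the mount in question. That is my understanding of the FUSE control filesystem, so verify it on your system first. The function names and the MOUNTINFO and FUSE_CTL variables are made up for this example, and mountpoints containing spaces are not handled.

```shell
#!/bin/sh
# Sketch: abort the FUSE connection behind a given mountpoint.
# Assumption: the directory name under /sys/fs/fuse/connections equals the
# minor number in the third field of /proc/self/mountinfo.
MOUNTINFO=${MOUNTINFO:-/proc/self/mountinfo}
FUSE_CTL=${FUSE_CTL:-/sys/fs/fuse/connections}

# Print the connection ID for a mountpoint (last matching mountinfo entry).
fuse_conn_id() {
    awk -v mp="$1" '$5 == mp { split($3, dev, ":"); id = dev[2] }
                    END { if (id != "") print id }' "$MOUNTINFO"
}

# Write to the connection's "abort" file; this normally requires being the
# owner of the mount or root.
fuse_abort() {
    id=$(fuse_conn_id "$1")
    if [ -z "$id" ]; then
        echo "no mount found at $1" >&2
        return 1
    fi
    echo 1 > "$FUSE_CTL/$id/abort"
}
```

With the functions loaded, `fuse_abort /mnt/pool` would abort the connection mounted at the (hypothetical) mountpoint /mnt/pool. Reading mountinfo does not touch the hung filesystem itself, which matters when any access to the mountpoint would block.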
AFAIK you still need SUID app. fusermount.
I'd have to crawl all mounts before every request, knowing the syntax of every filesystem
I was thinking more of detecting recursive locks, but on second thought, the detection would need to happen on the overlayfs side (list files in overlayfs /var/tmp/lower -> list files in overlayfs lower /var/tmp/bound -> list files in mergerfs branch /var/tmp/lower -> back again to overlayfs, here the recursive locking happens).
If you want to kill the fuse connection you go to /sys/fs/fuse/connections/X/ and echo something into abort. That should take out mergerfs.
That indeed works! I didn't know about it, thanks for the tip.
AFAIK you still need SUID app. fusermount.
The filesystems themselves are not SUID, which means any user can create and start one. But abort via sysfs is a good enough mitigation (except when you are rebooting, and it's too late to use the shell).
Correct. Anyone can create and start one. That's one of the points of the technology. This issue isn't unique to mergerfs. Any fuse filesystem can block. It doesn't ultimately have anything to do with this loop. If I put sleep(~0) into every thread the same would happen. A broken filesystem or device that causes a mergerfs thread to block in a syscall would do the same. It is a fundamental risk with how the whole system is designed. It's not like I can put timeouts on syscalls. The best I could do is create a watchdog system to detect stuck threads, but then what? I can't do much about it.
Describe the bug
When an overlayfs is mounted on top of one of the branches of mergerfs, an attempt to access it leads to an uninterruptible hang of one of the fuse.read threads of mergerfs in the D state.
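For anyone wanting to confirm the same symptom, a minimal sketch for listing the threads of a process that are in the D state is shown below. It reads the per-thread status files under /proc. The function name and the PROC_ROOT variable are invented for this example.

```shell
#!/bin/sh
# Sketch: print "<tid> <name>" for every thread of a process that is in
# uninterruptible sleep (state "D"), read from /proc/<pid>/task/*/status.
PROC_ROOT=${PROC_ROOT:-/proc}

d_state_threads() {
    for status in "$PROC_ROOT/$1"/task/*/status; do
        [ -r "$status" ] || continue
        awk '$1 == "Name:"  { name = $2 }
             $1 == "State:" { state = $2 }
             $1 == "Pid:"   { tid = $2 }
             END { if (state == "D") print tid, name }' "$status"
    done
}
```

Calling `d_state_threads` with the PID of the mergerfs process should print one line per stuck thread and nothing when no thread is blocked.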
To Reproduce
Expected behavior
ls should list the files in /tmp/overlay/lower. If mergerfs is replaced by bindfs, it works as expected, and files in the overlayfs can be listed and created.
System information:
Linux *** 6.7.8-arch1-1 #2 SMP PREEMPT_DYNAMIC Wed, 06 Mar 2024 16:30:59 +0000 x86_64 GNU/Linux
mergerfs v2.40.2
df -h | grep /tmp
strace -fvTtt -s 256 -o /tmp/app.strace.txt <cmd>: app.strace.txt
strace -fvTtt -s 256 -o /tmp/mergerfs.strace.txt <cmd>: mergerfs.strace.txt
Additional context
An attempt to suspend fails with the following in dmesg:
Hung task detector also prints these stacktraces after a while.