trapexit / mergerfs

a featureful union filesystem
http://spawn.link

Hangs when overlayfs is mounted on top of mergerfs's branch #1319

Open gentoo-root opened 8 months ago

gentoo-root commented 8 months ago

Describe the bug

When an overlayfs is mounted on top of one of mergerfs's branches, any attempt to access it leads to an uninterruptible hang: one of mergerfs's fuse.read threads gets stuck in the D state.

To Reproduce

mkdir -p /tmp/overlay/{bound,lower,upper,work}
mergerfs /tmp/overlay/lower /tmp/overlay/bound
mount -t overlay -o lowerdir=/tmp/overlay/bound,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work none /tmp/overlay/lower
ls -l /tmp/overlay/lower

Expected behavior

ls should list the files in /tmp/overlay/lower. If mergerfs is replaced by bindfs, it works as expected: files in the overlayfs can be listed and created.

System information:

Additional context

An attempt to suspend fails with the following in dmesg:

[  +0,000104] task:fuse.read       state:D stack:0     pid:5470  tgid:5464  ppid:1      flags:0x00000006
[  +0,000011] Call Trace:
[  +0,000003]  <TASK>
[  +0,000007]  __schedule+0x3e7/0x1410
[  +0,000013]  ? generic_fillattr+0x49/0x120
[  +0,000011]  ? shmem_getattr+0x7b/0xe0
[  +0,000011]  schedule+0x32/0xd0
[  +0,000004]  schedule_preempt_disabled+0x15/0x30
[  +0,000005]  rwsem_down_read_slowpath+0x2aa/0x540
[  +0,000008]  down_read_killable+0x48/0xd0
[  +0,000006]  iterate_dir+0x5e/0x150
[  +0,000006]  __x64_sys_getdents64+0x88/0x130
[  +0,000006]  ? __pfx_filldir64+0x10/0x10
[  +0,000006]  do_syscall_64+0x61/0xe0
[  +0,000006]  ? syscall_exit_to_user_mode+0x2b/0x40
[  +0,000006]  ? do_syscall_64+0x70/0xe0
[  +0,000004]  ? __count_memcg_events+0x42/0x90
[  +0,000009]  ? count_memcg_events.constprop.0+0x1a/0x30
[  +0,000005]  ? handle_mm_fault+0xa2/0x360
[  +0,000008]  ? do_user_addr_fault+0x304/0x670
[  +0,000010]  ? exc_page_fault+0x7f/0x180
[  +0,000005]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  +0,000010] RIP: 0033:0x7bc6fbef2eb7
[  +0,000067] RSP: 002b:00007bc6f93f72b8 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
[  +0,000006] RAX: ffffffffffffffda RBX: 00007bc6dc103010 RCX: 00007bc6fbef2eb7
[  +0,000003] RDX: 0000000000008000 RSI: 00007bc6dc103040 RDI: 0000000000000006
[  +0,000003] RBP: 00007bc6dc103014 R08: 0000000000000000 R09: 0000000000000001
[  +0,000003] R10: 0000000000000004 R11: 0000000000000293 R12: 00007bc6dc103040
[  +0,000003] R13: ffffffffffff7728 R14: 0000000000000000 R15: 00007bc6e4000bd0
[  +0,000007]  </TASK>
[  +0,000002] task:ls              state:D stack:0     pid:5504  tgid:5504  ppid:4218   flags:0x00004006
[  +0,000007] Call Trace:
[  +0,000002]  <TASK>
[  +0,000002]  __schedule+0x3e7/0x1410
[  +0,000004]  ? autoremove_wake_function+0x15/0x70
[  +0,000009]  ? __wake_up_common+0x75/0xa0
[  +0,000009]  schedule+0x32/0xd0
[  +0,000008]  request_wait_answer+0x1ba/0x2b0 [fuse 723f85863d26a4b806cc14560d4aa7b2f5ae27a6]
[  +0,000027]  ? __pfx_autoremove_wake_function+0x10/0x10
[  +0,000008]  fuse_simple_request+0x17e/0x2c0 [fuse 723f85863d26a4b806cc14560d4aa7b2f5ae27a6]
[  +0,000025]  fuse_readdir_uncached+0x196/0x840 [fuse 723f85863d26a4b806cc14560d4aa7b2f5ae27a6]
[  +0,000030]  ? finish_wait+0x3c/0xa0
[  +0,000007]  ? request_wait_answer+0xf4/0x2b0 [fuse 723f85863d26a4b806cc14560d4aa7b2f5ae27a6]
[  +0,000024]  ? kmem_cache_free+0x22/0x380
[  +0,000011]  fuse_readdir+0x15c/0x870 [fuse 723f85863d26a4b806cc14560d4aa7b2f5ae27a6]
[  +0,000027]  ? fuse_open_common+0x1ce/0x270 [fuse 723f85863d26a4b806cc14560d4aa7b2f5ae27a6]
[  +0,000028]  ? __pfx_fuse_dir_open+0x10/0x10 [fuse 723f85863d26a4b806cc14560d4aa7b2f5ae27a6]
[  +0,000024]  iterate_dir+0x90/0x150
[  +0,000009]  ovl_dir_read_merged+0x1e1/0x2c0 [overlay dc64d64290c22acd7a62155226c0519c75834871]
[  +0,000036]  ? __pfx_ovl_fill_merge+0x10/0x10 [overlay dc64d64290c22acd7a62155226c0519c75834871]
[  +0,000034]  ovl_iterate+0x20b/0x310 [overlay dc64d64290c22acd7a62155226c0519c75834871]
[  +0,000033]  ? __pfx_ovl_iterate+0x10/0x10 [overlay dc64d64290c22acd7a62155226c0519c75834871]
[  +0,000031]  wrap_directory_iterator+0x48/0x70
[  +0,000006]  iterate_dir+0x90/0x150
[  +0,000005]  __x64_sys_getdents64+0x88/0x130
[  +0,000005]  ? __pfx_filldir64+0x10/0x10
[  +0,000006]  do_syscall_64+0x61/0xe0
[  +0,000005]  ? exc_page_fault+0x7f/0x180
[  +0,000005]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  +0,000008] RIP: 0033:0x70c46b94ceb7
[  +0,000014] RSP: 002b:00007ffd8e6ce458 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
[  +0,000004] RAX: ffffffffffffffda RBX: 000061680bf6f6a0 RCX: 000070c46b94ceb7
[  +0,000003] RDX: 0000000000008000 RSI: 000061680bf6f6d0 RDI: 0000000000000003
[  +0,000002] RBP: 000061680bf6f6a4 R08: 0000000000000000 R09: 0000000000000001
[  +0,000003] R10: 0000000000000004 R11: 0000000000000293 R12: 000061680bf6f6d0
[  +0,000002] R13: ffffffffffffff88 R14: 0000000000000000 R15: 0000000000000006
[  +0,000005]  </TASK>

The hung task detector also prints these stack traces after a while.

trapexit commented 8 months ago

Did you not create an infinite loop?

gentoo-root commented 8 months ago

It might seem so, but not really: I expect mergerfs to continue serving files from the original /tmp/overlay/lower, i.e. the content that was visible there before the overlayfs mount covered it. It's common for overlayfs to be mounted over its own lower directory (although that doesn't usually involve mergerfs), and it's supposed to work this way.

Compare with bindfs, which continues to work as expected in this scenario, i.e. it can still access files underneath the new overlayfs mountpoint:

# mkdir -p /tmp/overlay/{bound,lower,upper,work}
# touch /tmp/overlay/lower/test
# bindfs /tmp/overlay/lower /tmp/overlay/bound
# mount -t overlay -o lowerdir=/tmp/overlay/bound,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work none /tmp/overlay/lower
# ls /tmp/overlay/lower
test

Or with the regular bind:

# mkdir -p /tmp/overlay/{bound,lower,upper,work}
# touch /tmp/overlay/lower/test
# mount --bind /tmp/overlay/lower /tmp/overlay/bound
# mount -t overlay -o lowerdir=/tmp/overlay/bound,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work none /tmp/overlay/lower
# ls /tmp/overlay/lower
test

Both work.

trapexit commented 8 months ago

That's not how mergerfs works. It works on paths. Not file descriptors. That was an explicit design decision at the very beginning.

0) It was how some other union filesystems worked, so there was an expectation of similar functionality.
1) mergerfs is expected to be used with filesystems that may be flaky. Holding an open file descriptor on a broken filesystem can be problematic.
2) If a filesystem does go sideways you can force umount it without mergerfs having to be involved.
3) You can mount and umount filesystems from underneath mergerfs without having to orchestrate.
4) When mergerfs was written not all targeted systems had *at syscalls.

Whether all of those still matter or are worth the tradeoffs can certainly be debated but the fact is changing that behavior requires a complete rewrite of all interaction with the branches, changes what is possible with the product, the runtime API, and how people interact with failure conditions. IE... it won't be changed in mergerfs v2.x.
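To make the path-versus-descriptor distinction concrete, here is a minimal sketch. It is an analogy, not mergerfs code: a rename stands in for "something new shows up at the same path" (such as an overlayfs mounted over a branch), and the function name demo_path_vs_fd is made up for illustration. It is Linux-only because it reads the held descriptor back through /proc/self/fd.

```shell
#!/bin/sh
# Illustration only. A rename plays the role of an overmount: after it,
# the same path names a different directory than the one opened earlier.
demo_path_vs_fd() {
    tmp=$(mktemp -d) || return 1
    mkdir "$tmp/branch" && : > "$tmp/branch/original"

    # Descriptor-style access: open the directory once and keep it open.
    exec 3< "$tmp/branch"

    # Change what the path refers to.
    mv "$tmp/branch" "$tmp/branch.old"
    mkdir "$tmp/branch" && : > "$tmp/branch/replacement"

    by_path=$(ls "$tmp/branch")       # resolves the path again on every access
    by_fd=$(ls /proc/self/fd/3/)      # follows the descriptor opened earlier

    exec 3<&-
    rm -rf "$tmp"
    echo "path=$by_path fd=$by_fd"
}
```

Calling demo_path_vs_fd on Linux should print path=replacement fd=original. Path-based access, which is what mergerfs does according to the comment above, follows whatever currently sits at the path; the bindfs and bind-mount results earlier in the thread are consistent with descriptor-like behaviour.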

gentoo-root commented 8 months ago

Thanks for the explanation. This behavior makes sense if it works on paths by design (although it's quite unusual to me, I see that it has its use cases).

In any case, though, the steps above lead to a hang in the kernel (in a syscall), and they can be reproduced without root permissions (using an unprivileged mount namespace). This hang prevents normal unmounting, suspend and shutdown/reboot. Maybe it makes sense to detect the deadlock and bail out with an error to prevent this lockup?

To be honest, it's been a long while since I last used FUSE filesystems. Back in the day, you had to launch a SUID executable to mount a FUSE filesystem. That meant that unprivileged users couldn't install their own FUSE handlers, and in that setting I'd say it would definitely make sense to fix any userspace deadlocks in mergerfs that block the kernel. Today I see that Arch Linux installs /dev/fuse with mode 666, and the mergerfs/bindfs executables are not SUID. With /dev/fuse having 666 permissions, any user can basically start a malicious FUSE driver that blocks while handling some syscalls, leading to a similar result, am I right?

trapexit commented 8 months ago

How do you prevent a deadlock like this? You've created an indirect loop. I'd have to crawl all mounts before every request, knowing the syntax of every filesystem or setup that this could happen with, and try to determine a loop. I couldn't make a request on the path because that would loop back around to mergerfs.

If you want to kill the fuse connection you go to /sys/fs/fuse/connections/X/ and echo something into abort. That should take out mergerfs.
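A rough sketch of finding the X for a given mount point and aborting it. The function names fuse_conn_id and fuse_abort are made up here. It also rests on two assumptions: that the fusectl filesystem is mounted at /sys/fs/fuse/connections, and that the connection directory name equals the minor device number shown for the mount in /proc/self/mountinfo.

```shell
#!/bin/sh
# Assumption: the directory name under /sys/fs/fuse/connections matches the
# minor number in field 3 (major:minor) of /proc/self/mountinfo.

# Print the connection id of the FUSE mount at a mount point.
# $1 = mountinfo-format file, $2 = mount point
fuse_conn_id() {
    awk -v mp="$2" '
        $5 == mp {
            # Optional fields start at 7; "-" separates them from the fstype.
            for (i = 7; i <= NF; i++) if ($i == "-") break
            if ($(i + 1) ~ /^fuse/) { split($3, dev, ":"); id = dev[2] }
        }
        END {
            if (id == "") exit 1
            print id
        }
    ' "$1"
}

# Abort the FUSE connection behind a mount point. Writing to the abort file
# normally requires being the mount owner or root.
fuse_abort() {
    id=$(fuse_conn_id /proc/self/mountinfo "$1") || {
        echo "no FUSE mount found at $1" >&2
        return 1
    }
    echo 1 > "/sys/fs/fuse/connections/$id/abort"
}
```

With the reproduction from this issue, the call would be fuse_abort /tmp/overlay/bound. Reading mountinfo does not touch the hung filesystem itself, which matters here because any access through the mount point would block as well.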

AFAIK you still need a SUID app: fusermount.

gentoo-root commented 8 months ago

> I'd have to crawl all mounts before every request, knowing the syntax of every filesystem

I was thinking more of detecting recursive locks, but on second thought, the detection would need to happen on the overlayfs side: list files in overlayfs /tmp/overlay/lower -> list files in the overlayfs lowerdir /tmp/overlay/bound -> list files in mergerfs branch /tmp/overlay/lower -> back to overlayfs again, and that is where the recursive locking happens.
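For this one specific configuration, the loop can at least be spotted from outside before touching the mount. The following is only a sketch with a made-up function name, overlay_loops_back. It assumes the overlay's lowerdir option appears in the super options of /proc/self/mountinfo (which may not hold for every kernel or mount method), and it only recognises an overlay mounted exactly on the branch path. It does not address the general problem of arbitrary filesystems and indirect setups.

```shell
#!/bin/sh
# Succeeds (exit 0) if an overlay is mounted on the branch path and lists the
# mergerfs mount point among its lowerdirs.
# $1 = mountinfo-format file, $2 = mergerfs mount point, $3 = branch path
overlay_loops_back() {
    awk -v fs_mp="$2" -v branch="$3" '
        $5 == branch {
            # Find the "-" separator; fstype, source and super options follow.
            for (i = 7; i <= NF; i++) if ($i == "-") break
            if ($(i + 1) != "overlay") next
            n = split($(i + 3), opt, ",")
            for (j = 1; j <= n; j++) {
                if (opt[j] !~ /^lowerdir=/) continue
                sub(/^lowerdir=/, "", opt[j])
                m = split(opt[j], dir, ":")   # lowerdir may hold several paths
                for (k = 1; k <= m; k++) if (dir[k] == fs_mp) found = 1
            }
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}
```

For the reproduction in this issue the check would be overlay_loops_back /proc/self/mountinfo /tmp/overlay/bound /tmp/overlay/lower.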

> If you want to kill the fuse connection you go to /sys/fs/fuse/connections/X/ and echo something into abort. That should take out mergerfs.

That indeed works! I didn't know about it, thanks for the tip.

> AFAIK you still need a SUID app: fusermount.

The filesystems themselves are not SUID, which means any user can create and start one. But abort via sysfs is a good enough mitigation (except when you are rebooting, and it's too late to use the shell).

trapexit commented 8 months ago

Correct. Anyone can create and start one. That's one of the points of the technology. This issue isn't unique to mergerfs. Any fuse filesystem can block. It doesn't ultimately have anything to do with this loop. If I put sleep(~0) into every thread the same would happen. A broken filesystem or device that causes a mergerfs thread to block in a syscall would do the same. It is a fundamental risk with how the whole system is designed. It's not like I can put timeouts on syscalls. The best I could do is create a watchdog system to detect stuck threads, but then what? I can't do much about it.
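For what the detection half of such a watchdog might look like from the outside, here is a small Linux-only sketch that lists the threads of a process currently in the D state, using /proc/<pid>/task/<tid>/stat. The function names thread_state and stuck_threads are made up, and this is not something mergerfs provides. As noted above, detecting a stuck thread does not by itself offer a way to recover it.

```shell
#!/bin/sh
# Print the state letter from one line of /proc/<pid>/task/<tid>/stat.
# The comm field is parenthesised and may contain spaces or ")" itself,
# so cut after the last ") " instead of splitting on whitespace.
thread_state() {
    rest=${1##*") "}
    printf '%s\n' "${rest%% *}"
}

# Print the thread ids of process $1 that are in uninterruptible sleep (D).
stuck_threads() {
    for stat in /proc/"$1"/task/*/stat; do
        [ -r "$stat" ] || continue
        IFS= read -r line < "$stat" || continue   # thread may have exited
        if [ "$(thread_state "$line")" = D ]; then
            tid=${stat%/stat}
            echo "${tid##*/}"
        fi
    done
}
```

A single sample is not enough to call a thread stuck, since threads pass through the D state briefly during ordinary I/O. A watchdog would have to see the same thread id in D across several samples before treating it as hung.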