utimensat - EPERM on write

DavidFair commented 9 months ago

Describe the bug

Using QBT with both mmap and POSIX modes fails to write with EPERM, it appears to write some data to the FD before failing out each time so there's clearly write perms for the contents at least:

Running strace against mergerfs results in the following

979   read(4,  <unfinished ...>
986   pwrite64(42, "\16\210\177p\216\231\206F\202/E\207a\232}*o\343\303$[/?\206\4x\2611\362.\215\\"..., 20480, 216293376) = 20480
986   writev(4, [{iov_base="\30\0\0\0\0\0\0\0\372\23\327O\0\0\0\0", iov_len=16}, {iov_base="\0P\0\0\0\0\0\0", iov_len=8}], 2) = 24
986   read(4,  <unfinished ...>
985   <... read resumed>"\200\0\0\0\4\0\0\0\374\23\327O\0\0\0\0(\3\17\0\0\0\0\0x\5\0\0x\5\0\0"..., 1052672) = 128
985   utimensat(42, NULL, [UTIME_OMIT, {tv_sec=1695652477, tv_nsec=659721176} /* 2023-09-25T14:34:37.659721176+0000 */], 0) = -1 EPERM (Operation not permitted)
985   writev(4, [{iov_base="\20\0\0\0\377\377\377\377\374\23\327O\0\0\0\0", iov_len=16}], 1) = 16

From the application POV the pwrite64 call fails with EPERM instead (which is what makes me think this is possibly a mergerfs issue):

2335987 <... pwrite64 resumed>)         = 16384
2335987 epoll_ctl(8, EPOLL_CTL_MOD, 7, {events=EPOLLIN|EPOLLERR|EPOLLET, data={u32=2294506160, u64=139885084369584}}) = 0
2300315 <... epoll_pwait resumed>[{events=EPOLLIN, data={u32=2294506160, u64=139885084369584}}], 128, 381, NULL, 8) = 1
2335987 pwrite64(72, "\22.\332\2\231\217\224\231\vs\331\357\3203\203\365\361\305\271]+\253f;\345\202C\244\33\30\16\311"..., 16384, 2336702464 <unfinished ...>
2300315 epoll_pwait(8,  <unfinished ...>
2329691 <... pwrite64 resumed>)         = 16384
2329691 epoll_ctl(8, EPOLL_CTL_MOD, 7, {events=EPOLLIN|EPOLLERR|EPOLLET, data={u32=2294506160, u64=139885084369584}}) = 0
2300315 <... epoll_pwait resumed>[{events=EPOLLIN, data={u32=2294506160, u64=139885084369584}}], 128, 367, NULL, 8) = 1
2329691 pwrite64(40, "\373\177\315\347L\337\356\250\36h0\22\234\2}\6\30\tj\2002\227\330\7\302\21u\245\327%/\37"..., 16384, 332420309 <unfinished ...>
2300315 epoll_pwait(8,  <unfinished ...>
2335986 <... pwrite64 resumed>)         = 16384
2335986 epoll_ctl(8, EPOLL_CTL_MOD, 7, {events=EPOLLIN|EPOLLERR|EPOLLET, data={u32=2294506160, u64=139885084369584}}) = 0
2300315 <... epoll_pwait resumed>[{events=EPOLLIN, data={u32=2294506160, u64=139885084369584}}], 128, 367, NULL, 8) = 1
2335986 pwrite64(40, "<\6\22\0:\261\345e\345n\7\203\377\236\10\340|\275\274Ud\3\336\372\322\3\342\300\16\f\10\0"..., 16384, 332453077 <unfinished ...>
2300315 epoll_pwait(8,  <unfinished ...>
2335459 <... pwrite64 resumed>)         = -1 EPERM (Operation not permitted)

I can see the perms are correct on both the mergerfs mount point, and all filesystems underneath showing the same UID/GIDs, and the process is running as containers as expected:

drwxrwxr-x 2 containers containers 8.0K Sep 25 14:05  .
drwxrwxr-x 5 containers containers  190 Sep 24 17:14  ..
-rw-rw-r-- 1 containers containers 349M Sep 13 05:47 'some_file.!qB'
-rw-r--r-- 1 containers containers 349M Sep 24 17:58 'some_file1.!qB'
-rw-r--r-- 1 containers containers    0 Sep 25 07:19 'some_file2.!qB'
-rw-rw-r-- 1 containers containers 349M Sep 13 06:17 'some_file3.!qB'
-rw-rw-r-- 1 containers containers 349M Sep 12 23:06 'some_file4.!qB'

To Reproduce

In the container, as the user abc with the same UID and GID I can do the following in the same directory

touch test.\!qB
fallocate test_fill.\!qB -l 1G
dd if=/dev/urandom of=test_fill.\!qB bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.0315 s, 354 MB/s

So I think this is some interaction between a specific syscall and mergerfs, since QBT isn't requesting utimensat but mergerfs does. However, I'm not sure how to diagnose it further....

Expected behavior

The operation does not fail with EPERM

System information:

# uname -a
Linux download-server 6.5.0-5-generic #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Sep  6 15:11:07 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# mergerfs -v
mergerfs v2.37.1

# cat /etc/fstab  # trimmed for brevity
/dev/disk/by-uuid/d28cf3c7-ddca-4cd9-af2f-213f0f9f8ad3 /mnt/merger/d1 xfs defaults,noatime 0 0
/dev/disk/by-uuid/f7579ac5-9a38-4b0d-a668-eb75ed79b684 /mnt/merger/d2 xfs defaults,noatime 0 0
/dev/disk/by-uuid/2cfab843-56d4-49ab-8e15-af99e68aa8cd /mnt/merger/d3 xfs defaults,noatime 0 0

/mnt/merger/d* /media fuse.mergerfs allow_other,minfreespace=25G,cache.files=per-process,category.create=pfrd,readahead=2048,category.search=newest,parallel-direct-writes=true,cache.attr=120,cache.entry=120,cache.readdir=true,cache.statfs=10,link_cow=true,xattr=noattr 0 0

Mount options with word-wrap: allow_other,minfreespace=25G,cache.files=per-process,category.create=pfrd,readahead=2048,category.search=newest,parallel-direct-writes=true,cache.attr=120,cache.entry=120,cache.readdir=true,cache.statfs=10,link_cow=true,xattr=noattr

Additional context As part of diagnosing the issue, I've rotated the underlying filesystems from ext4 -> xfs verified the disks have no bad sectors using dm_crypt and a full write of 0's then read them back with cmp -b. dmesg produces no output when I re-run the failing downloads

I'm happy to try adjusting various params....etc. to see if we can pin this down. I've held off changing the cache.files since POSIX mode also shows the same error too and this is an error I've seen randomly but with increasingly frequency

trapexit commented 9 months ago

Can you provide a more thorough trace of mergerfs? If mergerfs is getting EPERM ... generally it just needs to return that. Something is triggering that call. It is either coming from the client app directly or due to a clone path or file move (moveonenospc). I need to see the full trace to comment further.

trapexit commented 9 months ago

985   <... read resumed>"\200\0\0\0\4\0\0\0\374\23\327O\0\0\0\0(\3\17\0\0\0\0\0x\5\0\0x\5\0\0"..., 1052672) = 128
985   utimensat(42, NULL, [UTIME_OMIT, {tv_sec=1695652477, tv_nsec=659721176} /* 2023-09-25T14:34:37.659721176+0000 */], 0) = -1 EPERM (Operation not permitted)
985   writev(4, [{iov_base="\20\0\0\0\377\377\377\377\374\23\327O\0\0\0\0", iov_len=16}], 1) = 16

I'd need to see the exact request being made but I can't think of any situation where a message from the kernel would result in nothing but a utimensat except a setattr request.

DavidFair commented 9 months ago

The reproducer I'm using is to select an errored item and set it to redownload, there's a number of these items including new ones which all do the same thing.

I've attached the full strace for both: these both we're attaching to an existing process, attempting to resume the same one multiple times (to ensure it appeared) then stopping the trace: strace_qbt.txt.gz strace_merger.txt.gz

DavidFair commented 9 months ago

Doing my weekly update + reboot has apparently resolved this, despite having the issue for at least 3-4 weeks. I'm not sure what's changed (beyond maybe a kernel version?) but I'm going to close this, as it's clearly not mergerfs if it does come back.

Thanks for taking an initial look at the logs and sorry for any time lost.

trapexit commented 9 months ago

Would be good to understand what was going on there but probably not worth the effort to dig in right now. If it crops up again happy to take a look if I have the time.

trapexit / mergerfs

utimensat - EPERM on write #1257