Can you provide a more thorough trace of mergerfs? If mergerfs is getting EPERM ... generally it just needs to return that. Something is triggering that call. It is either coming from the client app directly or due to a clone path or file move (moveonenospc). I need to see the full trace to comment further.
985 <... read resumed>"\200\0\0\0\4\0\0\0\374\23\327O\0\0\0\0(\3\17\0\0\0\0\0x\5\0\0x\5\0\0"..., 1052672) = 128
985 utimensat(42, NULL, [UTIME_OMIT, {tv_sec=1695652477, tv_nsec=659721176} /* 2023-09-25T14:34:37.659721176+0000 */], 0) = -1 EPERM (Operation not permitted)
985 writev(4, [{iov_base="\20\0\0\0\377\377\377\377\374\23\327O\0\0\0\0", iov_len=16}], 1) = 16
I'd need to see the exact request being made, but I can't think of any situation where a message from the kernel would result in nothing but a utimensat, other than a setattr request.
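For reference, the failing call can be reproduced outside mergerfs with a small standalone program. This is a sketch only: the trace above shows the fd-based form of the call (pathname NULL), while the sketch uses the path-based form, and the target path is a placeholder. Per utimensat(2), setting a timestamp to anything other than the current time requires owning the file or CAP_FOWNER; plain write permission is only sufficient for the NULL/UTIME_NOW case, so a process that can write a file perfectly well can still get EPERM from exactly this call.

/* utimensat_repro.c -- minimal sketch of the call seen in the trace.
 * Build: cc -o utimensat_repro utimensat_repro.c
 */
#define _GNU_SOURCE
#include <fcntl.h>     /* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>  /* utimensat, UTIME_OMIT */
#include <time.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 2;
    }

    struct timespec ts[2] = {
        { .tv_nsec = UTIME_OMIT },                      /* leave atime untouched */
        { .tv_sec = 1695652477, .tv_nsec = 659721176 }, /* explicit mtime, copied from the trace */
    };

    /* Explicit timestamps are gated on ownership or CAP_FOWNER, not
     * write access, so this can fail with EPERM on a writable file. */
    if (utimensat(AT_FDCWD, argv[1], ts, 0) == -1) {
        perror("utimensat");
        return 1;
    }
    puts("utimensat succeeded");
    return 0;
}

Running it against the affected file on the mergerfs mount and then directly on each branch should show whether the underlying filesystem itself rejects the call.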
The reproducer I'm using is to select an errored item and set it to redownload; there are a number of these items, including new ones, which all do the same thing.
I've attached the full strace for both. In each case I attached to the existing process, attempted to resume the same item multiple times (to ensure it appeared), then stopped the trace: strace_qbt.txt.gz strace_merger.txt.gz
My weekly update + reboot has apparently resolved this, despite the issue having been present for at least 3-4 weeks. I'm not sure what changed (beyond maybe a kernel version?), but I'm going to close this, as it's apparently not mergerfs; I'll follow up if it does come back.
Thanks for taking an initial look at the logs and sorry for any time lost.
Would be good to understand what was going on there, but it's probably not worth the effort to dig in right now. If it crops up again, I'm happy to take a look if I have the time.
Describe the bug
Using QBT with both mmap and POSIX modes fails to write with EPERM. It appears to write some data to the FD before failing out each time, so there are clearly write perms for the contents at least. Running strace against mergerfs results in the following:
From the application POV the pwrite64 call fails with EPERM instead (which is what makes me think this is possibly a mergerfs issue):

I can see the perms are correct on both the mergerfs mount point and all the filesystems underneath, showing the same UID/GIDs, and the process is running as containers as expected:

To Reproduce
In the container, as the user abc with the same UID and GID, I can do the following in the same directory:

So I think this is some interaction between a specific syscall and mergerfs, since QBT isn't requesting utimensat but mergerfs does. However, I'm not sure how to diagnose it further.
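One way to narrow it down, sketched below with a hypothetical standalone helper (not part of QBT or mergerfs), is to compare the caller's effective IDs against the file's owner on each branch, since utimensat with explicit timestamps checks ownership (st_uid) rather than the mode bits that make the file writable.

/* ownercheck.c -- hypothetical diagnostic helper.
 * Build: cc -o ownercheck ownercheck.c
 */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 2;
    }

    struct stat st;
    if (stat(argv[1], &st) == -1) {
        perror("stat");
        return 1;
    }

    /* The mode bits govern read/write access; timestamp changes with
     * explicit times are gated on st_uid (or CAP_FOWNER) instead. */
    printf("file   uid=%u gid=%u mode=%04o\n",
           (unsigned)st.st_uid, (unsigned)st.st_gid,
           (unsigned)(st.st_mode & 07777));
    printf("caller euid=%u egid=%u%s\n",
           (unsigned)geteuid(), (unsigned)getegid(),
           geteuid() == st.st_uid ? "" : "   <-- not the owner");
    return 0;
}

A mismatch on any branch would explain writes succeeding while the timestamp update fails with EPERM.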
Expected behavior

The operation does not fail with EPERM.
System information:
Mount options with word-wrap:
allow_other,minfreespace=25G,cache.files=per-process,category.create=pfrd,readahead=2048,category.search=newest,parallel-direct-writes=true,cache.attr=120,cache.entry=120,cache.readdir=true,cache.statfs=10,link_cow=true,xattr=noattr
Additional context

As part of diagnosing the issue, I've:
- rotated the underlying filesystems from ext4 -> xfs
- verified the disks have no bad sectors using dm_crypt and a full write of 0's, then read them back with cmp -b
- checked dmesg, which produces no output when I re-run the failing downloads

I'm happy to try adjusting various params etc. to see if we can pin this down. I've held off changing cache.files since POSIX mode shows the same error too, and this is an error I've seen randomly but with increasing frequency.