trapexit / mergerfs

a featureful union filesystem
http://spawn.link

MergerFS mount randomly disappears, only displays ??? when listed #1290

Closed gogo199432 closed 8 months ago

gogo199432 commented 10 months ago

Describe the bug

The MergerFS mount seems to randomly disappear; trying to `ls` the filesystem root just gives back "cannot access '/Storage': Input/output error". At that point I need to restart the mergerfs service for the mount to reappear. However, this means I have to re-export my NFS share, which in turn means I have to remount or restart every service that uses it.

I'm having a really hard time narrowing down the cause; even now I have no idea why it happens, but it has been happening since I set up MergerFS 1-2 months ago. For context: my storage is on a Proxmox box that runs one LXC container with my PostgreSQL server, one VM for Jellyfin, and several VMs that act as K3S nodes. The MergerFS mount is accessed over NFS from both the Jellyfin VM and all the K3S nodes.

I have 4 disks, all with ext4 filesystems, mounted under /mnt as disk1-4. These are then merged and mounted under /Storage.

To Reproduce

As mentioned, it is really random; however, a scheduled backup that runs at midnight in Proxmox seems to be the most reliable trigger. Strangely, even that fails at random points: sometimes it manages to save the backup completely and the mount dies only after the backup ends and my notification email is sent, but I have also had instances where it disappeared mid-process.

I have also had it disappear while using Radarr or Sonarr to import media, but I have found those are not a reliable way to reproduce it.

Expected behavior

Function as expected: the mount shouldn't disappear and break NFS.

System information:

```
[Service]
Type=simple
KillMode=control-group
ExecStart=/usr/bin/mergerfs \
  -f \
  -o cache.files=partial,moveonenospc=true,category.create=mfs,dropcacheonclose=true,posix_acl=true,noforget,inodecalc=path-hash,fsname=mergerfs \
  /mnt/disk* \
  /Storage
ExecStop=/bin/fusermount -uz /Storage
Restart=on-failure

[Install]
WantedBy=default.target
```

 - List of drives, filesystems, & sizes:
   - `df -h`

```
Filesystem            Size  Used Avail Use% Mounted on
udev                   16G     0   16G   0% /dev
tmpfs                 3.2G  2.9M  3.2G   1% /run
/dev/mapper/pve-root   28G   12G   15G  45% /
tmpfs                  16G   34M   16G   1% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
efivarfs              128K   50K   74K  41% /sys/firmware/efi/efivars
/dev/sdf              3.6T   28K  3.4T   1% /mnt/disk4
/dev/sde              3.6T   28K  3.4T   1% /mnt/disk3
/dev/sda               19T  2.2T   16T  13% /mnt/disk1
/dev/sdb               19T  1.3T   16T   8% /mnt/disk2
/dev/fuse             128M   20K  128M   1% /etc/pve
tmpfs                 3.2G     0  3.2G   0% /run/user/0
mergerfs               44T  3.4T   38T   9% /Storage
```

   - `lsblk -f`

```
NAME                 FSTYPE      FSVER    LABEL      UUID                                   FSAVAIL FSUSE% MOUNTPOINTS
sda                  ext4        1.0      20T_disk_1 3bea15fe-0c62-42ad-bc73-727c7e6ed147     15.1T    12% /mnt/disk1
sdb                  ext4        1.0      20T_disk_2 9306d268-2f54-42f3-958b-d8555b470bf0     15.9T     7% /mnt/disk2
sdc
├─sdc1
├─sdc2               vfat        FAT32               EDFC-8E51
└─sdc3               LVM2_member LVM2 001            a9c81x-tUS5-CcN9-3w5u-ZF84-ODxd-r21cM9
  ├─pve-swap         swap        1                   3a923fc4-0c8f-4ba3-92a8-b3515283e669                  [SWAP]
  ├─pve-root         ext4        1.0                 ecd841a9-5d7b-4d70-a575-448fb85d8f51      14.2G    43% /
  ├─pve-data_tmeta
  │ └─pve-data-tpool
  │   └─pve-data
  └─pve-data_tdata
    └─pve-data-tpool
      └─pve-data
sdd
├─sdd1               ext4        1.0      BigBackup  29c4fd80-9fc5-4d1d-a783-cba4372cffc0
└─sdd2               LVM2_member LVM2 001            pB5Bes-ARIU-XsLl-ryLc-Nw1A-Ofnj-OEzODe
  ├─vmdata-bigthin_tmeta
  │ └─vmdata-bigthin-tpool
  │   ├─vmdata-bigthin
  │   ├─vmdata-vm--101--disk--0
  │   ├─vmdata-vm--102--disk--0
  │   ├─vmdata-vm--103--disk--0
  │   ├─vmdata-vm--104--disk--0
  │   ├─vmdata-vm--105--disk--0
  │   ├─vmdata-vm--111--disk--0
  │   ├─vmdata-vm--107--disk--0
  │   └─vmdata-vm--100--disk--1 ext4 1.0            a2234f63-38da-43fb-877a-a3e836f4004e
  └─vmdata-bigthin_tdata
    └─vmdata-bigthin-tpool
      ├─vmdata-bigthin
      ├─vmdata-vm--101--disk--0
      ├─vmdata-vm--102--disk--0
      ├─vmdata-vm--103--disk--0
      ├─vmdata-vm--104--disk--0
      ├─vmdata-vm--105--disk--0
      ├─vmdata-vm--111--disk--0
      ├─vmdata-vm--107--disk--0
      └─vmdata-vm--100--disk--1 ext4 1.0            a2234f63-38da-43fb-877a-a3e836f4004e
sde                  ext4        1.0      4T_disk_1  06207dd1-fc54-4faf-805d-a880dc432bc4      3.4T     0% /mnt/disk3
sdf                  ext4        1.0      4T_disk_2  2b06a9fa-901c-4b66-bfdc-8c7e4a09f21f      3.4T     0% /mnt/disk4
```

 - A strace of the application having a problem:
   Unable to provide due to how the command was run (scheduler)
 - strace of mergerfs while the app tried to do its thing:
   (logfile was too large, had to zip it)
[mergerfs.trace.zip](https://github.com/trapexit/mergerfs/files/13847594/mergerfs.trace.zip)

**Additional context**

My NFS export:
`/Storage *(rw,sync,fsid=0,no_root_squash,no_subtree_check,crossmnt)`

All disks have gone through a long selftest using smartctl and report no problems. Example output of the first 20TB disk:

```
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%              726  -
```

trapexit commented 10 months ago

When was the strace of mergerfs taken? While in that broken state? If so... mergerfs looks fine.

Preferably you would start tracing just before issuing a request to the mergerfs mount, and then also trace the app you use to generate the error. mergerfs is a proxy: the kernel handles requests between the client app and mergerfs, and there are lots of situations where the kernel short circuits the communication and never sends mergerfs anything. I suspect that is what is happening here (for whatever reason). mergerfs would only return EIO if the underlying filesystem did, while the kernel can return it for numerous reasons. And NFS and FUSE don't always play nice together.
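The paired-trace approach described above could look roughly like this (an illustrative sketch, not from the report: the `pidof` lookup and output file names are examples, and attaching strace to a running process generally requires root):

```shell
# Attach to the running mergerfs process before reproducing the failure,
# then trace the client command that triggers the EIO.
strace -f -tt -o mergerfs.trace -p "$(pidof mergerfs)" &
TRACER=$!
strace -tt -o client.trace ls /Storage   # or: strace -tt -o client.trace stat /Storage
kill "$TRACER"
```

Comparing timestamps between the two traces then shows whether the client's failing syscall ever reached mergerfs at all.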

Are you modifying the mergerfs pool on the host? Not through NFS?

gogo199432 commented 10 months ago

I started this trace right before I knew the backup would start in the hope that I would catch the issue. After checking it a minute or so later I saw that the mount disappeared and stopped the trace.

Not quite sure what you mean by modifying the pool on the host, but I didn't touch the system while the backup was running. In Proxmox you can add the backup target either as a Directory or as an NFS mount; I tried both and it didn't seem to make a difference. During the trace it was set up as a Directory.

I'm pretty sure that I have tried killing NFS completely and even then the backup would randomly fail, but this was a bit ago, I'm not sure anymore.

trapexit commented 10 months ago

NFS does not like out-of-band changes, particularly NFSv4. If you have an NFS export and then modify the filesystem directly on the host you export from... you can and will cause problems. It usually leads to stale errors, but it could depend on the situation.

Tracing as you did is fine, but since you aren't providing a matching trace of anything trying to interact with the filesystem on the host (or through NFS), I can't pinpoint who is responsible for the error, which is the critical question. Even tracing mergerfs after the failure starts and then tracing "ls" or "stat" accessing /Storage would answer that question.

gogo199432 commented 10 months ago

So it finally crashed again. I managed to get straces of both 'ls' and 'stat'; hopefully this helps.

ls.strace.txt mergerfs-ls.trace.txt

mergerfs-stat.trace.txt statfolder.strace.txt


On a related note: before I got my 2 big HDDs I was running the 4TB disks in a ZFS mirror and used the "sharenfs" toggle of ZFS to expose the folders I needed. The caveat there is that at least the top-most folders were separate datasets (or whatever ZFS calls them), so they were technically separate filesystems. I wonder if that is why it wasn't breaking? That's the only thing I can think of, unless you can find something in the logs above.

trapexit commented 10 months ago

Hmm... Well, the "stat" clearly worked, though you didn't stat the file, you statfs'ed it. But the statfs clearly worked and you can see the request in mergerfs. The ls, however, failed with EIO when it tried to stat /Storage, and I don't see any evidence of mergerfs receiving that request. That trace does show a file being read (Totally Spies), so something is able to interact with it.

Have you tried disabling all caching? If the kernel caches things it becomes more difficult to debug.

gogo199432 commented 10 months ago

It also fails with cache.files=off, but I can set it to that and do another 'ls' trace if that helps. Is there any other caching that I'm not aware of that I can disable?

trapexit commented 10 months ago

See the caching section of the docs. Turn off entry and attr caching too. And yes, it could help to trace that setup.

gogo199432 commented 10 months ago

Got a new trace with the following mergerfs settings: `-o cache.files=off,cache.entry=0,cache.attr=0,moveonenospc=true,category.create=mfs,dropcacheonclose=true,posix_acl=true,noforget,inodecalc=path-hash,fsname=mergerfs`

ls.strace.txt mergerfs.trace.txt

gogo199432 commented 10 months ago

Is there anything more I can provide to help deduce the issue? It is still occurring, sadly. The frequency seems to depend on how many applications are trying to use it. Also, the issue occurred while using Jellyfin, so it also happens after a strictly read-only operation.

trapexit commented 10 months ago

If I knew I'd ask for it.

The kernel isn't forwarding the request.

```
44419 15:35:19.951799 statx(AT_FDCWD, "/Storage", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|STATX_MTIME|STATX_SIZE, 0x7ffcf234e990) = -1 EIO (Input/output error) <0.000006>
```

vs

```
33344 15:35:19.359147 writev(4, [{iov_base="`\0\0\0\0\0\0\0\264K\373\0\0\0\0\0", iov_len=16}, {iov_base="0\242R\266\2\0\0\0\252\372\304~\2\0\0\0\26\205\327[\2\0\0\0\0\300}A\0\0\0\0\17\312{A\0\0\0\0\0\20\0\0\377\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=80}], 2) = 96 <0.000007>
33344 15:35:19.359172 read(4,  <unfinished ...>
33343 15:35:20.129934 <... read resumed>"8\0\0\0\3\0\0\0\266K\373\0\0\0\0\0\t\0\0\0\0\0\0\0\350\3\0\0\350\3\0\0b\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 1052672) = 56 <1.623557>
```

The fact that the read succeeds means some messages are coming in from the kernel: at the very least the statfs requesting info on the mount. But there is no log indicating why the kernel behaves that way.

You have a rather complex setup. All I can suggest is making it less so; it's not feasible for me to recreate what you have. Create a second pool with 1 branch. Keep everything simple. See if you can break it via standard tooling. There are many variables here, and we can't keep using the full stack to debug if nothing obvious presents itself.
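A minimal single-branch test pool of the kind suggested here could be mounted like this (a sketch; the branch and mountpoint paths are examples, and the options mirror the caching-disabled settings used earlier in the thread):

```
mergerfs -o cache.files=off,cache.entry=0,cache.attr=0,inodecalc=path-hash,fsname=mergerfs-test /mnt/disk1 /StorageTest
```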

Janbong commented 9 months ago

Hi,

I'm experiencing a similar issue on the same platform as OP (as far as I can tell from his shared outputs): Proxmox VE. PVE version: pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.11-7-pve)

I used to have a VM under Promox running MergerFS (with an HBA with PCIe passthrough to the VM).

Last week I changed this so my Proxmox host runs MergerFS with the disks attached directly to the motherboard instead of using the HBA. (Server is way less power hungry that way.)

After a day or 2 I noticed that my services, which rely on the MergerFS mount, stopped working. ls of the mount resulted in Input/output error. Killing MergerFS and running mount -a again (I'm using a MergerFS line in my /etc/fstab) is the only solution.

Today the problem occurred again. I was using MergerFS version 2.33.5 from the Debian repos that are configured in Proxmox. I just updated to version 2.39.0 (latest release) but don't expect this to be the solution.

In my previous setup (with the VM), the last version I was using was `2.21.0`.

Any steps I can take to help troubleshoot the problem? I love MergerFS and have been using it for years (in the VM setup I explained above). I'd like to keep using it, but losing the mount every 2 days is of course not an option.

dmesg reports no issues with the relevant disks.

Thanks in advance for the help!

trapexit commented 9 months ago

Any steps I can take to help us troubleshoot the problem?

The same as I describe in this thread and the docs.

gogo199432 commented 9 months ago

Quick update from my side. I tried to reduce the setup to the best of my abilities. First of all, I made a second mount only for Proxmox backups. This mount never had any issues, so we can cross that out. Second, I modified my existing MergerFS mount so it only contained a single disk. In addition, I removed all services except Jellyfin.

So setup was:

Disk 1 MergerFS mount -> Shared over NFS -> Jellyfin VM, mounted folders using fstab

This seemed the most stable, but it still failed. When I told Jellyfin to scan my library from zero it managed to get to 96% or so, and suddenly the mount vanished. This seems like the most "reliable" way to repro; I managed to trigger it twice in a row. Still no luck triggering it with any basic Linux command.

Having said all this: after 3-4 months of this issue persisting and with no light at the end of the tunnel, I decided to throw money at the problem, bought another disk, and moved back to ZFS. I would love to use MergerFS in the setup I described in the original post, but having my main storage disappear from my production systems is not a state to be in long-term.

trapexit commented 9 months ago

@Janbong And are you using NFS too?

@gogo199432 Are your mergerfs and NFS settings the same as in the original post? Have you tried modifying NFS settings? Do you have other FUSE filesystems mounted? Are you modifying the export outside of NFS?

Janbong commented 9 months ago

I'm also relying on NFS to let services running on VMs access the MergerFS mount. But that was also the case when MergerFS was running on one of the VMs, which worked without any issues for years.

trapexit commented 9 months ago

Yes, but was it the same kernel? Same NFS config? Same NFS version?

Janbong commented 9 months ago
trapexit commented 9 months ago

What were the versions? What are the settings?

gogo199432 commented 9 months ago

I was using the same settings as previously, except for reducing to a single disk. Apart from what we discussed about caching, I have not been modifying the NFS settings. I tried removing the posix_acl setting entirely, but that made no difference. If you mean another system apart from MergerFS that uses FUSE, then no, I do not have anything else. As far as I can tell I have been doing no modifications outside of NFS; as I said, the only out-of-band thing was the backup, but I moved that to a second MergerFS mount. So at the time of the issue with Jellyfin there was nothing running on Proxmox that would access the mount.

Janbong commented 9 months ago

What were the versions? What are the settings?

Still looking for a way to check the version. I think both are enabled, looking at the output of `nfsstat -s`:

root@pve1:~# nfsstat -s
Server rpc stats:
calls      badcalls   badfmt     badauth    badclnt
2625399    0          0          0          0

Server nfs v3:
null             getattr          setattr          lookup           access
6         0%     844071   32%     204       0%     797       0%     150711    5%
readlink         read             write            create           mkdir
0         0%     1206289  45%     2833      0%     88        0%     4         0%
symlink          mknod            remove           rmdir            rename
0         0%     0         0%     0         0%     0         0%     0         0%
link             readdir          readdirplus      fsstat           fsinfo
0         0%     0         0%     4324      0%     414536   15%     8         0%
pathconf         commit
4         0%     1503      0%

Server nfs v4:
null             compound
4         6%     54       93%

Server nfs v4 operations:
op0-unused       op1-unused       op2-future       access           close
0         0%     0         0%     0         0%     2         1%     0         0%
commit           create           delegpurge       delegreturn      getattr
0         0%     0         0%     0         0%     0         0%     36       24%
getfh            link             lock             lockt            locku
4         2%     0         0%     0         0%     0         0%     0         0%
lookup           lookup_root      nverify          open             openattr
2         1%     0         0%     0         0%     0         0%     0         0%
open_conf        open_dgrd        putfh            putpubfh         putrootfh
0         0%     0         0%     34       23%     0         0%     8         5%
read             readdir          readlink         remove           rename
0         0%     0         0%     0         0%     0         0%     0         0%
renew            restorefh        savefh           secinfo          setattr
0         0%     0         0%     0         0%     0         0%     0         0%
setcltid         setcltidconf     verify           write            rellockowner
0         0%     0         0%     0         0%     0         0%     0         0%
bc_ctl           bind_conn        exchange_id      create_ses       destroy_ses
0         0%     0         0%     4         2%     2         1%     2         1%
free_stateid     getdirdeleg      getdevinfo       getdevlist       layoutcommit
0         0%     0         0%     0         0%     0         0%     0         0%
layoutget        layoutreturn     secinfononam     sequence         set_ssv
0         0%     0         0%     4         2%     44       30%     0         0%
test_stateid     want_deleg       destroy_clid     reclaim_comp     allocate
0         0%     0         0%     2         1%     2         1%     0         0%
copy             copy_notify      deallocate       ioadvise         layouterror
0         0%     0         0%     0         0%     0         0%     0         0%
layoutstats      offloadcancel    offloadstatus    readplus         seek
0         0%     0         0%     0         0%     0         0%     0         0%
write_same
0         0%

Settings:

root@pve1:~# cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
#               to NFS clients.  See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes       hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4        gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes  gss/krb5i(rw,sync,no_subtree_check)
#
/mnt/storage redacted_ip/24(rw,async,no_subtree_check,fsid=0) redacted_ip/8(rw,async,no_subtree_check,fsid=0) redacted_ip/24(rw,async,no_subtree_check,fsid=0)
/mnt/seed redacted_ip/24(rw,async,no_subtree_check,fsid=1) redacted_ip/8(rw,async,no_subtree_check,fsid=1) redacted_ip/24(rw,async,no_subtree_check,fsid=1)

Janbong commented 9 months ago

Client side reporting that it's using NFSv3:

21:02:00 in ~ at k8s ➜ cat /proc/mounts | grep nfs

```
fs1.bongers.lan:/mnt/storage /mnt/storage nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,na
.4.90,mountvers=3,mountport=51761,mountproto=udp,local_lock=none,addr=redacted_ip
fs1.bongers.lan:/mnt/seed /mnt/seed nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=2
mountvers=3,mountport=51761,mountproto=udp,local_lock=none,addr=redacted_ip
```

trapexit commented 9 months ago

The pattern I'm seeing is Proxmox + NFS. While I certainly won't rule out a mergerfs bug, the fact that everyone who experiences this seems to have that setup suggests it could be the Proxmox kernel. I'm going to reach out to the FUSE community to see if anyone has any ideas.

trapexit commented 9 months ago

Some notes:

1. I set up a dedicated machine running the latest Armbian (x86), mounted an SSD formatted with ext4, put a mergerfs mount over it, exported it, and mounted that export from another machine. I used a stress tool I wrote (bbf) to hammer the NFS mount for several hours. No issues.
2. I will set up Proxmox and attempt the same. If the stress test fails to trigger it I'll try adding some media and installing Plex, I guess. For some reason Proxmox doesn't like the extra machine I have been testing with, so I'll try with a VM, I guess.
3. I think what is happening is that NFS is getting into a bad state and causing mergerfs (or the kernel side of the relationship) to get into a bad state due to... something... maybe metadata changing under it in a way it doesn't like, for the root of the mount. And then that "sticks", so no new requests can be made because the root lookup is failing. I was under the impression that the kernel does not keep that error state forever, but maybe I'm wrong, or maybe this is triggering something different than normal. Either way, I will probably need to fabricate errors to see if it behaves similarly to what you all are seeing.
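A crude stand-in for that kind of stress run using only standard tooling (a sketch; bbf does far more, and the default target path here is an example, not from the thread; point it at the pool mount, e.g. /Storage, to exercise mergerfs):

```shell
#!/bin/sh
# Repeatedly create, stat, read, and delete files under a target directory,
# bailing out on the first I/O failure.
TARGET="${1:-/tmp/mergerfs-stress}"
mkdir -p "$TARGET"
i=0
while [ "$i" -lt 100 ]; do
    f="$TARGET/file.$i"
    dd if=/dev/zero of="$f" bs=1024 count=4 2>/dev/null || exit 1
    stat "$f" >/dev/null || { echo "stat failed on $f" >&2; exit 1; }
    cat "$f" >/dev/null || exit 1
    rm -f "$f"
    i=$((i + 1))
done
echo "completed $i iterations"
```

Running it in a loop from several NFS clients at once approximates the "enough clients and enough requests" condition described later in the thread.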

All that said: NFS and FUSE just do not play nicely together. There are fundamental issues with how the two interact, and even if those were fixed it would still be complicated to get things working flawlessly on my end. I've gotten some feedback from kernel devs on the topic and will write something up later once I test some things out, but I think after this I'm going to have to recommend people not use NFS, or at least "you're on your own". I'll offer suggestions on setup to minimize issues, but at the end of the day 100% support is not likely.

trapexit commented 9 months ago

Looking over the kernel code I've narrowed down situations where EIO errors can be returned and are "sticky". I think I can add some debug code to detect the situation and then maybe we can figure out what is happening.

Janbong commented 9 months ago

I appreciate you looking into this so thoroughly. Thanks!

In the mean time, the problem occurred two more times on my end.

I have disks that spin down automatically. I found in the logs that some time before the error occurred, all disks started spinning up, which leads me to believe some kind of scan was initiated: possibly a Plex scan, or a Sonarr/Radarr scan, something like that, which reads a lot of files in a short period of time. I am not sure how long before the actual error this was, as I am basing the timing off of when I got a message from someone that a service they use suddenly stopped working.

To be completely transparent about how I set things up: my 3 data disks are XFS, my SnapRAID parity disk is ext4. Though the parity disk is not really relevant here, I guess, since it is not part of the MergerFS mount.

I get what you are saying about FUSE + NFS not working together nicely. Still, I was running my previous (similar) setup for years without ANY issues at all. My bet is actually that Proxmox is doing something MergerFS can't handle. It's a little too coincidental that @gogo199432 is also running MergerFS on Proxmox and running into the same issue.

My previous setup was MergerFS on an Ubuntu VM, albeit with an HBA in between, as opposed to my current setup on Proxmox with the drives connected directly to the motherboard over SATA.

trapexit commented 9 months ago

What is your mergerfs config? And are you changing things out of band? Those two errors are pretty different things.

trapexit commented 9 months ago

I get what you are saying about FUSE + NFS not working together nicely. Still I was running my previous (similar) setup for years without ANY issues at all.

Yes, but mergerfs, the kernel, and the NFS code have all evolved. The kernel code has gotten more strict about certain security concerns in and after 5.14.

If the kernel marks a node bad (literally a function called `fuse_make_bad(inode)`), there is nothing I can do. Hence why I'm trying to understand how NFS is triggering the root to be marked as such, because this is not a unique situation: NFS shouldn't cause issues any more than normal use does. If there is a bug in the kernel that is leading to this, I likely can't do much about it until it is addressed by the kernel devs, or Proxmox updates their kernel.

Janbong commented 9 months ago

my /etc/fstab (MergerFS settings):

```
/mnt/disk* /mnt/storage fuse.mergerfs direct_io,defaults,allow_other,noforget,use_ino,minfreespace=50G,fsname=mergerfs 0 0
```

trapexit commented 9 months ago

@Janbong As mentioned in the docs, `direct_io`, `allow_other`, and `use_ino` are deprecated. And for NFS usage you should really be setting `inodecalc=path-hash` for consistent inodes.
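A modernized version of that fstab line might look like the following (a sketch, not a drop-in recommendation: `cache.files=off` is only one possible stand-in for the removed `direct_io`, and the remaining values are carried over from Janbong's original line):

```
/mnt/disk* /mnt/storage fuse.mergerfs defaults,noforget,minfreespace=50G,inodecalc=path-hash,cache.files=off,fsname=mergerfs 0 0
```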

trapexit commented 9 months ago

Would it be possible for you to try building the nfsdebug branch? Or tell me what version of Debian Proxmox is based on, and I can build packages.

I put in debugging info that will be printed to syslog (journalctl) in cases where the kernel would normally mark things as errored.

Go0oSer commented 9 months ago

I presently don't have any trace logs to add, but I wanted to let those here know that I am also having the same issue. I am also using Proxmox with NFS. Specifically, I'm using mergerfs in an Ubuntu VM and passing the HBA PCIe device from the hypervisor into the VM. My mergerfs mount has been EIO'ing more frequently lately. It was fairly stable for a while, until sometime in December. I'm not reliably able to trigger the EIO.

I also opened issue 1004 in the past for a similar issue, albeit I did not report at the time that I was using Proxmox and NFS. In that issue we somewhat determined that ballooning RAM and KVM were likely at fault.

I've gone ahead and disabled ballooning RAM for the time being and will report back. If you would like me to gather any other info, please let me know.

trapexit commented 9 months ago

1004 was a different issue (afaict), though it is still uncertain why it happens. In that case the kernel was asking for something that isn't ever supposed to happen: the parent of the root. I "fixed" that by returning that it doesn't exist. Perhaps that is causing this EIO, but since I can't reproduce the situation I really can't say.

I think the best we can do right now is have you all run a special build that I've instrumented to log when these odd situations occurred.

I failed to install Proxmox 8.1 on a random PC I have lying around (it doesn't like the eMMC), so I'm going to try with an old laptop. Then I can build a deb for you all to try.

Specifically, I'm using mergerfs in an Ubuntu VM and passing the HBA PCIe device from the hypervisor into the VM.

That's not what is being done by the others in this thread... unless we aren't communicating clearly. They are running mergerfs on Proxmox itself.

I think I need to build a tool to query a user's host and config because it is just too difficult to get everyone to explain their setup.

Go0oSer commented 9 months ago

Apologies. I didn't mean to muddy the waters.

trapexit commented 9 months ago

mergerfs_2.39.0-2-g9123d05~debian-bookworm_amd64.deb.zip

Can you try this (those using Proxmox)? And when you run into the EIO issue, give me the output from `journalctl -xb -t mergerfs`.

trapexit commented 9 months ago

I've been able to reproduce it. Not easily, but given enough clients and enough requests it happens after some time. Interestingly, I'm not seeing anything in the log from the debugging I put into place. I've got it set up with only 1 branch, path-hash inode calc, and all caching disabled. I need to see if I can make the reproduction even simpler and add some more logging.

yeyeoke commented 9 months ago

Thank god, someone else is having this issue as well. This has been driving me crazy for months now.

I've got 3 Proxmox nodes running, amongst others, 1 Debian node with drives combined using MergerFS via the following entry in /etc/fstab:

```
/mnt/disks/storage/* /mnt/storage fuse.mergerfs fsname=hdd-pool,func.getattr=newest,noforget,cache.files=off,dropcacheonclose=true,category.create=mfs,moveonenospc=true,minfreespace=25G 0 0
```

The Mergerfs pool is then shared to the following machines:

- 1 VM running Proxmox Backup Server
- 1 VM running a media stack in Docker (Jellyfin, Plex, etc.)
- 1 VM running qBittorrent, Sonarr, Radarr, etc.

Here is the weird part: the media VM keeps on working when I get notified that the NFS VM is experiencing input/output errors on /mnt/storage. The qBittorrent VM also experiences these problems, but once again Sonarr, Plex, and Jellyfin all still have access to the NFS share.

Specs for the PBS VM:

Distro: Debian GNU/Linux 12 (bookworm)
Kernel: Linux 6.5.11-8-pve

Specs for all the other machines:

Distro: Debian GNU/Linux 12 (bookworm)
Kernel: Linux 6.1.0-18-amd64

MergerFS version: v2.39.0
NFS version used by the clients: v4.2

trapexit commented 9 months ago

@yeyeoke

If you are exporting mergerfs through NFS, you need to follow the instructions in the docs, otherwise you absolutely will run into issues. Those issues, however, are different from what is being discussed here.

I'm writing up more detailed and explicit docs on the topic, given the sudden increase in reports, and possibly a workaround/fix. You might as well wait till I post those. Hopefully later today.

trapexit commented 9 months ago

I've gotten possible confirmation that this is a kernel bug. The author of FUSE has said:

> This is really weird. It's a kernel bug, no arguments, because kernel should never send a forget against the root inode. But that lookup(nodeid=1, name=..); already looks bogus.

I.e., what is happening is that the kernel is asking mergerfs for details on the parent of the root node, which doesn't make much sense, as the root node is... the root :) Currently I see that and return ENOENT; before, the code didn't check at all because, as far as I understood, the kernel should never ask that, and it led to a crash. But with that change it results in the EIO errors you are all seeing.

There is some debate if this is truly "bogus". We'll see.

If it is a kernel issue, I'm not sure I'll be able to provide any workarounds. We'll have to see what the kernel devs find. If I can work around it, I'll do so. If not... you might just need to use something besides NFS, or upgrade your kernel once one is available.

trapexit commented 9 months ago

https://github.com/trapexit/mergerfs/releases/tag/2.40.0

Try this with `export-support=false`.
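For those mounting via /etc/fstab rather than a systemd unit, the new option would presumably be added alongside the existing ones, e.g. (a hypothetical line pieced together from configs earlier in this thread, not a verified recommendation):

```
/mnt/disk* /mnt/storage fuse.mergerfs cache.files=off,noforget,inodecalc=path-hash,export-support=false,fsname=mergerfs 0 0
```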

Janbong commented 9 months ago

Just got back from a holiday. Happy to see progress has been made in diagnosing the issue. I will try the workaround tomorrow, and will read the docs on how to set up MergerFS with NFS, because I totally missed that.

Thanks @trapexit. Greatly appreciated!

mitzsch commented 9 months ago

Just want to peek in. I have a somewhat similar issue; at least it shares some similarities with the one here.

On my end, after a heavy filesystem activity task (in my case a snapraid scrub or Plex doing its thing; not very easy to reproduce), the network connection completely freezes. To be precise, the i40e driver "crashes"; the igb driver/connection remains functional.

The server (Dell R730xd with an MD1200 disk shelf, two E5-2640v4, 32GB RAM, Ubuntu 22.04.4, kernel 6.5.0-18.18) then does not handle any commands over the network. Sometimes it's accessible, and sometimes the connection collapses. An iperf3 test shows one or two successful 10G transfers and then only 0 Gbit transfers until the SFP connector is pulled and inserted again... However, the disks and their filesystems on the server remain functional; I can copy stuff from the array to a different drive... If I understand correctly, that is not the case with this issue? I also don't have an NFS export; however, the NFS utils are installed and loaded into the kernel...

Is it possible that I'm also hitting the same issue and, in my case, it's manifesting as a broken (i40e) network state?

When it happens the next time, I will create an strace log... Maybe it's also throwing I/O errors...

trapexit commented 9 months ago

Is it possible that I'm also hitting the same issue and, in my case, it's manifesting as a broken (i40e) network state?

Sounds like a different issue entirely. This issue has to do explicitly with NFS and FUSE.

trapexit commented 9 months ago

While there is a bug in the kernel, it wasn't exclusively the issue. It looks like the 5.14 kernel added a check for certain conditions that can happen outside of NFS use but are far, far more likely with NFS. While mergerfs' code had been the same for years with regard to this, the fact that people are moving to kernels past 5.14 is leading to problems. I'll have a release out shortly with a fix. It seems to work regardless of the kernel issue.

mitzsch commented 9 months ago

It looks like the 5.14 kernel added a check for certain conditions that can happen outside of NFS use but are far, far more likely with NFS.

Okay, so my issue might actually be from the same "bug family"... (high I/O causing FUSE problems?)

trapexit commented 9 months ago

No, you aren't using NFS, and you say your network is going out. That is not the issue this thread is about. Why do you believe the network going out is related to NFS or FUSE or mergerfs?

shocker2 commented 9 months ago

A small update regarding the first release with export-support false: it's not usable, and it's worse than before. Random NFS mounts go stale and do not clear with unmount/mount; it takes an OS reboot to clean up, and this happens every few hours (I'm not complaining, just sharing feedback). Waiting for the new release. Thank you @trapexit for all the hard work on this!

mitzsch commented 9 months ago

Why do you believe the network going out is related to NFS or FUSE or mergerfs

Hm, it's just a theory of mine. I mean, I'm also running a mergerfs pool/mount, and after high I/O activity through the array (= Plex) and/or directly to the disks (= snapraid), weird behavior occurs. In my case it renders network access unusable (my SMB shares become unresponsive, as well as any other network-related app). Also, there is no dmesg or syslog output when the issue arises.

It seems like the issue described here, with the only difference that NFS is not involved. But as you wrote, the kernel bug can also be triggered outside of NFS, even though that is not as likely.

Anyway, I will keep an eye on my issue; whenever it happens again, I will do an strace, do some further analysis, and report back (here, or in a different bug report, or wherever it belongs...).

Sorry, for the inconvenience!

trapexit commented 9 months ago

It seems like the issue described here - with the only difference that NFS is not involved. But as you wrote the kernel bug can also be triggered outside of NFS - even though not that likely.

But it doesn't. This is about mergerfs not becoming unresponsive but returning EIO errors. It isn't that the network dies; NFS and mergerfs locally return EIO errors when accessed. That is not what you've described. And you also talk about snapraid, which is entirely unrelated to mergerfs, network filesystems, or even the network.

trapexit commented 9 months ago

https://github.com/trapexit/mergerfs/releases/tag/2.40.1

shocker2 commented 9 months ago

Thanks, already compiling it.