trapexit / mergerfs

a featureful union filesystem
http://spawn.link

mergerfs gets killed on heavy reads #727

Closed nabnux closed 4 years ago

nabnux commented 4 years ago

General description

I run mergerfs on my Debian 10 VM to aggregate a 12TB and a 14TB disk. The same box also runs a Deluge daemon with ~200 torrents and a Plex Media Server.

The problem is that the mergerfs process gets randomly killed, which seems to happen when there are a lot of reads on the filesystem.

Expected behavior

mergerfs process should not be killed when I trigger a storage recheck on my torrents, or when Plex scans new media.

Actual behavior

The mergerfs process gets killed by SIGABRT, resulting in the Transport endpoint not connected message when trying to access the filesystem. No relevant logs are available in dmesg or journalctl.
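A dead mount like this can be detected from a script. The following is a minimal sketch, assuming GNU coreutils (`timeout`, `stat`); the function name `mount_is_alive` is hypothetical and not part of mergerfs. A crashed FUSE mount usually makes `stat` fail, while a hung one makes it block, hence the timeout. A process stuck in uninterruptible sleep may not be killable, so treat this as a best-effort check.

```shell
#!/bin/sh
# Hypothetical helper (not part of mergerfs): succeed only if the given
# path answers a stat() call within a time limit.
# A crashed FUSE mount typically makes stat fail with
# "Transport endpoint is not connected"; a hung mount makes it block.
mount_is_alive() {
    path=$1
    limit=${2:-5}   # seconds to wait before giving up on the mount
    # -k 1: send SIGKILL one second after the initial SIGTERM if stat
    # has still not exited.
    timeout -k 1 "$limit" stat -- "$path" >/dev/null 2>&1
}
```

It could be called from cron or a systemd timer, for example `mount_is_alive /home/deluge || echo "mergerfs mount is down"`, using whatever mountpoint applies.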

Precise steps to reproduce the behavior

The most consistent way I've found to trigger the process killing is to force a storage recheck on a big torrent (tens of GB) in Deluge.

System information

Please provide as much of the following information as possible:

This version has been built from the latest git commit to get the debug symbols, but I also encountered this issue with the packages from the stable debian repo (2.24.2-4) and the testing repo (2.28.1-1)

This is a virtual machine running under KVM, the hypervisor is also running the same Debian version and kernel.

Tell me if you need more info or context. Thanks.

trapexit commented 4 years ago

Thank you for the thorough report. Unfortunately this might be tough to track down. Many many people, including myself, are using mergerfs 24/7 under heavy load. I've over 1000 torrents running at 40MB/s+ constantly.

It might be worth changing the makefiles (in main and libfuse dirs) to be -O0 vs -O2. Maybe change some settings. Turn off caching. Maybe enable splice_read / splice_write.

nabnux commented 4 years ago

Thanks for your answer. I know that this may be very specific to my setup.

While rebuilding with -O0 did not help or give more info in the debug trace, disabling the cache with cache.files=off seems to solve the issue. I'm currently force rechecking all my torrents without problems so far. Do you think this option could cause issues with mmap and Deluge, as stated in the FAQ?

I will also try cache.files=libfuse and fiddling with splice option, and keep you updated.

trapexit commented 4 years ago

gave more info in the debug trace

It should provide more info in the stack trace. A lot is optimized out with O2.

I will also try cache.files=libfuse and fiddling with splice option, and keep you updated.

The default for cache.files is libfuse.

Curious that cache.files=off fixed things. There is very very little code that changes depending on that value. It's mostly a kernel thing. It could be happenstance. I'll scan over that code just in case.

Does Deluge use mmap? I believe I've tried it before and it worked fine on my system with caching disabled. If it does use mmap it must have a fallback to using standard io.

nabnux commented 4 years ago

Quick update: I encountered the issue even with cache.files=off, but it only happened once during the recheck of ~11TB of data so I'm okay with that. Unfortunately I was not running mergerfs under gdb this time so no trace.

Enabling any other value for files cache makes it crash much faster, and splice options did not change that behavior. Neither did switching the IO driver of the VM from virtio to ide.

Does Deluge use mmap? I believe I've tried it before and it worked fine on my system with caching disabled. If it does use mmap it must have a fallback to using standard io.

I confirm that Deluge works well with cache disabled.

I'll paste more useful info here if I get any. Feel free to close this issue anytime.

trapexit commented 4 years ago

I'll leave it open till we figure something out. I realize it might be a PITA but perhaps creating another VM with the same OS install? Or maybe run some RAM tests on the host machine? Over the past few years I've seen some really weird stuff that somehow manifests through mergerfs. Had someone with bad RAM (confirmed with a memtest86)... their system seemed absolutely fine otherwise even under heavy load. Had someone with a bad CPU that similarly would lead to crashes with mergerfs alone. Swapped out with an identical CPU and everything was fine. Not saying it's a hardware problem... but if we are running out of ideas it might be worth it.

It's not to say there isn't a mergerfs bug. It's totally possible that your setup is tickling something and most of us are luckily not. But if it is only your system able to trigger it then I'm probably not going to be able to help much. If you can recreate it in another VM then perhaps you could share it with me.

jeffgt14 commented 4 years ago

I have a similar configuration as you and was getting the same error constantly since I updated to mergerfs 2.29. I upgraded my kernel from 4.19 to 5.24 and it's been running great for a few days now. Not sure what got resolved if anything but that was my solution.

nabnux commented 4 years ago

@jeffgt14 thanks for the feedback, I should have thought about upgrading the kernel. I guess you meant kernel version 5.4? I did the upgrade, crossing my fingers now. Can I ask if you're also running mergerfs in a VM?

@trapexit if that does not solve my issue I'll do a memtest on the host server, and create another VM if needed. CPU swap won't be an option here unfortunately :)

trapexit commented 4 years ago

There is very little that changed between 2.28.3 and 2.29.0. Looking over the code I don't see anything that would lead to this. And I don't think the kernel should be impacting anything. If this is a bug it's probably something that's been there. I'm trying to reproduce this but no luck so far.

trapexit commented 4 years ago

I've a couple instances of find -type f -print -exec dd if={} of=/dev/null bs=1M status=progress \; running against a 2.29.0 instance since last night. Humming away just fine :-/
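For anyone wanting to run a similar read test in a script, here is a minimal sketch of a variant of the `find`/`dd` command above. It assumes GNU `dd` (for `bs=1M`) and file names without embedded newlines; the function name `read_all_files` is made up for illustration. Unlike the one-liner, it stops at the first failed read and names the file, which may help narrow down when the mount goes away.

```shell
#!/bin/sh
# Hypothetical helper: read every regular file under a directory once,
# discarding the data. Prints the number of files read on success and
# returns non-zero at the first file that cannot be read.
read_all_files() {
    dir=$1
    count=0
    list=$(mktemp) || return 1
    # One path per line; paths containing newlines are not handled.
    find "$dir" -type f -print > "$list"
    while IFS= read -r f; do
        if ! dd if="$f" of=/dev/null bs=1M 2>/dev/null; then
            echo "read failed: $f" >&2
            rm -f "$list"
            return 1
        fi
        count=$((count + 1))
    done < "$list"
    rm -f "$list"
    echo "$count"
}
```

Running `read_all_files /path/to/mergerfs/mount` in a loop would approximate the sustained read load described here.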

jeffgt14 commented 4 years ago

@nabnux yes sorry that should say kernel 5.4. I'm not running in a VM so nothing in common there but do have several torrents running and a subsonic server constantly scanning over my filesystem. I wish I could help more debugging anything, I just went straight to updating the kernel because I've been meaning to do it anyways and haven't had any issues since.

mergerfs version: 2.29.0 FUSE library version: 2.9.7-mergerfs_2.29.0 fusermount version: 2.9.7-mergerfs_2.29.0 using FUSE kernel interface version 7.31

uname -a

Linux dingle-server 5.4.24-1-MANJARO #1 SMP PREEMPT Thu Mar 5 20:29:25 UTC 2020 x86_64 GNU/Linux

mergerfs settings:

/mnt/data/disk1:/mnt/data/disk2:/mnt/data/disk3 /mnt/storage fuse.mergerfs defaults,use_ino,allow_other,noforget,cache.files=auto-full,cache.open=1,dropcacheonclose=true,ignorepponrename=true,cache.readdir=true,cache.statfs=60,minfreespace=6G,moveonenospc=true,cache.symlinks=true,fsname=mergerfs,category.create=mfs,func.getattr=newest 0 0

Transmission:

transmission-daemon 2.94

Airsonic:

Version 10.4.0-RELEASE – July 13, 2019 Server Apache Tomcat/8.5.42, java 1.8.0_242, Linux (273.7 MB / 914.5 MB)

I am also running an NFS server as well so I do get stale mounts every once in a while on the client side, but this issue was specifically about errors on the server.

nabnux commented 4 years ago

So with the latest 5.4 kernel available on Debian and cache.files=auto-full mergerfs was only killed once in two weeks, like with the older kernel and cache disabled. Not sure we can get to any conclusion with this.

However I've been bumping into another problem in parallel: sometimes the mergerfs mountpoint gets unresponsive, any access gets stuck forever (for example a simple ls). There are some kernel messages, not sure if related:

[Fri Apr  3 21:17:38 2020] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[Fri Apr  3 21:17:38 2020] BUG: unable to handle page fault for address: ffff9099b6cb4b00
[Fri Apr  3 21:17:38 2020] #PF: supervisor instruction fetch in kernel mode
[Fri Apr  3 21:17:38 2020] #PF: error_code(0x0011) - permissions violation
[Fri Apr  3 21:17:38 2020] PGD 12b801067 P4D 12b801067 PUD 13b356063 PMD 13675b063 PTE 8000000136cb4163
[Fri Apr  3 21:17:38 2020] Oops: 0011 [#6] SMP NOPTI
[Fri Apr  3 21:17:38 2020] CPU: 2 PID: 431 Comm: mergerfs Tainted: G      D W         5.4.0-4-amd64 #1 Debian 5.4.19-1
[Fri Apr  3 21:17:38 2020] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[Fri Apr  3 21:17:38 2020] RIP: 0010:0xffff9099b6cb4b00
[Fri Apr  3 21:17:38 2020] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00
[Fri Apr  3 21:17:38 2020] RSP: 0018:ffffa7c540657c30 EFLAGS: 00010286
[Fri Apr  3 21:17:38 2020] RAX: ffff9099b6cb4b00 RBX: ffff9099b63da370 RCX: 0000000000000000
[Fri Apr  3 21:17:38 2020] RDX: 0000000000000000 RSI: ffffa7c540b9bcd0 RDI: ffff9099b9cec600
[Fri Apr  3 21:17:38 2020] RBP: ffff9099b63da360 R08: ffff9099b63da3c0 R09: ffffa7c540657bd0
[Fri Apr  3 21:17:38 2020] R10: 0000000000001000 R11: ffffa7c540b9bd18 R12: ffff9099b9cec600
[Fri Apr  3 21:17:38 2020] R13: ffff9099b6cb4b00 R14: ffff9099b9cbcec0 R15: ffff9099b63da360
[Fri Apr  3 21:17:38 2020] FS:  00007fbbaeb05700(0000) GS:ffff9099bba80000(0000) knlGS:0000000000000000
[Fri Apr  3 21:17:38 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Apr  3 21:17:38 2020] CR2: ffff9099b6cb4b00 CR3: 0000000138cf2000 CR4: 00000000003406e0
[Fri Apr  3 21:17:38 2020] Call Trace:
[Fri Apr  3 21:17:38 2020]  ? fuse_request_end+0xbc/0x1f0 [fuse]
[Fri Apr  3 21:17:38 2020]  ? fuse_dev_do_write+0x25e/0xde0 [fuse]
[Fri Apr  3 21:17:38 2020]  ? ext4_da_write_end+0xbe/0x2d0 [ext4]
[Fri Apr  3 21:17:38 2020]  ? copyin+0x28/0x30
[Fri Apr  3 21:17:38 2020]  ? iov_iter_copy_from_user_atomic+0xc3/0x370
[Fri Apr  3 21:17:38 2020]  ? fuse_dev_write+0x53/0x90 [fuse]
[Fri Apr  3 21:17:38 2020]  ? do_iter_readv_writev+0x158/0x1d0
[Fri Apr  3 21:17:38 2020]  ? do_iter_write+0x7d/0x190
[Fri Apr  3 21:17:38 2020]  ? vfs_writev+0xa6/0xf0
[Fri Apr  3 21:17:38 2020]  ? do_writev+0x6b/0x110
[Fri Apr  3 21:17:38 2020]  ? do_syscall_64+0x52/0x160
[Fri Apr  3 21:17:38 2020]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fri Apr  3 21:17:38 2020] Modules linked in: kvm_amd ccp rng_core kvm irqbypass crct10dif_pclmul crc32_pclmul nft_ct nf_conntrack ghash_clmulni_intel nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c fuse aesni_intel nft_counter crypto_simd joydev virtio_balloon evdev cryptd glue_helper serio_raw pcspkr button qemu_fw_cfg nf_tables nfnetlink sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_mod ata_generic virtio_blk e1000 psmouse ehci_pci uhci_hcd ehci_hcd ata_piix libata i2c_piix4 usbcore crc32c_intel scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[Fri Apr  3 21:17:38 2020] CR2: ffff9099b6cb4b00
[Fri Apr  3 21:17:38 2020] ---[ end trace f5fa055ba08acd39 ]---
[Fri Apr  3 21:17:38 2020] RIP: 0010:0xffff9099b6cb1e00
[Fri Apr  3 21:17:38 2020] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00
[Fri Apr  3 21:17:38 2020] RSP: 0018:ffffa7c54065fc30 EFLAGS: 00010286
[Fri Apr  3 21:17:38 2020] RAX: ffff9099b6cb1e00 RBX: ffff9099b315e880 RCX: 0000000000000000
[Fri Apr  3 21:17:38 2020] RDX: 0000000000000000 RSI: ffffa7c540b83cd0 RDI: ffff9099b9cec600
[Fri Apr  3 21:17:38 2020] RBP: ffff9099b315e870 R08: ffff9099b315e8d0 R09: ffffa7c54065fbd0
[Fri Apr  3 21:17:38 2020] R10: 0000000000001000 R11: ffffa7c540b83d18 R12: ffff9099b9cec600
[Fri Apr  3 21:17:38 2020] R13: ffff9099b6cb1e00 R14: ffff9099b9cbcec0 R15: ffff9099b315e870
[Fri Apr  3 21:17:38 2020] FS:  00007fbbaeb05700(0000) GS:ffff9099bba80000(0000) knlGS:0000000000000000
[Fri Apr  3 21:17:38 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Apr  3 21:17:38 2020] CR2: ffff9099b6cb4b00 CR3: 0000000138cf2000 CR4: 00000000003406e0

mergerfs process:

# ps auxww | grep merger
root       425  8.6  0.3 457092 15296 ?        S<s  Apr02 279:05 mergerfs /mnt/data/vd* /home/deluge -o rw,allow_other,use_ino,cache.files=auto-full,func.getattr=newest,dropcacheonclose=true,fsname=mergerfs,dev,suid

strace last line (don't have the rest):

# strace -p 425
strace: Process 425 attached
futex(0x7ffd12c88fb0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

(I didn't do the memtest yet because I'm lazy :expressionless: )

trapexit commented 4 years ago

Sounds like a kernel bug. There really isn't much I can do but perhaps report it to the kernel maintainer.

trapexit commented 4 years ago

Please create a separate ticket if unrelated to this one.

Create a minimal reproducible situation? That's far too complex to make any other suggestions. Recent kernels have some known problems. Maybe start there.

taz-007 commented 4 years ago

However I've been bumping into another problem in parallel: sometimes the mergerfs mountpoint gets unresponsive, any access gets stuck forever (for example a simple ls). There are some kernel messages, not sure if related:

About the freeze, might want to take a look at https://github.com/trapexit/mergerfs/issues/708 or https://bugzilla.kernel.org/show_bug.cgi?id=206643 .

alpaca1thunder commented 4 years ago

Not sure how much help this is, but I had similar problems with openSUSE leap 15.1, which runs a similar kernel to Debian Buster. After trying a variety of solutions, I upgraded the kernel to 5.7.0-rc6 and haven't had any issues since, even after testing it under a variety of heavy loads. So I'm guessing it's a kernel issue as well.

alpaca1thunder commented 4 years ago

Not sure how much help this is, but I had similar problems with openSUSE leap 15.1, which runs a similar kernel to Debian Buster. After trying a variety of solutions, I upgraded the kernel to 5.7.0-rc6 and haven't had any issues since, even after testing it under a variety of heavy loads. So I'm guessing it's a kernel issue as well.

Sorry for the multiple posts, but unfortunately I spoke too soon: it crashed right after, under a not particularly high load. I can't seem to reproduce it. The only thing I remember the crashes having in common is that I was fiddling with my KVM machines at the time, but that shouldn't be related because they aren't even touching the disks. I haven't found anything relevant in journalctl or dmesg either. I use my mergerfs volume mainly with Docker containers, if that makes a difference.

Here's my current fstab entry:

/mnt/M01:/mnt/M02:/mnt/M03:/mnt/M04:/mnt/M05:/mnt/M06:/mnt/M07:/mnt/M08:/mnt/M09:/mnt/M10:/mnt/M11:/mnt/M12 /media/mergerfs/media  fuse.mergerfs defaults,func.getattr=newest,allow_other,use_ino,category.create=mfs,cache.files=partial,dropcacheonclose=true,moveonenospc=true,minfreespace=10G,fsname=mergerFS 0 0

I planned on doing a clean install today anyway. @trapexit, is there anything you suggest doing to get you some more detailed logs in the future? Sorry if it's already in your documentation somewhere. I'll compile from source from the latest git commit.

...Or maybe run some RAM tests on the host machine? Over the past few years I've seen some really weird stuff that somehow manifest through mergerfs. Had someone with bad RAM (confirmed with a memtest86)...

I'll try memtest86 as well, and maybe I can contribute something to the docs about debugging hardware first, for people with similar problems. I'm pretty determined to find out what's causing this.

trapexit commented 4 years ago

Can try a few things. Depends on your skill level. Easiest is to try earlier versions to see if anything changes. Otherwise build with debugging symbols and run it in gdb to catch where it crashes.

trapexit commented 4 years ago

If it can be replicated in a VM then I can do some testing myself. The problem has been that I haven't been able to reproduce this yet.

trapexit commented 4 years ago

I've got an Ubuntu 20.04 server VM up, using mergerfs 2.29.0 with debugging enabled (changed optimizations to -O0 and added -g in the makefile; I made this easier recently but I want to test the latest release).

In gdb I ran run -f -o use_ino,direct_io /home/user /tmp/test

In another terminal I've got: while true; do dd if=/dev/urandom of=/tmp/test/blah bs=1M count=1024 status=progress; done

Been running for several minutes without issue. Will report back if that changes. Could someone try something similar on their machine and, if it triggers the crash, reproduce it in a VM that I could then test with?
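For a scripted run, the write loop above could be bounded and made to check what it wrote. This is only a sketch, assuming GNU `dd`; the function name `write_verify_loop` and its parameters are hypothetical. Each round writes fresh random data to the target and compares it with a local copy using `cmp`. Reads may be served from the page cache depending on the caching settings, so a passing comparison does not prove the data reached the disk.

```shell
#!/bin/sh
# Hypothetical helper: write random data to a target file a fixed number
# of times and compare each write against a local reference copy.
# Usage: write_verify_loop TARGET [ROUNDS] [SIZE_IN_MIB]
write_verify_loop() {
    target=$1
    rounds=${2:-3}
    mib=${3:-1}
    src=$(mktemp) || return 1
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Generate the reference data, then copy it onto the target.
        if ! dd if=/dev/urandom of="$src" bs=1M count="$mib" 2>/dev/null ||
           ! dd if="$src" of="$target" bs=1M 2>/dev/null; then
            echo "write failed in round $i" >&2
            rm -f "$src"
            return 1
        fi
        # cmp -s is silent and exits non-zero when the files differ.
        if ! cmp -s "$src" "$target"; then
            echo "data mismatch in round $i" >&2
            rm -f "$src"
            return 1
        fi
        i=$((i + 1))
    done
    rm -f "$src"
    echo "$i rounds ok"
}
```

Something like `write_verify_loop /tmp/test/blah 100 1024` would roughly match the size used in the loop above, with the target path taken from that command.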

alpaca1thunder commented 4 years ago

Can try a few things. Depends on your skill level. Easiest is to try earlier versions to see if anything changes. Otherwise build with debugging symbols and run it in gdb to catch where it crashes.

Testing it right now with a clean install of Debian Buster and mergerfs version 2.24.2 (the one in the default repository); it's been good for 12 hours or so. Rechecking some data from when it crashed last time seems to be okay so far. I'll rebuild and run it with gdb and post the logs if/when it crashes. Thanks for replying!

trapexit commented 4 years ago

Ran all night. No luck crashing on Ubuntu 20.04.

alpaca1thunder commented 4 years ago

Ran all night. No luck crashing on Ubuntu 20.04.

Been running for a week with the older debian version, no issues. Glad to have it working well, but sorry I couldn't be of more help. I'll try and fire up a VM with a similar setup at some point and try and reproduce it for you.

trapexit commented 4 years ago

Closing this for now.