koverstreet / bcachefs


Rebalancing and Filesystem hangs / stops writing data #677

Open EvilDragon opened 1 month ago

EvilDragon commented 1 month ago

Might be related to issue #673, but could be something different.

Some general information about my system:

Kernel: 6.8.9-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 02 May 2024 17:49:46 +0000 x86_64 GNU/Linux

bcachefs fs usage:

```
[root@shiranui ~]# bcachefs fs usage /data/Server/
Filesystem:       a82b31db-a070-4836-adb9-8cbc9da9d7de
Size:             46001864919040
Used:             33063644488704
Online reserved:  2174943232

Data type       Required/total  Durability  Devices
btree:          1/2             2           [sdc sdd]       97009532928
btree:          1/2             2           [sdd sda]       46555201536
btree:          1/2             2           [sdc sda]       35396780032
btree:          1/2             2           [sda sdb]        4547674112
user:           1/2             2           [sdc sdd]    21990185719296
user:           1/2             2           [sdd sdb]     5945180940800
user:           1/2             2           [sdc sdb]     1050840915456
user:           1/2             2           [sdc sda]      444664465408
user:           1/2             2           [sdd sda]     3414381998080
user:           1/2             2           [sda sdb]        6886801920
cached:         1/1             1           [sdb]             172820480

hdd.12tb (device 0):            sdc             rw
                                data        buckets    fragmented
  free:                 181705113600         346575
  sb:                        3149824              7        520192
  journal:                4294967296           8192
  btree:                 66203156480         135945    5071175680
  user:               11742861336576       22397729        167936
  cached:                          0              0
  parity:                          0              0
  stripe:                          0              0
  need_gc_gens:                    0              0
  need_discard:                    0              0
  capacity:           12000138625024       22888448

hdd.16tb (device 1):            sdd             rw
                                data        buckets    fragmented
  free:                 244386365440         466130
  sb:                        3149824              7        520192
  journal:                4294967296           8192
  btree:                 71782367232         147326    5458886656
  user:               15674973954560       29897641        512000
  cached:                          0              0
  parity:                          0              0
  stripe:                          0              0
  need_gc_gens:                    0              0
  need_discard:                    0              0
  capacity:           16000900661248       30519296

hdd.20tb (device 3):            sdb             rw
                                data        buckets    fragmented
  free:               16487426293760       15723635
  sb:                        3149824              4       1044480
  journal:                8589934592           8192
  btree:                  2273837056           2421     264765440
  user:                3501317553664        3339117        512000
  cached:                  172820480            679
  parity:                          0              0
  stripe:                          0              0
  need_gc_gens:                    0              0
  need_discard:                    0              0
  capacity:           20000588955648       19074048

ssd.2tb (device 2):             sda             rw
                                data        buckets    fragmented
  free:                    916455424            874
  sb:                        3149824              4       1044480
  journal:                8589934592           8192
  btree:                 43249827840          55218   14650441728
  user:                1932987996160        1843441
  cached:                          0              0
  parity:                          0              0
  stripe:                          0              0
  need_gc_gens:                    0              0
  need_discard:                    0              0
  capacity:            2000398843904        1907729
```

What did I do? (not sure it's necessary, but maybe it helps debugging)

My server was so far running with one 20TB HDD and one 16TB HDD, both on btrfs. The 16TB drive was just a backup of the most important stuff (nightly rsync).

My plan is to switch over to bcachefs with multiple drives, using replicas=2 and snapshots so that I can recover a file in case I accidentally delete it.
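For the snapshot part of the plan: bcachefs snapshots operate on subvolumes, so the data to be protected has to live in a subvolume first. A minimal sketch, assuming bcachefs-tools' `subvolume` subcommands; the paths are hypothetical:

```shell
# Hypothetical paths; assumes bcachefs-tools provides `subvolume create`
# and `subvolume snapshot`. Keep the data in a subvolume so it can be
# snapshotted:
bcachefs subvolume create /data/Server/files

# Take a copy-on-write snapshot, e.g. nightly; an accidentally deleted
# file can then be copied back out of the snapshot:
bcachefs subvolume snapshot /data/Server/files /data/Server/.snap-nightly
```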

I had another 12TB and an 8TB HDD at hand, which I wanted to use to increase the storage size and to copy stuff over.

My final planned setup was: bcachefs with the 20TB, 16TB and 12TB HDDs using replicas (I later had the idea to use a 2TB SSD as cache as well).

My steps to migrate to bcachefs were:

  1. Create a bcachefs filesystem with the 12TB and 16TB HDDs:

     ```
     bcachefs format \
         --label=ssd.16tb /dev/sda \
         --label=hdd.12tb /dev/sdb \
         --replicas=2 \
         --foreground_target=ssd \
         --promote_target=ssd \
         --background_target=hdd
     ```

     (I already created an SSD target because I planned to add a 2TB SSD in the next few days and I wasn't sure whether setting the targets would work well when done later.) Note: don't mind the drive device names; these got mixed up later when I switched and added devices, so they're not accurate with respect to what fs usage returned.

  2. Copy 12TB of data to the new bcachefs filesystem (yeah, that's pretty close to being full).

  3. Copy the remaining data to the 8TB HDD (temporary)

  4. Add the newly bought SSD and change the group of the existing HDD:

     ```
     bcachefs device add --label=ssd.2tb /data/Server /dev/sdd
     echo hdd.16tb > /sys/fs/bcachefs/a82b31db-a070-4836-adb9-8cbc9da9d7de/dev-1/label
     ```

  5. After that, I added the 20TB HDD as well:

     ```
     bcachefs device add --label=hdd.20tb /data/Server /dev/sdc
     ```

     This finalized my new bcachefs setup.

  6. Finally, I started to copy everything from the 8TB HDD over to the bcachefs array as well. This worked well for the first 5TB; after that, bcachefs started to act up, and since then I haven't been able to get it to work properly.

What's the problem now?

I've got multiple problems, but maybe they're related. I can mount the bcachefs filesystem without any problems and access it as well. However, the rebalance task hangs early on:

```
[  245.507447] INFO: task bch-rebalance/a:1676 blocked for more than 122 seconds.
[  245.507459]       Not tainted 6.8.9-arch1-1 #1
[  245.507464] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  245.507468] task:bch-rebalance/a state:D stack:0 pid:1676 tgid:1676 ppid:2 flags:0x00004000
[  245.507479] Call Trace:
[  245.507484]  <TASK>
[  245.507494]  __schedule+0x3e6/0x1520
[  245.507520]  schedule+0x32/0xd0
[  245.507530]  __closure_sync+0x82/0x160
[  245.507545]  bch2_write+0x1154/0x13b0 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  245.507691]  ? psi_group_change+0x213/0x3c0
[  245.507704]  ? srso_return_thunk+0x5/0x5f
[  245.507711]  ? psi_task_switch+0x122/0x230
[  245.507719]  ? srso_return_thunk+0x5/0x5f
[  245.507725]  ? local_clock_noinstr+0xd/0xb0
[  245.507732]  ? srso_return_thunk+0x5/0x5f
[  245.507737]  ? srso_return_thunk+0x5/0x5f
[  245.507747]  ? bch2_moving_ctxt_do_pending_writes+0x11c/0x230 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  245.507871]  bch2_moving_ctxt_do_pending_writes+0x11c/0x230 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  245.508002]  bch2_move_ratelimit+0x1d0/0x480 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  245.508127]  ? __pfx_autoremove_wake_function+0x10/0x10
[  245.508139]  do_rebalance+0x1a1/0x8c0 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  245.508261]  ? srso_return_thunk+0x5/0x5f
[  245.508267]  ? __lruvec_stat_mod_folio+0x81/0xa0
[  245.508291]  ? srso_return_thunk+0x5/0x5f
[  245.508297]  ? local_clock_noinstr+0xd/0xb0
[  245.508303]  ? srso_return_thunk+0x5/0x5f
[  245.508308]  ? srso_return_thunk+0x5/0x5f
[  245.508314]  ? bch2_trans_get+0x177/0x260 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  245.508414]  ? srso_return_thunk+0x5/0x5f
[  245.508422]  ? __pfx_bch2_rebalance_thread+0x10/0x10 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  245.508542]  bch2_rebalance_thread+0x66/0xb0 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  245.508664]  ? bch2_rebalance_thread+0x5c/0xb0 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  245.508794]  kthread+0xe8/0x120
[  245.508803]  ? __pfx_kthread+0x10/0x10
[  245.508812]  ret_from_fork+0x34/0x50
[  245.508821]  ? __pfx_kthread+0x10/0x10
[  245.508829]  ret_from_fork_asm+0x1b/0x30
[  245.508844]  </TASK>
```

It does that regularly. I take it the rebalancer should move some data from the 12TB and 16TB HDDs over to the 20TB HDD but hangs while doing that? That would explain why both of them are nearly full whereas the 20TB drive only holds a couple of TB (as you can see in the fs usage above).
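To check whether the rebalance thread is making any progress at all, the amount of queued rebalance work can be read from sysfs. The exact path is an assumption on my side (it matches the bcachefs sysfs layout I know of; the UUID is the one from fs usage above):

```shell
# Assumed sysfs layout: /sys/fs/bcachefs/<fs uuid>/internal/rebalance_work
# If this number never shrinks between reads, the thread is stuck rather
# than merely slow:
cat /sys/fs/bcachefs/a82b31db-a070-4836-adb9-8cbc9da9d7de/internal/rebalance_work
```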

The second problem is that the filesystem itself also hangs after a while when I try to write to it, especially with files of 1GB or larger. I was not able to finish copying the remaining 2-3TB of data from the 8TB temporary disk.

This is the dmesg:

```
[  480.540505] ------------[ cut here ]------------
[  480.540515] btree trans held srcu lock (delaying memory reclaim) for 15 seconds
[  480.540545] WARNING: CPU: 3 PID: 1670 at fs/bcachefs/btree_iter.c:2825 bch2_trans_srcu_unlock+0x120/0x130 [bcachefs]
[  480.540663] Modules linked in: xt_multiport ip6table_filter ip6_tables xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 x>
[  480.540831]  crypto_simd drm_exec snd_acp_config cryptd mdio_devres snd gpu_sched snd_soc_acpi sp5100_tco i2c_amd_mp2_pci drm_suballoc_helper rapl snd_pci_acp3x pcspkr acpi_cpufreq libphy soundcore k10t>
[  480.540943] CPU: 3 PID: 1670 Comm: bch-reclaim/a82 Not tainted 6.8.9-arch1-1 #1 b1154d98cdfe0792477fb31da8bf552e17e27d3c
[  480.540951] Hardware name: Default string Default string/Default string, BIOS 5.0.1.3 08/08/2019
[  480.540955] RIP: 0010:bch2_trans_srcu_unlock+0x120/0x130 [bcachefs]
[  480.541074] Code: 48 8b 15 33 f6 6c cf 48 c7 c7 e8 84 0c c2 48 b8 07 3a 6d a0 d3 06 3a 6d 48 29 ca 48 f7 e2 48 89 d6 48 c1 ee 07 e8 70 1a 98 cd <0f> 0b e9 5c ff ff ff 0f 0b e9 6b ff ff ff 66 90 90 90 90>
[  480.541081] RSP: 0018:ffffaf73cd54bbd8 EFLAGS: 00010282
[  480.541088] RAX: 0000000000000000 RBX: ffff9d37a1aa4000 RCX: 0000000000000027
[  480.541094] RDX: ffff9d3950ee19c8 RSI: 0000000000000001 RDI: ffff9d3950ee19c0
[  480.541098] RBP: ffff9d3659f00000 R08: 0000000000000000 R09: ffffaf73cd54ba68
[  480.541103] R10: ffffaf73cd54ba60 R11: 0000000000000003 R12: ffffaf73cd54bcd0
[  480.541108] R13: ffff9d37a1aa4000 R14: ffff9d3659f036d0 R15: ffff9d3659f26c40
[  480.541113] FS:  0000000000000000(0000) GS:ffff9d3950ec0000(0000) knlGS:0000000000000000
[  480.541120] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  480.541125] CR2: 000075b2fcd98000 CR3: 000000010b012000 CR4: 00000000003506f0
[  480.541131] Call Trace:
[  480.541138]  <TASK>
[  480.541143]  ? bch2_trans_srcu_unlock+0x120/0x130 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.541277]  ? __warn+0x81/0x130
[  480.541289]  ? bch2_trans_srcu_unlock+0x120/0x130 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.541391]  ? report_bug+0x171/0x1a0
[  480.541401]  ? prb_read_valid+0x1b/0x30
[  480.541409]  ? srso_return_thunk+0x5/0x5f
[  480.541420]  ? handle_bug+0x3c/0x80
[  480.541428]  ? exc_invalid_op+0x17/0x70
[  480.541435]  ? asm_exc_invalid_op+0x1a/0x20
[  480.541450]  ? bch2_trans_srcu_unlock+0x120/0x130 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.541562]  ? bch2_trans_srcu_unlock+0x120/0x130 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.541679]  bch2_trans_begin+0x63b/0x690 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.541793]  ? bch2_trans_begin+0xe5/0x690 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.541901]  ? srso_return_thunk+0x5/0x5f
[  480.541908]  ? __schedule+0x3ee/0x1520
[  480.541917]  ? sysvec_apic_timer_interrupt+0xe/0x90
[  480.541930]  bch2_btree_write_buffer_flush_locked+0x6b/0x980 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.542056]  ? __pfx_bch2_btree_write_buffer_journal_flush+0x10/0x10 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.542172]  btree_write_buffer_flush_seq+0x258/0x2a0 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.542286]  ? srso_return_thunk+0x5/0x5f
[  480.542292]  ? local_clock_noinstr+0xd/0xb0
[  480.542301]  ? __pfx_bch2_btree_write_buffer_journal_flush+0x10/0x10 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.542415]  bch2_btree_write_buffer_journal_flush+0x35/0x60 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.542530]  journal_flush_pins.constprop.0+0x1ad/0x2d0 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.542669]  bch2_journal_reclaim+0x1d1/0x360 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.542809]  bch2_journal_reclaim_thread+0x7f/0x170 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.542946]  ? __pfx_bch2_journal_reclaim_thread+0x10/0x10 [bcachefs 267e5ec8bb3d305ae65d058b667703fa6d81777c]
[  480.543079]  kthread+0xe8/0x120
[  480.543089]  ? __pfx_kthread+0x10/0x10
[  480.543098]  ret_from_fork+0x34/0x50
[  480.543106]  ? __pfx_kthread+0x10/0x10
[  480.543114]  ret_from_fork_asm+0x1b/0x30
[  480.543131]  </TASK>
[  480.543134] ---[ end trace 0000000000000000 ]---
[ 2468.010057] ------------[ cut here ]------------
```

After that happens, I can still read from the filesystem without any problem, but trying to write anything to it simply hangs. Additionally, the cache doesn't empty anymore (also see the fs usage above), so the copied data is never written to disk (the files have a size of 0 after a reboot).

This also makes it impossible to unmount or sync the filesystem, so I can only do a forced shutdown.
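Since the hung-task watchdog only fires every couple of minutes, the stacks of all stuck writers can be dumped on demand via the standard kernel sysrq facility before doing the forced shutdown, which might give more useful logs:

```shell
# Dump backtraces of all uninterruptible (D-state) tasks into the kernel log:
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# If sysrq is restricted on the system, enable it first (as root):
# echo 1 > /proc/sys/kernel/sysrq
```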

After a reboot, the filesystem mounts again without any problems, the rebalancer crashes again and copying more files to the bcachefs filesystem also crashes again after a few seconds.

One thing I noticed is that the filesystem doesn't hang if I copy a directory FROM the bcachefs array TO the bcachefs array (the same one). That reliably creates copies, regardless of the size.

Not sure what the difference here is.

Let me know if I should post any more output / logs.

Any ideas how to fix this?

EvilDragon commented 1 month ago

Seems to be related to / a duplicate of #680.