koverstreet / bcachefs

Other
662 stars 70 forks source link

Filesystem lockup in bch2_write_iter while rsync+qemu when using nocow - a96f1dd35d3d #592

Closed ticpu closed 11 months ago

ticpu commented 11 months ago

Version 6.4.4.arch1.r1189971.a96f1dd35d3d

Generic info Provide the output of:

Kernel bugs Current config:

Kernel log with hung tasks and lock debug info dmesg.txt

Optional Advanced

PS: I'm on IRC as TiCPU but you'll have to tag @tiky for the notification to show in my Discord.

ticpu commented 11 months ago

Might or might not be related, but this morning on Kernel 6.5.1.arch1.r1204718.59180b370b79, it remounted read-only and qemu was stuck in Zombie state, which usually never happens. I see it is related to nocow and locking in the stack which correlates with qemu (the only nocow folder in the whole FS). I'll move back my VM to Btrfs for now.

[84166.479789] invalid journal entry, version=1.2: deleted_inodes type=btree_keys seq=451665 offset=866/2789: 
[84166.479791]   u64s 11 type extent 268757858:105368:4294967291 len 8 ver 8786712: durability: 2 crc: c_size 8 size 8 offset 0 nonce 72 csum chacha20_poly1305_80 compress incompre
ssible ptr: 4:392330:328 gen 0 crc: c_size 8 size 8 offset 0 nonce 0 csum chacha20_poly1305_80 compress incompressible ptr: 3:298653:344 gen 4
[84166.479792]   incorrect nonce, shutting down
[84166.479796] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): inconsistency detected - emergency read only

❯ pstree -t -p 5865
qemu-system-x86(5865)─┬─{vhost-5865}(5871)
                      └─{worker}(5987)

❯ cat /proc/5987/stack
[<0>] __bch2_bucket_nocow_lock+0xdf/0x140 [bcachefs]
[<0>] bch2_nocow_write+0x7a6/0x11d0 [bcachefs]
[<0>] __bch2_write+0xd03/0x1360 [bcachefs]
[<0>] __bch2_writepage+0x34e/0x710 [bcachefs]
[<0>] write_cache_pages+0x1de/0x450
[<0>] bch2_writepages+0x88/0xe0 [bcachefs]
[<0>] do_writepages+0xcf/0x1e0
[<0>] filemap_fdatawrite_wbc+0x63/0x90
[<0>] __filemap_fdatawrite_range+0x9b/0xc0
[<0>] file_write_and_wait_range+0x51/0xc0
[<0>] bch2_fsync+0x27/0x80 [bcachefs]
[<0>] __x64_sys_fdatasync+0x52/0xa0
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
ticpu commented 11 months ago

After rebooting for getting rid of the zombies, the FS didn't mount.

[  155.030870] BTRFS info: devid 2 device path /dev/mapper/vgP4-root2 changed to /dev/dm-12 scanned by (udev-worker) (2394)
[  155.031705] BTRFS info: devid 2 device path /dev/dm-12 changed to /dev/mapper/vgP4-root2 scanned by (udev-worker) (2394)
[  155.036423] BTRFS info: devid 1 device path /dev/mapper/vgP4-root1 changed to /dev/dm-11 scanned by (udev-worker) (2457)
[  155.037303] BTRFS info: devid 1 device path /dev/dm-11 changed to /dev/mapper/vgP4-root1 scanned by (udev-worker) (2457)
[  199.731688] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): mounting version 1.2: deleted_inodes opts=metadata_replicas=3,data_replicas=2,metadata_replicas_required=2,background_compression=zstd,metadata_target=nvme,foreground_target=nvme,background_target=hdd,promote_target=nvme
[  199.731735] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): recovering from unclean shutdown
[  206.186119] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): ja->sectors_free == ca->mi.bucket_size
[  206.186122] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): cur_idx 0/3200
[  206.186123] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): bucket_seq[3199] = 430162
[  206.186124] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): bucket_seq[0] = 430165
[  206.186125] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): bucket_seq[1] = 449496
[  211.254051] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): journal read done, replaying entries 451661-451675
[  211.254058] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): dropped unflushed entries 451676-451676
[  211.254333] invalid journal entry, version=1.2: deleted_inodes type=btree_keys seq=451665 offset=866/2789: 
[  211.254336]   u64s 11 type extent 268757858:105368:4294967291 len 8 ver 8786712: durability: 2 crc: c_size 8 size 8 offset 0 nonce 72 csum chacha20_poly1305_80 compress incompressible ptr: 4:392330:328 gen 0 crc: c_size 8 size 8 offset 0 nonce 0 csum chacha20_poly1305_80 compress incompressible ptr: 3:298653:344 gen 4
[  211.254338]   incorrect nonce, exiting
[  211.254341] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): Unable to continue, halting
[  211.346499] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): alloc_read... done
[  211.386397] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): stripes_read... done
[  211.386401] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): snapshots_read... done
[  211.386410] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): journal_replay...
[  211.386445] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): going read-write
[  217.740021] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): corrupt btree node before write at btree extents level 0/2
[  217.740027]   u64s 13 type btree_ptr_v2 268757858:238792:4294967291 len 0 ver 0: seq c24236a65f2e8e2b written 368 min_key 268757280:8:U32_MAX durability: 3 ptr: 1:7260:0 gen 12 ptr: 0:8256:0 gen 11 ptr: 2:338808:0 gen 9
[  217.740031]   node offset 368 bset u64s 90: invalid bkey:  incorrect nonce
[  217.740033]   u64s 11 type extent 268757858:105368:4294967291 len 8 ver 8786712: durability: 2 crc: c_size 8 size 8 offset 0 nonce 72 csum chacha20_poly1305_80 compress incompressible ptr: 4:392330:328 gen 0 crc: c_size 8 size 8 offset 0 nonce 0 csum chacha20_poly1305_80 compress incompressible ptr: 3:298653:344 gen 4
[  217.740383] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): inconsistency detected - emergency read only
[  217.740391] CPU: 5 PID: 2665 Comm: bcachefs Tainted: P           OE      6.5.1.arch1.r1204718.59180b370b79 #1 716e84a726fd8842031d3eb4f38f44397db10c41
[  217.740403] Hardware name: Supermicro C9X299-PG300F/C9X299-PG300F, BIOS 2.4 09/08/2021
[  217.740408] Call Trace:
[  217.740412]  <TASK>
[  217.740416]  dump_stack_lvl+0x47/0x60
[  217.740433]  validate_bset_for_write+0xc4/0x150 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.740531]  __bch2_btree_node_write+0xbc1/0xce0 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.740583]  bch2_btree_node_write+0x65/0x120 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.740632]  __btree_node_flush+0xe8/0x130 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.740685]  ? __pfx_bch2_btree_node_flush0+0x10/0x10 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.740738]  journal_flush_pins.constprop.0+0x1a3/0x2a0 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.740805]  journal_flush_done+0xc1/0xd0 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.740869]  bch2_journal_flush_pins+0xd0/0x130 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.740933]  bch2_journal_replay+0x167/0x430 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.740997]  bch2_fs_recovery+0x11ed/0x1600 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.741050]  ? bch2_printbuf_exit+0x20/0x30 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.741103]  ? print_mount_opts+0x2ae/0x3d0 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.741152]  ? __bch2_sb_field_resize+0x6f/0x110 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.741201]  bch2_fs_start+0x432/0x480 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.741250]  bch2_fs_open+0x3ab/0x4b0 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.741300]  bch2_mount+0x53a/0x740 [bcachefs e30c856e130dc1c6b066081404bce608cd65fa74]
[  217.741354]  legacy_get_tree+0x28/0x50
[  217.741357]  vfs_get_tree+0x26/0xd0
[  217.741360]  path_mount+0x48e/0xaa0
[  217.741363]  __x64_sys_mount+0x11a/0x150
[  217.741366]  do_syscall_64+0x5d/0x90
[  217.741369]  ? exc_page_fault+0x7f/0x180
[  217.741371]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  217.741374] RIP: 0033:0x7f143471204e
[  217.741398] Code: 48 8b 0d e5 bc 12 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 bc 12 00 f7 d8 64 89 01 48
[  217.741399] RSP: 002b:00007ffd858bbb18 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[  217.741401] RAX: ffffffffffffffda RBX: 0000562547252460 RCX: 00007f143471204e
[  217.741402] RDX: 0000562547252460 RSI: 0000562547250b60 RDI: 0000562547254790
[  217.741403] RBP: 0000562547254790 R08: 0000000000000000 R09: 0000000000000051
[  217.741404] R10: 0000000000000400 R11: 0000000000000246 R12: 0000000047254701
[  217.741405] R13: 0000562547255bb0 R14: 0000000000000008 R15: 0000000000000062
[  217.741407]  </TASK>
[  217.744218] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): bch2_journal_replay(): error EIO
[  217.744254] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): bch2_fs_recovery(): error EIO
[  217.744255] bcachefs (d0337b38-cbe7-4c3c-90f7-4305fbe05171): bch2_fs_start(): error starting filesystem EIO

I updated the journal dump, I compressed it and it is ready to share. I made pictures of the bcachefs fsck, it deleted many inodes that aren't related to the qemu image (pyc files). The FS is now working but I'm not touching that nocow folder until further instructions.

ticpu commented 11 months ago

IMG_20230920_094519 IMG_20230920_093936 IMG_20230920_094031 Here is the interesting logs from the bcachefs fsck.

ticpu commented 11 months ago

After bcachefs fsck, bcachefs fs usage shows that the 2 nvme devices finally have to free space and are using the cached section for the first time in days.

user:           1/1             [dm-18]                   6062285312
user:           1/2             [dm-20 dm-19]           495145382400
user:           1/1             [dm-21]                   2196922368
user:           1/1             [dm-20]                  16046017024
user:           1/2             [dm-19 dm-18]           102227339264
user:           1/1             [dm-22]                   1580744704
user:           1/1             [dm-19]                  16563623936
user:           1/2             [dm-21 dm-22]              549470208
user:           1/2             [dm-20 dm-18]            92125683200
cached:         1/1             [dm-22]                  17027820544
cached:         1/1             [dm-21]                  16893637120
hdd.1 (device 2):              dm-20              rw
  free:                            0          838650
  btree:                  5090836480           19420
  user:                 309681739264         1181725     100469760
  cached:                          0               0
  need_discard:                    0               0
  capacity:             536870912000         2048000
hdd.2 (device 3):              dm-19              rw
  free:                            0          836821
  btree:                           0               0
  user:                 315249963008         1202974     102546432
  cached:                          0               0
  need_discard:                    0               0
  capacity:             536870912000         2048000
hdd.3 (device 4):              dm-18              rw
  free:                            0           94317
  btree:                           0               0
  user:                 103238628864          393990      43519488
  cached:                          0               0
  need_discard:                    0               0
  capacity:             128849018880          491520
nvme.1 (device 0):             dm-21              rw
  free:                            0          233685
  btree:                  5090836480           19420
  user:                   2471657472            9494      17137664
  cached:                16893637120           64428
  need_discard:                    0               0
  capacity:              85899345920          327680
nvme.2 (device 1):             dm-22              rw
  free:                            0          235519
  btree:                  5090836480           19420
  user:                   1855479808            7137      15441920
  cached:                17027820544           64950
  need_discard:                    0               1
  capacity:              85899345920          327680
koverstreet commented 11 months ago

So: we ended up with a replicated encrypted extent where the nonces for the different replicas didn't line up - that's not supposed to happen.

And we don't know where it came from, and it appears this got persisted into the journal (definitely not supposed to happen!).

So: I've made the bkey invalid check in the transaction commit path an always-on thing, not a debug mode check, since that would have caught this earlier, where we could've found out where it came from.

If you can dump your filesystem metadata for me, we might be able to get some useful information out of the journal, as well.

Hit me up on IRC, we'll see what we can do

ticpu commented 11 months ago

Upgraded to kernel b56e84bccab0 and I am still running the workload without any more negative locks (MIN_INT) shown in /sys/fs/bcachefs/d0337b38-cbe7-4c3c-90f7-4305fbe05171/internal/nocow_lock_table