koverstreet / bcachefs

Other
643 stars 71 forks source link

Iterator overflow when attempting to replay large journal #649

Closed boomshroom closed 5 months ago

boomshroom commented 5 months ago

NixOS unstable; Linux 6.8-rc2; bcachefs-tools f15633cce1b79e708e9debc21c7b8772df7c7a29

I am uncertain how to replicate this issue, but it only appeared for me after forcing a reboot after a watchdog error that seems to have been reported on other filesystems. This was also after adding my 4th device.

When attempting to mount, bcachefs panics with a transaction_restart_too_many_iters error. This causes a kernel panic when attempting to mount with the kernel implementation, and an abort when doing an offline fsck with bcachefs-tools. Attempting to mount with ro,nochanges,norecovery appears to work just fine, and a subsequent online read only fsck reports no errors and exits cleanly.

When inspecting the backtraces given by GDB, the file system seems to be having trouble allocating buckets. This led me to try testing with fsck mounting the file system with metadata_replicas=1,data_replicas=1, as well as metadata_target set to my largest device with plenty of free space. While this seemed to change when the error happened, it didn't change the error itself.

All testing was performed with -n or nochanges to minimize risk of file system corruption and to provide consistent inputs to each attempt. However it is conceivable that even a partial fsck could potentially help. If you believe that is the case, please let me know.

`show-super`: ``` External UUID: 4d4903de-99c8-41a1-ab6a-83431f05b0dc Internal UUID: e6a58fab-3bb3-477a-8f1f-256d9d85f946 Magic number: c68573f6-66ce-90a9-d96a-60cf803df7ef Device index: 1 Label: Version: 1.4: member_seq Version upgrade complete: 1.4: member_seq Oldest version on disk: 1.3: rebalance_work Created: Fri Dec 29 06:23:27 2023 Sequence number: 379 Time of last write: Sun Feb 4 00:41:50 2024 Superblock size: 7568 Clean: 0 Devices: 4 Sections: members_v1,replicas_v0,disk_groups,clean,journal_seq_blacklist,journal_v2,counters,members_v2,errors,ext,downgrade Features: zstd,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes Compat features: alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done Options: block_size: 4.00 KiB btree_node_size: 256 KiB errors: continue [ro] panic metadata_replicas: 3 data_replicas: 2 metadata_replicas_required: 1 data_replicas_required: 1 encoded_extent_max: 256 KiB metadata_checksum: none crc32c crc64 [xxhash] data_checksum: none crc32c crc64 [xxhash] compression: none background_compression: zstd:15 str_hash: crc32c crc64 [siphash] metadata_target: none foreground_target: ssd background_target: hdd promote_target: ssd erasure_code: 0 inodes_32bit: 1 shard_inode_numbers: 1 inodes_use_key_cache: 1 gc_reserve_percent: 8 gc_reserve_bytes: 0 B root_reserve_percent: 0 wide_macs: 0 acl: 1 usrquota: 0 grpquota: 0 prjquota: 0 journal_flush_delay: 1000 journal_flush_disabled: 0 journal_reclaim_delay: 100 journal_transaction_names: 1 version_upgrade: [compatible] incompatible none nocow: 0 members_v2 (size 528): Device: 0 Label: nvme (1) UUID: 485acf5f-9532-4f5e-89b4-49e700b0be08 Size: 434 GiB read errors: 0 write errors: 0 checksum errors: 5533 seqread iops: 0 seqwrite iops: 0 randread iops: 0 randwrite iops: 0 Bucket size: 256 KiB First bucket: 0 Buckets: 1778668 Last mount: Fri Feb 2 23:03:03 2024 Last superblock write: 379 State: rw Data allowed: journal,btree,user Has data: journal,btree,user,cached Durability: 1 Discard: 1 Freespace initialized: 1 Device: 1 Label: sata (2) UUID: 9324a247-ac99-4af7-a82b-f119b57fbde3 Size: 466 GiB read errors: 0 write errors: 0 checksum errors: 216396 seqread iops: 0 seqwrite iops: 0 randread iops: 0 randwrite iops: 0 Bucket size: 256 KiB First bucket: 0 Buckets: 1907760 Last mount: Fri Feb 2 23:03:03 2024 Last superblock write: 379 State: rw Data allowed: journal,btree,user Has data: journal,btree,user,cached Durability: 1 Discard: 1 Freespace initialized: 1 Device: 2 Label: black (4) UUID: 222d1588-f04e-4845-bd8b-eb7d1cac22d1 Size: 932 GiB read errors: 0 write errors: 0 checksum errors: 7 seqread iops: 0 seqwrite iops: 0 randread iops: 0 randwrite iops: 0 Bucket size: 256 KiB First bucket: 0 Buckets: 3815478 Last mount: Fri Feb 2 23:03:03 2024 Last superblock write: 379 State: rw Data allowed: journal,btree,user Has data: journal,btree,user,cached Durability: 1 Discard: 1 Freespace initialized: 1 Device: 3 Label: green (5) UUID: fe30286b-b190-4e19-9f10-ec4abf57e00a Size: 1.82 TiB read errors: 0 write errors: 0 checksum errors: 3 seqread iops: 0 seqwrite iops: 0 randread iops: 0 randwrite iops: 0 Bucket size: 256 KiB First bucket: 0 Buckets: 7630916 Last mount: Fri Feb 2 23:03:03 2024 Last superblock write: 379 State: rw Data allowed: journal,btree,user Has data: journal,btree,user,cached Durability: 1 Discard: 1 Freespace initialized: 1 errors (size 104): journal_entry_replicas_not_marked 1 Fri Jan 19 22:20:23 2024 alloc_key_to_missing_lru_entry 138 Wed Jan 31 04:36:54 2024 bucket_gens_key_wrong 1085535 Wed Jan 31 04:35:13 2024 bucket_gens_hole_wrong 45 Sun Jan 28 08:59:45 2024 need_discard_freespace_key_bad 3627964 Sun Jan 28 09:00:15 2024 inode_unreachable 16 Wed Jan 31 04:37:09 2024 ```
`fs usage` (possible thanks to `nochanges,norecovery`): ``` Filesystem: 4d4903de-99c8-41a1-ab6a-83431f05b0dc Size: 3649620211200 Used: 2033523975168 Online reserved: 0 Data type Required/total Durability Devices reserved: 1/1 [] 168861696 btree: 1/3 3 [nvme0n1p3 sda sdc] 46257930240 btree: 1/3 3 [nvme0n1p3 sda sdb] 3782737920 user: 1/2 2 [nvme0n1p3 sdc] 3215839232 user: 1/1 1 [sdb] 330631680 user: 1/2 2 [nvme0n1p3 sda] 29035028480 user: 1/2 2 [sda sdc] 4138692608 user: 1/1 1 [sdc] 138863104 user: 1/2 2 [nvme0n1p3 sdb] 16288983040 user: 1/2 2 [sda sdb] 16796411392 user: 1/2 2 [sdb sdc] 1904763790848 cached: 1/1 1 [nvme0n1p3] 393283944448 cached: 1/1 1 [sdb] 1363968 cached: 1/1 1 [sda] 421215572992 cached: 1/1 1 [sdc] 102392320 hdd.black (device 2): sdb rw data buckets fragmented free: 29407838208 112182 sb: 3149824 13 258048 journal: 2147483648 8192 btree: 1260912640 4810 user: 967800822272 3690281 48991744 cached: 1363968 0 parity: 0 0 stripe: 0 0 need_gc_gens: 0 0 need_discard: 0 0 capacity: 1000204664832 3815478 hdd.green (device 3): sdc rw data buckets fragmented free: 1024887095296 3909634 sb: 3149824 13 258048 journal: 2147483648 8192 btree: 15419310080 58820 user: 957633909248 3653266 52239872 cached: 102392320 991 parity: 0 0 stripe: 0 0 need_gc_gens: 0 0 need_discard: 0 0 capacity: 2000398843904 7630916 ssd.nvme (device 0): nvme0n1p3 rw data buckets fragmented free: 18343002112 69973 sb: 3149824 13 258048 journal: 2147483648 8192 btree: 16680222720 63630 user: 24279286784 92965 90947584 cached: 393283944448 1543894 parity: 0 0 stripe: 0 0 need_gc_gens: 262144 1 need_discard: 0 0 capacity: 466267144192 1778668 ssd.sata (device 1): sda rw data buckets fragmented free: 19664732160 75015 sb: 3149824 13 258048 journal: 2147483648 8192 btree: 16680222720 63630 user: 24994222080 95702 93495296 cached: 421215572992 1665208 parity: 0 0 stripe: 0 0 need_gc_gens: 0 0 need_discard: 0 0 capacity: 500107837440 1907760 ```
`fsck -nvr` output log: ``` mounting version 1.4: member_seq opts=ro,metadata_replicas=3,data_replicas=2,metadata_checksum=xxhash,data_checksum=xxhash,background_compression=zstd:15,foreground_target=ssd,background_target=hdd,promote_target=ssd,degraded,verbose,fsck,fix_errors=no,ratelimit_errors,nochanges,read_only recovering from unclean shutdown starting journal read journal read done on device /dev/nvme0n1p3, ret 0 journal read done on device /dev/sda, ret 0 journal read done on device /dev/sdb, ret 0 journal read done on device /dev/sdc, ret 0 journal read done, replaying entries 91066762-91080059 Journal keys: 6801760 read, 2745680 after sorting and compacting alloc_read... done stripes_read... done snapshots_read... done check_allocations... done going read-write journal_replay...bch2_btree_update_start(): error journal_reclaim_would_deadlock bch2_btree_update_start(): error journal_reclaim_would_deadlock bch2_btree_update_start(): error journal_reclaim_would_deadlock bch2_btree_update_start(): error journal_reclaim_would_deadlock bch2_btree_update_start(): error journal_reclaim_would_deadlock bch2_btree_update_start(): error journal_reclaim_would_deadlock bch2_btree_update_start(): error journal_reclaim_would_deadlock bch2_btree_update_start(): error journal_reclaim_would_deadlock bch2_btree_update_start(): error journal_reclaim_would_deadlock in transaction restart: transaction_restart_too_many_iters, last restarted by 0x452da9S bcachefs: libbcachefs/btree_iter.c:1369: bch2_trans_in_restart_error: Assertion `0' failed. ```
GDB backtrace for `fsck -nvr` with `metadata_replicas=1,data_replicas=1` (chosen primarily for being on the main thread): ``` #5 0x00005555555f2a45 in bch2_trans_in_restart_error (trans=) at libbcachefs/btree_iter.c:1369 #6 0x00005555555f7664 in bch2_trans_verify_not_in_restart (trans=0x55555be4d000) at libbcachefs/btree_iter.h:309 #7 bch2_path_get (trans=trans@entry=0x55555be4d000, btree_id=btree_id@entry=BTREE_ID_freespace, pos=..., locks_want=locks_want@entry=0, level=level@entry=0, flags=, ip=ip@entry=93824992688791) at libbcachefs/btree_iter.c:1603 #8 0x00005555555c3b93 in bch2_trans_iter_init_common (locks_want=0, depth=0, ip=, flags=, pos=..., btree_id=11, iter=0x7fffffffbe70, trans=0x55555be4d000) at libbcachefs/btree_iter.h:471 #9 bch2_trans_iter_init (flags=0, pos=..., btree_id=11, iter=0x7fffffffbe70, trans=0x55555be4d000) at libbcachefs/btree_iter.h:485 #10 bch2_bucket_alloc_freelist (cl=0x0, s=0x7fffffffbdb0, watermark=BCH_WATERMARK_btree, ca=0x555555995400, trans=0x55555be4d000) at libbcachefs/alloc_foreground.c:474 #11 bch2_bucket_alloc_trans (trans=trans@entry=0x55555be4d000, ca=ca@entry=0x555555995400, watermark=watermark@entry=BCH_WATERMARK_btree, cl=cl@entry=0x0, usage=usage@entry=0x7fffffffc030) at libbcachefs/alloc_foreground.c:569 #12 0x00005555555c5753 in bch2_bucket_alloc_set_trans (trans=trans@entry=0x55555be4d000, ptrs=ptrs@entry=0x7fffffffc540, stripe=stripe@entry=0x7ffff7a9ec30, devs_may_alloc=devs_may_alloc@entry=0x7fffffffc2f8, nr_replicas=nr_replicas@entry=1, nr_effective=nr_effective@entry=0x7fffffffc4ec, have_cache=0x7fffffffc4eb, flags=0, data_type=BCH_DATA_btree, watermark=BCH_WATERMARK_btree, cl=0x0) at libbcachefs/alloc_foreground.c:744 #13 0x00005555555c5dfe in __open_bucket_add_buckets (trans=trans@entry=0x55555be4d000, ptrs=ptrs@entry=0x7fffffffc540, wp=wp@entry=0x7ffff7a9ebc0, devs_have=devs_have@entry=0x7fffffffc650, target=target@entry=257, erasure_code=erasure_code@entry=false, nr_replicas=1, nr_effective=0x7fffffffc4ec, have_cache=0x7fffffffc4eb, watermark=BCH_WATERMARK_btree, flags=0, _cl=0x0) at libbcachefs/alloc_foreground.c:994 #14 0x00005555555c6716 in open_bucket_add_buckets (erasure_code=0, cl=0x0, flags=, watermark=, have_cache=, nr_effective=, nr_replicas=, target=, devs_have=, wp=0x7ffff7a9ebc0, ptrs=, trans=) at libbcachefs/alloc_foreground.c:1038 #15 bch2_alloc_sectors_start_trans (trans=trans@entry=0x55555be4d000, target=257, erasure_code=erasure_code@entry=0, write_point=..., write_point@entry=..., devs_have=devs_have@entry=0x7fffffffc650, nr_replicas=1, nr_replicas_required=1, watermark=BCH_WATERMARK_btree, flags=0, cl=0x7fffffffc790, wp_ret=0x7fffffffc648) at libbcachefs/alloc_foreground.c:1353 #16 0x000055555560b80b in __bch2_btree_node_alloc (flags=403, interior_node=false, cl=0x7fffffffc790, res=0x555555989070, trans=0x55555be4d000) at libbcachefs/alloc_foreground.h:214 #17 bch2_btree_reserve_get (cl=, flags=, nr_nodes=, as=, trans=) at libbcachefs/btree_update_interior.c:489 #18 bch2_btree_reserve_get (trans=0x55555be4d000, as=0x555555989000, nr_nodes=, flags=, cl=0x7fffffffc790) at libbcachefs/btree_update_interior.c:465 #19 0x000055555560edf5 in bch2_btree_update_start (trans=trans@entry=0x55555be4d000, path=, level=, split=, split@entry=true, flags=, flags@entry=400) at libbcachefs/btree_update_interior.c:1173 #20 0x0000555555612084 in bch2_btree_split_leaf (trans=trans@entry=0x55555be4d000, path=, flags=flags@entry=400) at libbcachefs/btree_update_interior.c:1755 #21 0x0000555555607130 in bch2_trans_commit_error (trace_ip=, ret=-2193, i=, flags=, trans=0x55555be4d000) at libbcachefs/btree_trans_commit.c:893 #22 __bch2_trans_commit (trans=trans@entry=0x55555be4d000, flags=) at libbcachefs/btree_trans_commit.c:1109 #23 0x000055555567934e in bch2_trans_commit (flags=, journal_seq=0x0, disk_res=0x0, trans=0x55555be4d000) at libbcachefs/btree_update.h:167 #24 bch2_journal_replay (c=0x7ffff7a88000) at libbcachefs/recovery.c:176 #25 0x00005555556786f5 in bch2_run_recovery_pass (c=c@entry=0x7ffff7a88000, pass=pass@entry=BCH_RECOVERY_PASS_journal_replay) at libbcachefs/recovery.c:684 #26 0x000055555567b137 in bch2_run_recovery_passes (c=) at libbcachefs/recovery.c:701 #27 bch2_fs_recovery (c=0x7ffff7a88000) at libbcachefs/recovery.c:977 #28 0x000055555569a065 in bch2_fs_start (c=c@entry=0x7ffff7a88000) at libbcachefs/super.c:1012 #29 0x000055555569d85d in bch2_fs_open (devices=devices@entry=0x55555593cac0, nr_devices=nr_devices@entry=4, opts=...) at libbcachefs/super.c:2062 #30 0x00005555555afb8e in cmd_fsck (argc=, argv=) at c_src/cmd_fsck.c:175 #31 0x000055555559a049 in bcachefs::main () ```

I have also recorded the output of bcachefs list-journal, though I'm unsure of how to share it. If it would be helpful and you have an idea of how I should post it, please let me know.

boomshroom commented 5 months ago

This appears to have been resolved (though revealed other errors like deadlocks). Updating bcachefs-tools after the recent commit sync no longer triggered the specific problem mentioned in this issue.