koverstreet / bcachefs

Other
686 stars 71 forks source link

Unable to mount: "journal write overran available space" #720

Closed Lykos153 closed 1 month ago

Lykos153 commented 2 months ago

After a couple of days with recurring #717 I am now no longer able to mount the fs at all. Things I did in the meantime:

Unfortunately, during rereplicate, I hit another emergency read only and since then the file system seems to be completely borked. Mounting it fails with

[  331.445932] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): bch2_journal_write_prep(): fatal error : journal write overran available space, 262200 > 262144 (extra 743 reserved 734/734)
[  331.445940] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): fatal error - emergency read only

even though I am very sure that there are several gigabytes of free space available.

mount.log

koverstreet commented 2 months ago

this is probably fixed on 6.11; we're no longer storing accounting in the journal (it's got its own btree now), which was the most likely culprit.

Are you able to test on 6.11?

Lykos153 commented 1 month ago

Hey there :) I was away during August. I just got back and tried to mount the FS with 6.11rc5 and indeed, it works! Thank you very much!

It took quite some time (~4 minutes) between check_allocations and going read-write, not sure if that is related. EDIT: A consecutive mount was almost instant, so that seems to have been a one-off.

And I still get some soft lockups, but I guess that's a different issue.

FWIW the full log of the successful mount:

dmesg ``` [ 120.648710] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): starting version 1.7: mi_btree_bitmap opts=metadata_replicas=3,data_replicas=2,compression=lz4,foreground_target=ssd,background_target=hdd,promote_target=ssd [ 120.648717] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): recovering from unclean shutdown [ 120.648721] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): Doing compatible version upgrade from 1.7: mi_btree_bitmap to 1.12: rebalance_work_acct_fix running recovery passes: check_allocations [ 178.144242] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): journal read done, replaying entries 3663408-3668754 [ 178.144247] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): dropped unflushed entries 3668755-3668775 [ 208.169087] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [bcachefs:3573] [ 208.169095] Modules linked in: poly1305_generic libpoly1305 poly1305_x86_64 chacha_generic chacha_x86_64 libchacha bcachefs lz4hc_compress lz4_compress xor raid6_pq rfcomm ccm af_packet cmac algif_hash algif_skcipher af_alg bnep nf_log_syslog nft_log nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nf_tables libcrc32c msr nls_iso8859_1 intel_rapl_msr iwlmvm nls_cp437 intel_rapl_common vfat fat cmdlinepart x86_pkg_temp_thermal intel_powerclamp spi_nor mac80211 mtd coretemp i915 libarc4 btusb crct10dif_pclmul crc32_pclmul nouveau polyval_clmulni mei_pxp btrtl mei_hdcp btintel spi_intel_platform polyval_generic iwlwifi at24 ghash_clmulni_intel spi_intel iTCO_wdt btbcm intel_pmc_bxt btmtk watchdog rapl snd_hda_codec_realtek bluetooth cfg80211 snd_hda_codec_hdmi snd_hda_codec_generic intel_cstate snd_hda_scodec_component intel_uncore mxm_wmi drm_gpuvm drm_exec snd_hda_intel gpu_sched snd_intel_dspcfg drm_ttm_helper drm_buddy snd_intel_sdw_acpi snd_hda_codec ttm rfkill [ 208.169159] drm_display_helper snd_hda_core cec snd_hwdep snd_pcm intel_gtt snd_timer mousedev mac_hid joydev evdev uas i2c_algo_bit snd mei_me e1000e video soundcore i2c_i801 mei backlight i2c_mux i2c_smbus ptp wmi pps_core tiny_power_button lpc_ich button sch_fq_codel uinput atkbd libps2 serio vivaldi_fmap loop cpufreq_ondemand tun tap macvlan bridge stp llc kvm_intel kvm fuse configfs efi_pstore nfnetlink dmi_sysfs ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted asn1_encoder tee tpm rng_core libaescfb ecdh_generic ecc input_leds led_class hid_generic usbhid hid usb_storage sd_mod ahci libahci nvme xhci_pci crc32c_intel nvme_core sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel xhci_pci_renesas gf128mul crypto_simd libata scsi_mod cryptd nvme_auth xhci_hcd scsi_common ehci_pci ehci_hcd rtc_cmos dm_mod dax efivarfs autofs4 [ 208.169211] CPU: 3 UID: 0 PID: 3573 Comm: bcachefs Not tainted 6.11.0-rc5 #1-NixOS [ 208.169214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z87 Extreme4, BIOS P3.50 03/11/2018 [ 208.169215] RIP: 0010:journal_sort_key_cmp+0x61/0x100 [bcachefs] [ 208.169323] Code: 00 85 c0 74 13 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9 c3 cc cc cc cc 4c 8b 4f 10 4c 8b 42 10 31 c0 49 8b 71 20 49 8b 48 20 <48> 39 f1 0f 92 c0 48 39 ce 83 d8 00 85 c0 75 cb 49 8b 71 18 49 8b [ 208.169324] RSP: 0018:ffffbb7f0238ba88 EFLAGS: 00000246 [ 208.169326] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff000039800000 [ 208.169327] RDX: ffffbb7f7dc001e0 RSI: ffff000039800000 RDI: ffffbb7f7dc001c8 [ 208.169328] RBP: 0000000000000018 R08: ffff9d2c19d601e8 R09: ffff9d2c189e2b70 [ 208.169329] R10: 0000000000000000 R11: 0000000019dabf10 R12: ffffbb7f0238bb20 [ 208.169330] R13: 00000000000001c8 R14: 00000000000001e0 R15: ffffbb7f7dc00000 [ 208.169331] FS: 00007f3928ba4c80(0000) GS:ffff9d3017780000(0000) knlGS:0000000000000000 [ 208.169332] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 208.169333] CR2: 000000001cc74078 CR3: 000000013955e001 CR4: 00000000001706f0 [ 208.169335] Call Trace: [ 208.169337] [ 208.169340] ? watchdog_timer_fn+0x1f7/0x280 [ 208.169344] ? __pfx_watchdog_timer_fn+0x10/0x10 [ 208.169347] ? __hrtimer_run_queues+0x112/0x2b0 [ 208.169349] ? hrtimer_interrupt+0xfa/0x250 [ 208.169352] ? __sysvec_apic_timer_interrupt+0x58/0x150 [ 208.169355] ? sysvec_apic_timer_interrupt+0x6e/0x80 [ 208.169359] [ 208.169359] [ 208.169360] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ 208.169366] ? journal_sort_key_cmp+0x61/0x100 [bcachefs] [ 208.169430] ? __pfx_journal_sort_key_cmp+0x10/0x10 [bcachefs] [ 208.169482] sort_r+0x10c/0x3a0 [ 208.169486] sort+0x31/0x60 [ 208.169487] ? __pfx_journal_sort_key_cmp+0x10/0x10 [bcachefs] [ 208.169533] __journal_keys_sort+0x27/0x180 [bcachefs] [ 208.169576] bch2_journal_keys_sort+0x388/0x3d0 [bcachefs] [ 208.169618] bch2_fs_recovery+0x59f/0x1460 [bcachefs] [ 208.169675] ? vprintk_emit+0xc6/0x280 [ 208.169680] ? __bch2_print+0x87/0xf0 [bcachefs] [ 208.169730] ? bch2_printbuf_exit+0x22/0x40 [bcachefs] [ 208.169782] ? print_mount_opts+0x131/0x190 [bcachefs] [ 208.169831] ? bch2_recalc_capacity+0x106/0x370 [bcachefs] [ 208.169865] bch2_fs_start+0x169/0x270 [bcachefs] [ 208.169914] bch2_fs_get_tree+0x600/0x750 [bcachefs] [ 208.169966] vfs_get_tree+0x29/0xf0 [ 208.169970] path_mount+0x4d8/0xb10 [ 208.169975] __x64_sys_mount+0x122/0x160 [ 208.169978] do_syscall_64+0xb7/0x210 [ 208.169982] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 208.169985] RIP: 0033:0x7f3928cd470e [ 208.169999] Code: 48 8b 0d 1d 67 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ea 66 0d 00 f7 d8 64 89 01 48 [ 208.170000] RSP: 002b:00007ffd5c4d4d18 EFLAGS: 00000297 ORIG_RAX: 00000000000000a5 [ 208.170002] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f3928cd470e [ 208.170003] RDX: 00005610dc6b5160 RSI: 00005610dc6b2e60 RDI: 00005610dc6b6e90 [ 208.170004] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 208.170005] R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000000 [ 208.170006] R13: 0000000000000055 R14: 00005610dc6b6e90 R15: 00005610dc6b5160 [ 208.170008] [ 236.168628] watchdog: BUG: soft lockup - CPU#3 stuck for 52s! [bcachefs:3573] [ 236.168634] Modules linked in: poly1305_generic libpoly1305 poly1305_x86_64 chacha_generic chacha_x86_64 libchacha bcachefs lz4hc_compress lz4_compress xor raid6_pq rfcomm ccm af_packet cmac algif_hash algif_skcipher af_alg bnep nf_log_syslog nft_log nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nf_tables libcrc32c msr nls_iso8859_1 intel_rapl_msr iwlmvm nls_cp437 intel_rapl_common vfat fat cmdlinepart x86_pkg_temp_thermal intel_powerclamp spi_nor mac80211 mtd coretemp i915 libarc4 btusb crct10dif_pclmul crc32_pclmul nouveau polyval_clmulni mei_pxp btrtl mei_hdcp btintel spi_intel_platform polyval_generic iwlwifi at24 ghash_clmulni_intel spi_intel iTCO_wdt btbcm intel_pmc_bxt btmtk watchdog rapl snd_hda_codec_realtek bluetooth cfg80211 snd_hda_codec_hdmi snd_hda_codec_generic intel_cstate snd_hda_scodec_component intel_uncore mxm_wmi drm_gpuvm drm_exec snd_hda_intel gpu_sched snd_intel_dspcfg drm_ttm_helper drm_buddy snd_intel_sdw_acpi snd_hda_codec ttm rfkill [ 236.168692] drm_display_helper snd_hda_core cec snd_hwdep snd_pcm intel_gtt snd_timer mousedev mac_hid joydev evdev uas i2c_algo_bit snd mei_me e1000e video soundcore i2c_i801 mei backlight i2c_mux i2c_smbus ptp wmi pps_core tiny_power_button lpc_ich button sch_fq_codel uinput atkbd libps2 serio vivaldi_fmap loop cpufreq_ondemand tun tap macvlan bridge stp llc kvm_intel kvm fuse configfs efi_pstore nfnetlink dmi_sysfs ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted asn1_encoder tee tpm rng_core libaescfb ecdh_generic ecc input_leds led_class hid_generic usbhid hid usb_storage sd_mod ahci libahci nvme xhci_pci crc32c_intel nvme_core sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel xhci_pci_renesas gf128mul crypto_simd libata scsi_mod cryptd nvme_auth xhci_hcd scsi_common ehci_pci ehci_hcd rtc_cmos dm_mod dax efivarfs autofs4 [ 236.168740] CPU: 3 UID: 0 PID: 3573 Comm: bcachefs Tainted: G L 6.11.0-rc5 #1-NixOS [ 236.168742] Tainted: [L]=SOFTLOCKUP [ 236.168743] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z87 Extreme4, BIOS P3.50 03/11/2018 [ 236.168744] RIP: 0010:journal_sort_key_cmp+0x71/0x100 [bcachefs] [ 236.168802] Code: 45 31 c9 c3 cc cc cc cc 4c 8b 4f 10 4c 8b 42 10 31 c0 49 8b 71 20 49 8b 48 20 48 39 f1 0f 92 c0 48 39 ce 83 d8 00 85 c0 75 cb <49> 8b 71 18 49 8b 48 18 31 c0 48 39 f1 0f 92 c0 48 39 ce 83 d8 00 [ 236.168803] RSP: 0018:ffffbb7f0238ba88 EFLAGS: 00000246 [ 236.168805] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001 [ 236.168806] RDX: ffffbb7f7dc359d0 RSI: 0000000000000001 RDI: ffffbb7f7dc359b8 [ 236.168807] RBP: 0000000000000018 R08: ffff9d2ba6d18bc8 R09: ffff9d2bacd098e8 [ 236.168808] R10: 0000000000000018 R11: 0000000019dabf10 R12: ffffbb7f0238bb20 [ 236.168808] R13: 00000000000359b8 R14: 00000000000359d0 R15: ffffbb7f7dc00000 [ 236.168809] FS: 00007f3928ba4c80(0000) GS:ffff9d3017780000(0000) knlGS:0000000000000000 [ 236.168811] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 236.168812] CR2: 000000001cc74078 CR3: 000000013955e001 CR4: 00000000001706f0 [ 236.168813] Call Trace: [ 236.168816] [ 236.168818] ? watchdog_timer_fn+0x1f7/0x280 [ 236.168821] ? __pfx_watchdog_timer_fn+0x10/0x10 [ 236.168823] ? __hrtimer_run_queues+0x112/0x2b0 [ 236.168826] ? hrtimer_interrupt+0xfa/0x250 [ 236.168828] ? __sysvec_apic_timer_interrupt+0x58/0x150 [ 236.168831] ? sysvec_apic_timer_interrupt+0x6e/0x80 [ 236.168834] [ 236.168834] [ 236.168835] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ 236.168839] ? journal_sort_key_cmp+0x71/0x100 [bcachefs] [ 236.168881] ? __pfx_journal_sort_key_cmp+0x10/0x10 [bcachefs] [ 236.168921] sort_r+0x10c/0x3a0 [ 236.168924] sort+0x31/0x60 [ 236.168925] ? __pfx_journal_sort_key_cmp+0x10/0x10 [bcachefs] [ 236.168965] __journal_keys_sort+0x27/0x180 [bcachefs] [ 236.169005] bch2_journal_keys_sort+0x388/0x3d0 [bcachefs] [ 236.169047] bch2_fs_recovery+0x59f/0x1460 [bcachefs] [ 236.169101] ? vprintk_emit+0xc6/0x280 [ 236.169104] ? __bch2_print+0x87/0xf0 [bcachefs] [ 236.169152] ? bch2_printbuf_exit+0x22/0x40 [bcachefs] [ 236.169203] ? print_mount_opts+0x131/0x190 [bcachefs] [ 236.169250] ? bch2_recalc_capacity+0x106/0x370 [bcachefs] [ 236.169284] bch2_fs_start+0x169/0x270 [bcachefs] [ 236.169330] bch2_fs_get_tree+0x600/0x750 [bcachefs] [ 236.169381] vfs_get_tree+0x29/0xf0 [ 236.169386] path_mount+0x4d8/0xb10 [ 236.169389] __x64_sys_mount+0x122/0x160 [ 236.169392] do_syscall_64+0xb7/0x210 [ 236.169395] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 236.169397] RIP: 0033:0x7f3928cd470e [ 236.169409] Code: 48 8b 0d 1d 67 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ea 66 0d 00 f7 d8 64 89 01 48 [ 236.169410] RSP: 002b:00007ffd5c4d4d18 EFLAGS: 00000297 ORIG_RAX: 00000000000000a5 [ 236.169412] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f3928cd470e [ 236.169413] RDX: 00005610dc6b5160 RSI: 00005610dc6b2e60 RDI: 00005610dc6b6e90 [ 236.169414] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 236.169415] R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000000 [ 236.169416] R13: 0000000000000055 R14: 00005610dc6b6e90 R15: 00005610dc6b5160 [ 236.169418] [ 243.025123] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): accounting_read... done [ 243.167041] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): alloc_read... done [ 243.308373] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): stripes_read... done [ 243.308379] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): snapshots_read... done [ 243.308386] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): check_allocations... done [ 569.417636] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): going read-write [ 569.449096] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): journal_replay... done [ 640.811221] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): resume_logged_ops... done [ 640.811229] bcachefs (677cf0a7-1abe-4ce3-876c-2ca63301229d): delete_dead_inodes... done ```
koverstreet commented 1 month ago

the ~4 minutes for check_allocations would've been the disk format upgrade

I just commited a patch to help a bit with that soft lockup - that's a trick one to deal with completely, though