koverstreet / bcachefs

[6.11.0-rc1] «trying to move an extent, but nr_replicas=0» during evacuate (panic) #726

Closed: ticpu closed this issue 2 months ago

ticpu commented 2 months ago

I was running device evacuate on the bcachefs-for-upstream branch (58474f76a770) when my machine suddenly rebooted. The kernel journal for that boot was flooded with this trace.

❯ journalctl -b-1 -k | wc -l
16722739

❯ journalctl -b-1 -k | grep "trying to move an extent, but nr_replicas=0" | wc -l
263178
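
The two counts above show how often the warning fired, but not whether one extent was retried in a loop or many different extents were affected. A minimal Python sketch for telling these apart follows. It assumes that the line right after each warning carries the extent position as `type extent <inode>:<offset>:<snapshot>`, as in the trace below; the helper name `count_warned_extents` is made up for illustration.

```python
import re
from collections import Counter

WARNING = "trying to move an extent, but nr_replicas=0"
# Extent position as it appears in the dump line (format assumed from the trace).
EXTENT_POS = re.compile(r"type extent (\d+:\d+:\d+)")


def count_warned_extents(lines):
    """Count how often each extent position directly follows the warning line."""
    counts = Counter()
    expect_extent = False
    for line in lines:
        if WARNING in line:
            # The next line is expected to hold the extent dump.
            expect_extent = True
            continue
        if expect_extent:
            match = EXTENT_POS.search(line)
            if match:
                counts[match.group(1)] += 1
            expect_extent = False
    return counts
```

One way to use it would be to pipe `journalctl -b-1 -k` into a small script that calls `count_warned_extents(sys.stdin)` and prints `len(counts)` next to `sum(counts.values())`. A small number of distinct positions with very high counts would suggest the same extents being retried.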

aoû 16 12:30:59 p4 kernel: ------------[ cut here ]------------
aoû 16 12:30:59 p4 kernel: trying to move an extent, but nr_replicas=0
                            u64s 10 type extent 806209255:4392:4294957955 len 32 ver 129844006: durability: 1 crc: c_size 32 size 32 offset 0 nonce 0 csum chacha20_poly1305_80 4814:5fec16c4fdc3a1c7  compress incompressible ptr: 0:529352:464 gen 76 cached stale ptr: 2:3957928:8 gen 0 ptr: 3:3652520:480 gen 2
                            rewrite ptrs:        100
                            kill ptrs:        0
                            target:        none
                            compression:        zstd
                            extra replicas:        0
aoû 16 12:30:59 p4 kernel: WARNING: CPU: 24 PID: 16844 at fs/bcachefs/data_update.c:682 bch2_data_update_init+0xe37/0x1460 [bcachefs]
aoû 16 12:30:59 p4 kernel: Modules linked in: raid1 poly1305_generic libpoly1305 poly1305_x86_64 chacha_generic chacha_x86_64 libchacha bcachefs lz4hc_compress lz4_compress dm_crypt macvtap vhost_net vhost vhost_iotlb tap tun vfat fat nft_masq nft_ct nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables bridge stp llc amd_atl intel_rapl_msr intel_rapl_common macvlan ecryptfs nvidia(POE) snd_hda_codec_realtek kvm_amd snd_hda_codec_generic amdgpu snd_hda_scodec_component snd_hda_codec_hdmi kvm snd_hda_intel snd_usb_audio drm_exec snd_intel_dspcfg amdxcp snd_intel_sdw_acpi drm_buddy crct10dif_pclmul snd_usbmidi_lib snd_hda_codec gpu_sched snd_ump crc32_pclmul drm_suballoc_helper snd_hda_core uvcvideo polyval_clmulni i2c_algo_bit snd_rawmidi polyval_generic snd_hwdep drm_ttm_helper snd_seq_device sp5100_tco videobuf2_vmalloc ttm ghash_clmulni_intel uvc snd_pcm sha512_ssse3 videobuf2_memops drm_display_helper snd_timer videobuf2_v4l2 i2c_piix4
aoû 16 12:30:59 p4 kernel:  sha1_ssse3 cec xpad snd r8169 pcspkr wmi_bmof rapl ff_memless ccp crc16 k10temp i2c_smbus videobuf2_common mousedev realtek soundcore cfg80211 joydev video gpio_amdpt wmi gpio_generic rfkill mac_hid aesni_intel v4l2loopback(OE) gf128mul videodev crypto_simd cryptd mc cbc usbip_host encrypted_keys usbip_core trusted asn1_encoder kvmfr(OE) crypto_user loop tee fuse nfnetlink ip_tables x_tables raid0 dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx md_mod dm_mod hid_logitech_dj hid_logitech_hidpp uas usb_storage hid_generic usbhid btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq nvme crc32c_intel sha256_ssse3 nvme_core xhci_pci xhci_pci_renesas nvme_auth vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd
aoû 16 12:30:59 p4 kernel: Unloaded tainted modules: nvidia_modeset(POE):1 nvidia_drm(POE):1 [last unloaded: nvidia_modeset(POE)]
aoû 16 12:30:59 p4 kernel: CPU: 24 UID: 0 PID: 16844 Comm: bcachefs Tainted: P        W  OE      6.11.0-rc1-1-bcachefs-git-00028-g58474f76a770 #2 6b9f4124185c5bda783d8ca631f05a62caced4ae
aoû 16 12:30:59 p4 kernel: Tainted: [P]=PROPRIETARY_MODULE, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
aoû 16 12:30:59 p4 kernel: Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 2.02 11/17/2023
aoû 16 12:30:59 p4 kernel: RIP: 0010:bch2_data_update_init+0xe37/0x1460 [bcachefs]
aoû 16 12:30:59 p4 kernel: Code: 49 8b b6 c8 00 00 00 49 8d 4e 70 48 89 df 49 8d 96 04 01 00 00 e8 09 ed ff ff 48 8b 75 a0 48 c7 c7 40 ff a9 c5 e8 c9 30 72 f2 <0f> 0b 48 89 df e8 5f b6 06 00 4c 89 f7 e8 97 ea ff ff 41 bc 56 f7
aoû 16 12:30:59 p4 kernel: RSP: 0018:ffffbbe2270736d0 EFLAGS: 00010282
aoû 16 12:30:59 p4 kernel: RAX: 0000000000000000 RBX: ffffbbe227073810 RCX: 0000000000000027
aoû 16 12:30:59 p4 kernel: RDX: ffff90f6be421a48 RSI: 0000000000000001 RDI: ffff90f6be421a40
aoû 16 12:30:59 p4 kernel: RBP: ffffbbe227073870 R08: 0000000000000000 R09: ffffbbe227073550
aoû 16 12:30:59 p4 kernel: R10: ffff90f73dd91c50 R11: 0000000000000003 R12: ffffbbe227073cf0
aoû 16 12:30:59 p4 kernel: R13: 0000000000000200 R14: ffff90e7e7f97200 R15: ffff90eb49b75050
aoû 16 12:30:59 p4 kernel: FS:  0000000000000000(0000) GS:ffff90f6be400000(0000) knlGS:0000000000000000
aoû 16 12:30:59 p4 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
aoû 16 12:30:59 p4 kernel: CR2: 000000c002d0f030 CR3: 0000000002c20000 CR4: 0000000000f50ef0
aoû 16 12:30:59 p4 kernel: PKRU: 55555554
aoû 16 12:30:59 p4 kernel: Call Trace:
aoû 16 12:30:59 p4 kernel:  <TASK>
aoû 16 12:30:59 p4 kernel:  ? bch2_data_update_init+0xe37/0x1460 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? __warn.cold+0x8e/0xe8
aoû 16 12:30:59 p4 kernel:  ? bch2_data_update_init+0xe37/0x1460 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? report_bug+0xfb/0x140
aoû 16 12:30:59 p4 kernel:  ? handle_bug+0x38/0x70
aoû 16 12:30:59 p4 kernel:  ? exc_invalid_op+0x17/0x60
aoû 16 12:30:59 p4 kernel:  ? asm_exc_invalid_op+0x1a/0x20
aoû 16 12:30:59 p4 kernel:  ? bch2_data_update_init+0xe37/0x1460 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? get_page_from_freelist+0x17af/0x1a30
aoû 16 12:30:59 p4 kernel:  ? bch2_move_extent+0x3d3/0x9a0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_move_extent+0x3d3/0x9a0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_btree_iter_peek_upto+0x545/0xf90 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_trans_begin+0x57a/0x7e0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_move_data_btree+0x441/0x550 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_move_data_btree+0x441/0x550 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? move_write_done+0x60/0x60 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_move_data_btree+0x1ac/0x550 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? __bch2_move_data+0xea/0x1f0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  __bch2_move_data+0xea/0x1f0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? move_write_done+0x60/0x60 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? move_write_done+0x60/0x60 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_dev_usage_read+0x70/0x70 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_move_data+0x96/0xd0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_move_data+0x5e/0xd0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_data_job+0x157/0x2d0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_data_thread+0x4a/0x60 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  kthread+0xcf/0x100
aoû 16 12:30:59 p4 kernel:  ? kthread_park+0x80/0x80
aoû 16 12:30:59 p4 kernel:  ret_from_fork+0x31/0x50
aoû 16 12:30:59 p4 kernel:  ? kthread_park+0x80/0x80
aoû 16 12:30:59 p4 kernel:  ret_from_fork_asm+0x11/0x20
aoû 16 12:30:59 p4 kernel:  </TASK>
aoû 16 12:30:59 p4 kernel: ---[ end trace 0000000000000000 ]---
aoû 16 12:30:59 p4 kernel: ------------[ cut here ]------------
aoû 16 12:30:59 p4 kernel: trying to move an extent, but nr_replicas=0
                            u64s 11 type extent 806209255:4776:4294957955 len 32 ver 129844009: durability: 1 ptr: 3:3882613:0 gen 0 crc: c_size 32 size 32 offset 0 nonce 64 csum chacha20_poly1305_80 dc85:a6b2fc523b6b865a  compress incompressible ptr: 0:529645:280 gen 97 cached stale ptr: 2:3957928:392 gen 0 ptr: 4:1829333:352 gen 2 cached
                            rewrite ptrs:        1
                            kill ptrs:        0
                            target:        none
                            compression:        zstd
                            extra replicas:        0
aoû 16 12:30:59 p4 kernel: WARNING: CPU: 24 PID: 16844 at fs/bcachefs/data_update.c:682 bch2_data_update_init+0xe37/0x1460 [bcachefs]
aoû 16 12:30:59 p4 kernel: Modules linked in: raid1 poly1305_generic libpoly1305 poly1305_x86_64 chacha_generic chacha_x86_64 libchacha bcachefs lz4hc_compress lz4_compress dm_crypt macvtap vhost_net vhost vhost_iotlb tap tun vfat fat nft_masq nft_ct nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables bridge stp llc amd_atl intel_rapl_msr intel_rapl_common macvlan ecryptfs nvidia(POE) snd_hda_codec_realtek kvm_amd snd_hda_codec_generic amdgpu snd_hda_scodec_component snd_hda_codec_hdmi kvm snd_hda_intel snd_usb_audio drm_exec snd_intel_dspcfg amdxcp snd_intel_sdw_acpi drm_buddy crct10dif_pclmul snd_usbmidi_lib snd_hda_codec gpu_sched snd_ump crc32_pclmul drm_suballoc_helper snd_hda_core uvcvideo polyval_clmulni i2c_algo_bit snd_rawmidi polyval_generic snd_hwdep drm_ttm_helper snd_seq_device sp5100_tco videobuf2_vmalloc ttm ghash_clmulni_intel uvc snd_pcm sha512_ssse3 videobuf2_memops drm_display_helper snd_timer videobuf2_v4l2 i2c_piix4
aoû 16 12:30:59 p4 kernel:  sha1_ssse3 cec xpad snd r8169 pcspkr wmi_bmof rapl ff_memless ccp crc16 k10temp i2c_smbus videobuf2_common mousedev realtek soundcore cfg80211 joydev video gpio_amdpt wmi gpio_generic rfkill mac_hid aesni_intel v4l2loopback(OE) gf128mul videodev crypto_simd cryptd mc cbc usbip_host encrypted_keys usbip_core trusted asn1_encoder kvmfr(OE) crypto_user loop tee fuse nfnetlink ip_tables x_tables raid0 dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx md_mod dm_mod hid_logitech_dj hid_logitech_hidpp uas usb_storage hid_generic usbhid btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq nvme crc32c_intel sha256_ssse3 nvme_core xhci_pci xhci_pci_renesas nvme_auth vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd
aoû 16 12:30:59 p4 kernel: Unloaded tainted modules: nvidia_modeset(POE):1 nvidia_drm(POE):1 [last unloaded: nvidia_modeset(POE)]
aoû 16 12:30:59 p4 kernel: CPU: 24 UID: 0 PID: 16844 Comm: bcachefs Tainted: P        W  OE      6.11.0-rc1-1-bcachefs-git-00028-g58474f76a770 #2 6b9f4124185c5bda783d8ca631f05a62caced4ae
aoû 16 12:30:59 p4 kernel: Tainted: [P]=PROPRIETARY_MODULE, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
aoû 16 12:30:59 p4 kernel: Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 2.02 11/17/2023
aoû 16 12:30:59 p4 kernel: RIP: 0010:bch2_data_update_init+0xe37/0x1460 [bcachefs]
aoû 16 12:30:59 p4 kernel: Code: 49 8b b6 c8 00 00 00 49 8d 4e 70 48 89 df 49 8d 96 04 01 00 00 e8 09 ed ff ff 48 8b 75 a0 48 c7 c7 40 ff a9 c5 e8 c9 30 72 f2 <0f> 0b 48 89 df e8 5f b6 06 00 4c 89 f7 e8 97 ea ff ff 41 bc 56 f7
aoû 16 12:30:59 p4 kernel: RSP: 0018:ffffbbe2270736d0 EFLAGS: 00010282
aoû 16 12:30:59 p4 kernel: RAX: 0000000000000000 RBX: ffffbbe227073810 RCX: 0000000000000027
aoû 16 12:30:59 p4 kernel: RDX: ffff90f6be421a48 RSI: 0000000000000001 RDI: ffff90f6be421a40
aoû 16 12:30:59 p4 kernel: RBP: ffffbbe227073870 R08: 0000000000000000 R09: ffffbbe227073550
aoû 16 12:30:59 p4 kernel: R10: ffff90f73dd92f90 R11: 0000000000000003 R12: ffffbbe227073cf0
aoû 16 12:30:59 p4 kernel: R13: 0000000000000200 R14: ffff90e7e7f97200 R15: ffff90eb49b75058
aoû 16 12:30:59 p4 kernel: FS:  0000000000000000(0000) GS:ffff90f6be400000(0000) knlGS:0000000000000000
aoû 16 12:30:59 p4 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
aoû 16 12:30:59 p4 kernel: CR2: 000000c002d0f030 CR3: 0000000002c20000 CR4: 0000000000f50ef0
aoû 16 12:30:59 p4 kernel: PKRU: 55555554
aoû 16 12:30:59 p4 kernel: Call Trace:
aoû 16 12:30:59 p4 kernel:  <TASK>
aoû 16 12:30:59 p4 kernel:  ? bch2_data_update_init+0xe37/0x1460 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? __warn.cold+0x8e/0xe8
aoû 16 12:30:59 p4 kernel:  ? bch2_data_update_init+0xe37/0x1460 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? report_bug+0xfb/0x140
aoû 16 12:30:59 p4 kernel:  ? handle_bug+0x38/0x70
aoû 16 12:30:59 p4 kernel:  ? exc_invalid_op+0x17/0x60
aoû 16 12:30:59 p4 kernel:  ? asm_exc_invalid_op+0x1a/0x20
aoû 16 12:30:59 p4 kernel:  ? bch2_data_update_init+0xe37/0x1460 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? get_page_from_freelist+0x17af/0x1a30
aoû 16 12:30:59 p4 kernel:  ? bch2_move_extent+0x3d3/0x9a0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_move_extent+0x3d3/0x9a0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_btree_iter_peek_upto+0x545/0xf90 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_trans_begin+0x5ca/0x7e0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_move_data_btree+0x441/0x550 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_move_data_btree+0x441/0x550 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? move_write_done+0x60/0x60 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_move_data_btree+0x1ac/0x550 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? __bch2_move_data+0xea/0x1f0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  __bch2_move_data+0xea/0x1f0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? move_write_done+0x60/0x60 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? move_write_done+0x60/0x60 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_dev_usage_read+0x70/0x70 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_move_data+0x96/0xd0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  ? bch2_move_data+0x5e/0xd0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_data_job+0x157/0x2d0 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  bch2_data_thread+0x4a/0x60 [bcachefs 7040509e7d3ea815c2b6184fca9a3fe7b67bdf6a]
aoû 16 12:30:59 p4 kernel:  kthread+0xcf/0x100
aoû 16 12:30:59 p4 kernel:  ? kthread_park+0x80/0x80
aoû 16 12:30:59 p4 kernel:  ret_from_fork+0x31/0x50
aoû 16 12:30:59 p4 kernel:  ? kthread_park+0x80/0x80
aoû 16 12:30:59 p4 kernel:  ret_from_fork_asm+0x11/0x20
aoû 16 12:30:59 p4 kernel:  </TASK>
aoû 16 12:30:59 p4 kernel: ---[ end trace 0000000000000000 ]---
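
Each dump line in the traces above lists the extent's pointers as `ptr: <a>:<b>:<c> gen <n>`, optionally followed by `cached` and/or `stale`. When comparing many of these warnings it can help to extract those fields mechanically. The Python sketch below does that. Reading the three numbers as device, bucket and offset is an assumption about the dump format, not something confirmed in this thread, and `Pointer` and `parse_pointers` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Matches e.g. "ptr: 0:529645:280 gen 97 cached stale" (format assumed from the trace).
PTR_RE = re.compile(r"ptr: (\d+):(\d+):(\d+) gen (\d+)((?: (?:cached|stale))*)")


@dataclass
class Pointer:
    dev: int      # assumed: device index
    bucket: int   # assumed: bucket number on that device
    offset: int   # assumed: offset within the bucket
    gen: int
    cached: bool
    stale: bool


def parse_pointers(extent_line):
    """Return every pointer found in one extent dump line, in order."""
    pointers = []
    for dev, bucket, offset, gen, flags in PTR_RE.findall(extent_line):
        words = flags.split()
        pointers.append(
            Pointer(int(dev), int(bucket), int(offset), int(gen),
                    cached="cached" in words, stale="stale" in words)
        )
    return pointers
```

Applied to the second trace, this yields four pointers, two of them flagged `cached`. It only restates what the log prints and does not explain why the move path arrived at nr_replicas=0.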

koverstreet commented 2 months ago

There's a patch in the testing branch that should get us a bit more info.

ticpu commented 2 months ago

Evacuate seems to run fine with commit cf1f34910cf405c176c41b97243e49b0398ee3f8; it went past the extent that was causing the issue. Now at 872919485.