openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.7k stars 1.76k forks source link

kernel: PANIC <pool> blkptr at <address> has invalid CHECKSUM 0 #13926

Open flippers2652 opened 2 years ago

flippers2652 commented 2 years ago

System information

Type Version/Name
Distribution Name Arch
Distribution Version Latest
Kernel Version 5.15.65-1-lts
Architecture x64
OpenZFS Version zfs-2.1.5-1

Describe the problem you're observing

I am currently running my root on ZFS. It imports fine and the computer usually operates normally however after a while I receive an error through journalctl (See Below) and the computer starts to freeze with the desktop environment usually going first.

I have scrubbed but it says there are no errors: scan: scrub repaired 0B in 00:39:01 with 0 errors on Sun Sep 18 17:54:00 2022 Similarly,

zpool status -x returns all pools are healthy

When I search the error it seems only to have the message coming up when the pool is imported, however, my pool imports perfectly fine. I tried asking for help on Reddit but no one knew what it was so one person suggested submitting a bug report.

Thank you for your help.

Describe how to reproduce the problem

Unknown.

Include any warning/errors/backtraces from the system logs

Sep 19 18:48:44 Adelie kernel: PANIC: rpool: blkptr at 00000000ef675ad8 has invalid CHECKSUM 0
Sep 19 18:48:44 Adelie kernel: Showing stack for process 71512
Sep 19 18:48:44 Adelie kernel: CPU: 6 PID: 71512 Comm: z_wr_int_0 Tainted: P           OE     5.15.65-1-lts #1 cc2835ecad4053a59e930d75aa6d41a8591dbaf1
Sep 19 18:48:44 Adelie kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B93/MPG X570 GAMING PRO CARBON WIFI (MS-7B93), BIOS 1.60 01/14/2020
Sep 19 18:48:44 Adelie kernel: Call Trace:
Sep 19 18:48:44 Adelie kernel:  <TASK>
Sep 19 18:48:44 Adelie kernel:  dump_stack_lvl+0x45/0x5b
Sep 19 18:48:44 Adelie kernel:  vcmn_err.cold+0x50/0x68 [spl 5dcaec7c8f893e3fb2456dcfc9be1c3548e7902b]
Sep 19 18:48:44 Adelie kernel:  zfs_panic_recover+0x75/0x90 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  zfs_blkptr_verify_log+0xab/0x110 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  zfs_blkptr_verify+0x351/0x450 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  zio_free+0x21/0xf0 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  dsl_dataset_block_kill+0x4a6/0x4d0 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  dbuf_write_done+0x1b2/0x1f0 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  arc_write_done+0x92/0x440 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  zio_done+0x2b1/0x10d0 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  ? kfree+0x221/0x250
Sep 19 18:48:44 Adelie kernel:  zio_execute+0x83/0x120 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  taskq_thread+0x2e3/0x510 [spl 5dcaec7c8f893e3fb2456dcfc9be1c3548e7902b]
Sep 19 18:48:44 Adelie kernel:  ? wake_up_q+0x90/0x90
Sep 19 18:48:44 Adelie kernel:  ? zio_ddt_child_write_done+0xd0/0xd0 [zfs 3d188c1adc103915b81444415b2a9c68b430b299]
Sep 19 18:48:44 Adelie kernel:  ? taskq_thread_spawn+0x50/0x50 [spl 5dcaec7c8f893e3fb2456dcfc9be1c3548e7902b]
Sep 19 18:48:44 Adelie kernel:  kthread+0x118/0x140
Sep 19 18:48:44 Adelie kernel:  ? set_kthread_struct+0x50/0x50
Sep 19 18:48:44 Adelie kernel:  ret_from_fork+0x22/0x30
Sep 19 18:48:44 Adelie kernel:  </TASK>
ryao commented 2 years ago

This indicates that the in-memory checksum field is corrupt. ZFS responded by panicking to protect your data integrity.

Do you always receive the same error?

Would you be willing to recompile the ZFS kernel modules with some debug flags? In specific, ./configure --enable-debug --enable-debuginfo --enable-asan --enable-ubsan.

Honestly, I would prefer to recompile the kernel with ASAN and UBSAN here, but I realize that might be a bit too much.

UBSAN and ASAN should hopefully provide more detailed information about what is happening.

flippers2652 commented 2 years ago

I am willing to compile it and have reached the stage I type ./configure --enable-debug --enable-debuginfo --enable-asan --enable-ubsan however I don't know what packages are required to build on arch and also how to actually install it once built.

ryao commented 2 years ago

It looks like Arch Linux uses DKMS. There should be a file in /etc/default/zfs that has the following variables:

You want to change them to yes from no.

After doing that, you can rebuild the modules:

https://wiki.archlinux.org/title/Dynamic_Kernel_Module_Support#Rebuild_modules

ryao commented 2 years ago

I was mistaken. --enable-asan --enable-ubsan does not apply to the kernel. Getting that in the kernel requires rebuilding the kernel.

I simplified my instructions to omit the ASAN and UBSAN parts.

flippers2652 commented 2 years ago

I have now done that.

ryao commented 2 years ago

I have now done that.

Okay. Assuming that you have regenerated your initramfs (I forgot to mention this earlier) and rebooted, now we wait.

flippers2652 commented 2 years ago

Where would I get any new infomation?

ryao commented 2 years ago

dmesg, although if the system locks up, that might be difficult. I suggest setting up pstore:

https://www.kernel.org/doc/html/latest/admin-guide/pstore-blk.html

flippers2652 commented 2 years ago

I don't see anything new in dmesg.

ryao commented 2 years ago

@flippers2652 Something has to go wrong first, or did something already go wrong?

flippers2652 commented 2 years ago

It still just only outputs the error from earlier.

ryao commented 2 years ago

Well, that tells me something, but not as much as I wanted to know.

Are you using ZFS encryption?

flippers2652 commented 2 years ago

I am using aes-256-gcm encryption.

ryao commented 2 years ago

a1d477c24c7badc89c60955995fd84d311938486 is what added the check that is becoming upset. I am not sure how the checksum is becoming corrupted. The only code path that zeroes this is in the encryption code, but that is only for readonly pools with an early version of encryption support. Are there any other third party kernel modules on this system?

Also, there is an issue with the backtrace from your system. It says that dsl_dataset_block_kill called zio_free(), but that is not possible. In fact, the stack is nonsensical. That suggests an issue in how the back traces were generated. I wonder if these modules are somehow still being stripped, or if the debuginfo is not being generated. The instructions I gave you should have fixed that. :/

When this happens, is it possible to retrieve the contains of /proc/spl/kstat/zfs/dbgmsg via SSH?

ryao commented 2 years ago

Actually, I see another problem. 00000000ef675ad8 is a userspace pointer. This should not be possible. :/

rincebrain commented 2 years ago

Whenever I see something in arc_write with wild, impossible contents, I think of #11679, which essentially (to my current understanding) involves an incorrect refcount resulting in a buffer being discoverable when the code assumes it can't be and can be manipulated freely, and then woe betides anyone who loses that race and peers inside at the wrong time.

Could be an orthogonal failure, of course, but that's where I become suspicious whenever I see that and encryption together.

flippers2652 commented 2 years ago

a1d477c is what added the check that is becoming upset. I am not sure how the checksum is becoming corrupted. The only code path that zeroes this is in the encryption code, but that is only for readonly pools with an early version of encryption support. Are there any other third party kernel modules on this system?

I don't think so·

snd_seq_dummy          16384  0
snd_hrtimer            16384  1
snd_seq                94208  7 snd_seq_dummy
snd_seq_device         16384  1 snd_seq
rfcomm                 90112  16
ccm                    20480  6
cmac                   16384  3
algif_hash             16384  1
algif_skcipher         16384  1
af_alg                 36864  6 algif_hash,algif_skcipher
bnep                   32768  2
intel_rapl_msr         20480  0
intel_rapl_common      32768  1 intel_rapl_msr
edac_mce_amd           53248  0
iwlmvm                495616  0
ccp                   126976  0
rng_core               16384  1 ccp
kvm                  1085440  0
snd_hda_codec_realtek   167936  1
snd_hda_codec_generic    98304  1 snd_hda_codec_realtek
irqbypass              16384  1 kvm
ledtrig_audio          16384  1 snd_hda_codec_generic
crct10dif_pclmul       16384  1
snd_hda_codec_hdmi     81920  1
crc32_pclmul           16384  0
crc32c_intel           24576  0
ghash_clmulni_intel    16384  0
snd_hda_intel          61440  2
aesni_intel           380928  8
snd_intel_dspcfg       32768  1 snd_hda_intel
wmi_bmof               16384  0
mac80211             1216512  1 iwlmvm
snd_intel_sdw_acpi     20480  1 snd_intel_dspcfg
crypto_simd            16384  1 aesni_intel
snd_hda_codec         184320  4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
cryptd                 28672  3 crypto_simd,ghash_clmulni_intel
vfat                   24576  1
fat                    90112  1 vfat
amdgpu               8548352  63
libarc4                16384  1 mac80211
rapl                   16384  0
snd_hda_core          118784  5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
pcspkr                 16384  0
snd_hwdep              16384  1 snd_hda_codec
k10temp                16384  0
snd_pcm               159744  4 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_core
iwlwifi               454656  1 iwlmvm
btusb                  65536  0
snd_timer              49152  3 snd_seq,snd_hrtimer,snd_pcm
btrtl                  28672  1 btusb
btbcm                  20480  1 btusb
snd                   131072  16 snd_hda_codec_generic,snd_seq,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek,snd_timer,snd_pcm
btintel                45056  1 btusb
gpu_sched              49152  1 amdgpu
drm_ttm_helper         16384  1 amdgpu
soundcore              16384  1 snd
joydev                 28672  0
bluetooth             745472  45 btrtl,btintel,btbcm,bnep,btusb,rfcomm
mousedev               24576  0
sp5100_tco             20480  0
ttm                    86016  2 amdgpu,drm_ttm_helper
cfg80211             1056768  3 iwlmvm,iwlwifi,mac80211
i2c_piix4              36864  0
ecdh_generic           16384  2 bluetooth
igb                   290816  0
crc16                  16384  1 bluetooth
dca                    16384  1 igb
rfkill                 36864  9 bluetooth,cfg80211
wmi                    45056  1 wmi_bmof
mac_hid                16384  0
pinctrl_amd            32768  0
acpi_cpufreq           32768  0
crypto_user            24576  0
fuse                  172032  1
bpf_preload            16384  0
ip_tables              36864  0
x_tables               57344  1 ip_tables
usbhid                 73728  0
zfs                  4288512  13
zunicode              335872  1 zfs
zzstd                 577536  1 zfs
zlua                  188416  1 zfs
zavl                   16384  1 zfs
icp                   335872  1 zfs
zcommon               102400  2 zfs,icp
znvpair               110592  2 zfs,zcommon
spl                   122880  6 zfs,icp,zzstd,znvpair,zcommon,zavl
xhci_pci               20480  0
xhci_pci_renesas       24576  1 xhci_pci

When this happens, is it possible to retrieve the contains of /proc/spl/kstat/zfs/dbgmsg via SSH?


timestamp    message 
1663679572   spa.c:6247:spa_tryimport(): spa_tryimport: importing rpool
1663679572   spa.c:6252:spa_tryimport(): spa_tryimport: using cachefile '/etc/zfs/zpool.cache.org'
1663679572   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): LOADING
1663679572   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/nvme1n1p1': best uberblock found for spa $import. txg 8540642
1663679572   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=8540642
1663679572   spa.c:8370:spa_async_request(): spa=$import async request task=2048
1663679572   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): LOADED
1663679572   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): UNLOADING
1663679572   spa.c:6247:spa_tryimport(): spa_tryimport: importing rpool
1663679572   spa.c:6252:spa_tryimport(): spa_tryimport: using cachefile '/etc/zfs/zpool.cache.org'
1663679572   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): LOADING
1663679572   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/nvme1n1p1': best uberblock found for spa $import. txg 8540642
1663679572   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=8540642
1663679572   spa.c:8370:spa_async_request(): spa=$import async request task=2048
1663679572   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): LOADED
1663679572   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): UNLOADING
1663679572   spa.c:6099:spa_import(): spa_import: importing rpool
1663679572   spa_misc.c:418:spa_load_note(): spa_load(rpool, config trusted): LOADING
1663679572   vdev.c:152:vdev_dbgmsg(): disk vdev '/dev/nvme1n1p1': best uberblock found for spa rpool. txg 8540642
1663679572   spa_misc.c:418:spa_load_note(): spa_load(rpool, config untrusted): using uberblock with txg=8540642
1663679572   spa_misc.c:418:spa_load_note(): spa_load(rpool, config trusted): read 119 log space maps (119 total blocks - blksz = 131072 bytes) in 13 ms
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 0, spa rpool, vdev_id 0, ms_id 97, smp_length 114776, unflushed_allocs 167936, unflushed_frees 167936, freed 0, defer 0 + 0, unloaded time 2025 ms, loading_time 1 ms, ms_max_size 17072435200, max size error 17072304128, old_weight 840000000000001, new_weight 840000000000001
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 0, spa rpool, vdev_id 0, ms_id 66, smp_length 463200, unflushed_allocs 8974336, unflushed_frees 7659520, freed 0, defer 0 + 0, unloaded time 2029 ms, loading_time 16 ms, ms_max_size 1163558912, max size error 1162223616, old_weight 780000000000001, new_weight 780000000000001
1663679572   mmp.c:240:mmp_thread_start(): MMP thread started pool 'rpool' gethrtime 2705286387
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540643, spa rpool, vdev_id 0, ms_id 38, smp_length 317784, unflushed_allocs 106496, unflushed_frees 745472, freed 0, defer 0 + 0, unloaded time 2705 ms, loading_time 10 ms, ms_max_size 2553995264, max size error 2553864192, old_weight 7c0000000000001, new_weight 7c0000000000001
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540643, spa rpool, vdev_id 0, ms_id 115, smp_length 437272, unflushed_allocs 1769472, unflushed_frees 1208320, freed 0, defer 0 + 0, unloaded time 2716 ms, loading_time 14 ms, ms_max_size 2494697472, max size error 2494554112, old_weight 7c0000000000001, new_weight 7c0000000000001
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540643, spa rpool, vdev_id 0, ms_id 47, smp_length 388664, unflushed_allocs 368640, unflushed_frees 208896, freed 0, defer 0 + 0, unloaded time 2730 ms, loading_time 12 ms, ms_max_size 1541840896, max size error 1541804032, old_weight 780000000000002, new_weight 780000000000002
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540643, spa rpool, vdev_id 0, ms_id 54, smp_length 96472, unflushed_allocs 2150400, unflushed_frees 1695744, freed 0, defer 0 + 0, unloaded time 2743 ms, loading_time 3 ms, ms_max_size 2144243712, max size error 2143928320, old_weight 780000000000002, new_weight 780000000000002
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540643, spa rpool, vdev_id 0, ms_id 64, smp_length 189064, unflushed_allocs 2203648, unflushed_frees 2736128, freed 0, defer 0 + 0, unloaded time 2747 ms, loading_time 6 ms, ms_max_size 1402626048, max size error 1402396672, old_weight 780000000000002, new_weight 780000000000002
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540643, spa rpool, vdev_id 0, ms_id 46, smp_length 61368, unflushed_allocs 950272, unflushed_frees 651264, freed 0, defer 0 + 0, unloaded time 2753 ms, loading_time 1 ms, ms_max_size 2106228736, max size error 2106138624, old_weight 780000000000001, new_weight 780000000000001
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540643, spa rpool, vdev_id 0, ms_id 50, smp_length 176232, unflushed_allocs 106496, unflushed_frees 122880, freed 0, defer 0 + 0, unloaded time 2755 ms, loading_time 4 ms, ms_max_size 1947385856, max size error 1947365376, old_weight 780000000000001, new_weight 780000000000001
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540643, spa rpool, vdev_id 0, ms_id 112, smp_length 205840, unflushed_allocs 569344, unflushed_frees 1355776, freed 0, defer 0 + 0, unloaded time 2760 ms, loading_time 7 ms, ms_max_size 1717813248, max size error 1717731328, old_weight 780000000000001, new_weight 780000000000001
1663679572   spa.c:8370:spa_async_request(): spa=rpool async request task=1
1663679572   spa.c:8370:spa_async_request(): spa=rpool async request task=2048
1663679572   spa_misc.c:418:spa_load_note(): spa_load(rpool, config trusted): LOADED
1663679572   spa_history.c:307:spa_history_log_sync(): txg 8540646 open pool version 5000; software version zfs-2.1.5-0-g6c3c5fcfb-dist; uts archlinux 5.15.68-1-lts #1 SMP Thu, 15 Sep 2022 09:53:50 +0000 x86_64
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540646, spa rpool, vdev_id 0, ms_id 113, smp_length 305712, unflushed_allocs 577536, unflushed_frees 696320, freed 0, defer 0 + 0, unloaded time 2772 ms, loading_time 11 ms, ms_max_size 1385553920, max size error 1385422848, old_weight 780000000000001, new_weight 780000000000001
1663679572   spa.c:8370:spa_async_request(): spa=rpool async request task=32
1663679572   spa_history.c:307:spa_history_log_sync(): txg 8540651 import pool version 5000; software version zfs-2.1.5-0-g6c3c5fcfb-dist; uts archlinux 5.15.68-1-lts #1 SMP Thu, 15 Sep 2022 09:53:50 +0000 x86_64
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540651, spa rpool, vdev_id 0, ms_id 117, smp_length 109568, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 2789 ms, loading_time 5 ms, ms_max_size 2146680832, max size error 2146680832, old_weight 780000000000001, new_weight 780000000000001
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540655, spa rpool, vdev_id 0, ms_id 35, smp_length 206736, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 2789 ms, loading_time 12 ms, ms_max_size 1047310336, max size error 1047310336, old_weight 740000000000005, new_weight 740000000000005
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540655, spa rpool, vdev_id 0, ms_id 62, smp_length 253328, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 2789 ms, loading_time 13 ms, ms_max_size 979312640, max size error 979312640, old_weight 740000000000009, new_weight 740000000000009
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540655, spa rpool, vdev_id 0, ms_id 118, smp_length 362848, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 2789 ms, loading_time 15 ms, ms_max_size 2145361920, max size error 2145361920, old_weight 780000000000001, new_weight 780000000000001
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540655, spa rpool, vdev_id 0, ms_id 116, smp_length 314176, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 2789 ms, loading_time 16 ms, ms_max_size 2142957568, max size error 2142957568, old_weight 780000000000001, new_weight 780000000000001
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540655, spa rpool, vdev_id 0, ms_id 55, smp_length 249808, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 2795 ms, loading_time 12 ms, ms_max_size 1000140800, max size error 1000140800, old_weight 740000000000005, new_weight 740000000000005
1663679572   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540655, spa rpool, vdev_id 0, ms_id 114, smp_length 389632, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 2789 ms, loading_time 19 ms, ms_max_size 2017890304, max size error 2017890304, old_weight 780000000000001, new_weight 780000000000001
1663679573   metaslab.c:2436:metaslab_load_impl(): metaslab_load: txg 8540655, spa rpool, vdev_id 0, ms_id 67, smp_length 496592, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 2802 ms, loading_time 19 ms, ms_max_size 1070657536, max size error 1070657536, old_weight 740000000000003, new_weight 740000000000003
1663679578   spa_history.c:294:spa_history_log_sync(): command: zpool import -c /etc/zfs/zpool.cache.org -N rpool
1663679588   spa_history.c:330:spa_history_log_sync(): ioctl load-key
1663679588   spa_history.c:294:spa_history_log_sync(): command: zfs load-key rpool

1663681615 is when the error in journalctl happens.

1663681918 zio.c:2088:zio_deadman(): zio_wait waiting for hung I/O to pool 'rpool' 1663681979 zio.c:2088:zio_deadman(): zio_wait waiting for hung I/O to pool 'rpool'

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

snajpa commented 5 days ago

is this still a problem in more recent releases / master branch?