openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

System Freeze when running scrub #14972

Open frawau opened 1 year ago

frawau commented 1 year ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 22.04, 23.04 |
| Kernel Version | 6.2.0-20 |
| Architecture | x86_64 |
| OpenZFS Version | 2.1.9 |

Describe the problem you're observing

The system freezes regularly, and it also freezes every time I run a scrub.

My system has been behaving weirdly for some time. I was running Ubuntu 22.04, and the command "zpool status" was sometimes showing all devices with exactly the same number of checksum errors, an unlikely feat.

I was using a TUF X570-based motherboard, so I decided to update the BIOS and the thing died. I thought I had found the culprit.

I replaced the motherboard, the memory modules and the SATA cables.

After changing all that, the problem still happened.

So I decided to change to Ubuntu 23.04 with Linux kernel 6.x.

The problem still happens.

I am using 6 WD Red 8TB disks in raidz1 mode

      pool: Universe
     state: ONLINE
      scan: scrub canceled on Sun Jun 11 13:33:54 2023
    config:

    NAME        STATE     READ WRITE CKSUM
    Universe    ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0

My CPU is an AMD Ryzen 5 3600 6-Core Processor. The motherboard is a Micro-Star International Co., Ltd. MS-7D54/MAG X570S TORPEDO MAX (MS-7D54).

SMART reports all 6 HDDs as OK.

Describe how to reproduce the problem

zpool scrub Universe

Include any warning/errors/backtraces from the system logs

"""
2023-06-11T13:19:57.393010+07:00 portland zed: eid=18 class=scrub_start pool='Universe'
2023-06-11T13:19:57.439992+07:00 portland systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
2023-06-11T13:19:57.457021+07:00 portland systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
2023-06-11T13:19:57.457151+07:00 portland systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
2023-06-11T13:19:57.459596+07:00 portland systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
2023-06-11T13:20:18.699440+07:00 portland kernel: [ 931.878163] ------------[ cut here ]------------
2023-06-11T13:20:18.699451+07:00 portland kernel: [ 931.878169] rq->clock_update_flags < RQCF_ACT_SKIP
2023-06-11T13:20:18.699452+07:00 portland kernel: [ 931.878173] WARNING: CPU: 8 PID: 0 at kernel/sched/sched.h:1491 update_rq_clock+0x184/0x230
2023-06-11T13:20:18.699453+07:00 portland kernel: [ 931.878181] Modules linked in: tls vhost_net vhost vhost_iotlb tap bridge stp llc cfg80211 binfmt_misc nls_iso8859_1 snd_hda_codec_hdmi intel_rapl_msr snd_hda_intel zfs(PO) intel_rapl_common snd_intel_dspcfg snd_usb_audio snd_intel_sdw_acpi snd_usbmidi_lib zunicode(PO) edac_mce_amd snd_hda_codec snd_rawmidi zzstd(O) kvm_amd snd_hda_core snd_seq_device zlua(O) mc snd_hwdep zavl(PO) kvm snd_pcm icp(PO) snd_timer irqbypass zcommon(PO) snd rapl znvpair(PO) wmi_bmof k10temp ccp soundcore spl(O) joydev mac_hid nfsd auth_rpcgss nfs_acl lockd dm_multipath scsi_dh_rdac scsi_dh_emc grace scsi_dh_alua msr efi_pstore sunrpc dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nouveau mxm_wmi i2c_algo_bit drm_ttm_helper ttm drm_display_helper cec hid_generic crct10dif_pclmul rc_core crc32_pclmul drm_kms_helper polyval_clmulni syscopyarea polyval_generic sysfillrect
2023-06-11T13:20:18.699454+07:00 portland kernel: [ 931.878246] usbhid ghash_clmulni_intel sysimgblt hid sha512_ssse3 aesni_intel nvme drm crypto_simd r8169 ahci xhci_pci nvme_core video cryptd i2c_piix4 libahci xhci_pci_renesas realtek nvme_common wmi
2023-06-11T13:20:18.699454+07:00 portland kernel: [ 931.878260] CPU: 8 PID: 0 Comm: swapper/8 Tainted: P O 6.2.0-20-generic #20-Ubuntu
2023-06-11T13:20:18.699454+07:00 portland kernel: [ 931.878262] Hardware name: Micro-Star International Co., Ltd. MS-7D54/MAG X570S TORPEDO MAX (MS-7D54), BIOS A.60 04/29/2023
2023-06-11T13:20:18.699455+07:00 portland kernel: [ 931.878264] RIP: 0010:update_rq_clock+0x184/0x230
2023-06-11T13:20:18.699455+07:00 portland kernel: [ 931.878267] Code: 0f b6 25 1c b1 c7 02 41 80 fc 01 0f 87 f9 f7 f1 00 41 83 e4 01 75 15 48 c7 c7 90 ec f5 93 c6 05 fe b0 c7 02 01 e8 5c 2c fb ff <0f> 0b 48 8b 93 40 0a 00 00 8b 83 08 0a 00 00 48 89 93 48 0a 00 00
2023-06-11T13:20:18.699456+07:00 portland kernel: [ 931.878268] RSP: 0018:ffffbf508035ce28 EFLAGS: 00010046
2023-06-11T13:20:18.699456+07:00 portland kernel: [ 931.878270] RAX: 0000000000000000 RBX: ffff9b37bf0316c0 RCX: 0000000000000000
2023-06-11T13:20:18.699457+07:00 portland kernel: [ 931.878271] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2023-06-11T13:20:18.699471+07:00 portland kernel: [ 931.878272] RBP: ffffbf508035ce48 R08: 0000000000000000 R09: 0000000000000000
2023-06-11T13:20:18.699473+07:00 portland kernel: [ 931.878273] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2023-06-11T13:20:18.699474+07:00 portland kernel: [ 931.878274] R13: 0000000000000008 R14: 0000000000000008 R15: ffff9b30c09c0000
2023-06-11T13:20:18.699474+07:00 portland kernel: [ 931.878275] FS: 0000000000000000(0000) GS:ffff9b37bf000000(0000) knlGS:0000000000000000
2023-06-11T13:20:18.699475+07:00 portland kernel: [ 931.878276] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-06-11T13:20:18.699475+07:00 portland kernel: [ 931.878278] CR2: 00005561beebe928 CR3: 00000001938b6000 CR4: 0000000000350ee0
2023-06-11T13:20:18.699476+07:00 portland kernel: [ 931.878279] Call Trace:
2023-06-11T13:20:18.699477+07:00 portland kernel: [ 931.878281]
2023-06-11T13:20:18.699477+07:00 portland kernel: [ 931.878282] ? arch_scale_freq_tick+0x3a/0x120
2023-06-11T13:20:18.699478+07:00 portland kernel: [ 931.878287] scheduler_tick+0x9a/0x330
2023-06-11T13:20:18.699478+07:00 portland kernel: [ 931.878290] update_process_times+0x89/0xb0
2023-06-11T13:20:18.699478+07:00 portland kernel: [ 931.878293] tick_sched_handle+0x29/0x70
2023-06-11T13:20:18.699479+07:00 portland kernel: [ 931.878296] tick_sched_timer+0x70/0x90
2023-06-11T13:20:18.699479+07:00 portland kernel: [ 931.878298] ? pfx_tick_sched_timer+0x10/0x10
2023-06-11T13:20:18.699479+07:00 portland kernel: [ 931.878300] __hrtimer_run_queues+0x108/0x280
2023-06-11T13:20:18.699480+07:00 portland kernel: [ 931.878303] hrtimer_interrupt+0xf6/0x250
2023-06-11T13:20:18.699480+07:00 portland kernel: [ 931.878306] sysvec_apic_timer_interrupt+0x62/0x140
2023-06-11T13:20:18.699481+07:00 portland kernel: [ 931.878309] sysvec_apic_timer_interrupt+0x8d/0xd0
2023-06-11T13:20:18.699481+07:00 portland kernel: [ 931.878313]
2023-06-11T13:20:18.699481+07:00 portland kernel: [ 931.878314]
2023-06-11T13:20:18.699481+07:00 portland kernel: [ 931.878315] asm_sysvec_apic_timer_interrupt+0x1b/0x20
2023-06-11T13:20:18.699482+07:00 portland kernel: [ 931.878318] RIP: 0010:cpuidle_enter_state+0xde/0x6f0
2023-06-11T13:20:18.699482+07:00 portland kernel: [ 931.878322] Code: f3 ce 6c e8 04 d1 42 ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 62 bb 41 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00 00
2023-06-11T13:20:18.699482+07:00 portland kernel: [ 931.878323] RSP: 0018:ffffbf508019fe28 EFLAGS: 00000246
2023-06-11T13:20:18.699483+07:00 portland kernel: [ 931.878324] RAX: 0000000000000000 RBX: ffff9b30c47ec000 RCX: 0000000000000000
2023-06-11T13:20:18.699483+07:00 portland kernel: [ 931.878325] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
2023-06-11T13:20:18.699484+07:00 portland kernel: [ 931.878326] RBP: ffffbf508019fe78 R08: 0000000000000000 R09: 0000000000000000
2023-06-11T13:20:18.699484+07:00 portland kernel: [ 931.878327] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff952d51e0
2023-06-11T13:20:18.699484+07:00 portland kernel: [ 931.878328] R13: 0000000000000002 R14: 0000000000000002 R15: 000000d8f8444d31
2023-06-11T13:20:18.699485+07:00 portland kernel: [ 931.878331] ? cpuidle_enter_state+0xce/0x6f0
2023-06-11T13:20:18.699485+07:00 portland kernel: [ 931.878333] cpuidle_enter+0x2e/0x50
2023-06-11T13:20:18.699485+07:00 portland kernel: [ 931.878335] cpuidle_idle_call+0x153/0x1e0
2023-06-11T13:20:18.699486+07:00 portland kernel: [ 931.878338] do_idle+0x82/0x100
2023-06-11T13:20:18.699486+07:00 portland kernel: [ 931.878339] cpu_startup_entry+0x1d/0x20
2023-06-11T13:20:18.699486+07:00 portland kernel: [ 931.878341] start_secondary+0x122/0x160
2023-06-11T13:20:18.699487+07:00 portland kernel: [ 931.878343] secondary_startup_64_no_verify+0xe5/0xeb
2023-06-11T13:20:18.699487+07:00 portland kernel: [ 931.878348]
2023-06-11T13:20:18.699487+07:00 portland kernel: [ 931.878348] ---[ end trace 0000000000000000 ]---
2023-06-11T13:20:18.699488+07:00 portland kernel: [ 931.881305] BUG: kernel NULL pointer dereference, address: 0000000000000004
2023-06-11T13:20:18.699488+07:00 portland kernel: [ 931.881312] #PF: supervisor read access in kernel mode
2023-06-11T13:20:18.699488+07:00 portland kernel: [ 931.881315] #PF: error_code(0x0000) - not-present page
2023-06-11T13:20:18.699489+07:00 portland kernel: [ 931.881318] PGD 0 P4D 0
2023-06-11T13:20:18.699489+07:00 portland kernel: [ 931.881321] Oops: 0000 [#1] PREEMPT SMP NOPTI
2023-06-11T13:20:18.699489+07:00 portland kernel: [ 931.881324] CPU: 1 PID: 69723 Comm: z_rd_int_1 Tainted: P W O 6.2.0-20-generic #20-Ubuntu
2023-06-11T13:20:18.699490+07:00 portland kernel: [ 931.881328] Hardware name: Micro-Star International Co., Ltd. MS-7D54/MAG X570S TORPEDO MAX (MS-7D54), BIOS A.60 04/29/2023
2023-06-11T13:20:18.699490+07:00 portland kernel: [ 931.881331] RIP: 0010:abd_is_gang+0x0/0x10 [zfs]
2023-06-11T13:20:18.699490+07:00 portland kernel: [ 931.881586] Code: 90 90 90 90 90 90 90 90 90 90 8b 07 c1 e8 05 83 e0 01 31 ff e9 91 25 e6 d1 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <8b> 07 c1 e8 06 83 e0 01 31 ff e9 71 25 e6 d1 90 90 90 90 90 90 90
2023-06-11T13:20:18.699491+07:00 portland kernel: [ 931.881591] RSP: 0018:ffffbf509558fd38 EFLAGS: 00010202
2023-06-11T13:20:18.699491+07:00 portland kernel: [ 931.881594] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
2023-06-11T13:20:18.699491+07:00 portland kernel: [ 931.881596] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
2023-06-11T13:20:18.699492+07:00 portland kernel: [ 931.881598] RBP: ffffbf509558fd48 R08: 0000000000000000 R09: 0000000000000000
2023-06-11T13:20:18.699492+07:00 portland kernel: [ 931.881601] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b30e5d11000
2023-06-11T13:20:18.699493+07:00 portland kernel: [ 931.881603] R13: 0000000000000000 R14: ffff9b3314401940 R15: ffff9b3315a33f60
2023-06-11T13:20:18.699493+07:00 portland kernel: [ 931.881605] FS: 0000000000000000(0000) GS:ffff9b37bee40000(0000) knlGS:0000000000000000
2023-06-11T13:20:18.699493+07:00 portland kernel: [ 931.881609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-06-11T13:20:18.699493+07:00 portland kernel: [ 931.881612] CR2: 0000000000000004 CR3: 0000000194bf0000 CR4: 0000000000350ee0
2023-06-11T13:20:18.699494+07:00 portland kernel: [ 931.881615] Call Trace:
2023-06-11T13:20:18.699494+07:00 portland kernel: [ 931.881616]
2023-06-11T13:20:18.699495+07:00 portland kernel: [ 931.881618] ? abd_free+0x1b/0xb0 [zfs]
2023-06-11T13:20:18.699495+07:00 portland kernel: [ 931.881727] vdev_raidz_row_free+0x38/0xa0 [zfs]
2023-06-11T13:20:18.699495+07:00 portland kernel: [ 931.881877] vdev_raidz_map_free+0x29/0x60 [zfs]
2023-06-11T13:20:18.699495+07:00 portland kernel: [ 931.882019] vdev_raidz_map_free_vsd+0x15/0x20 [zfs]
2023-06-11T13:20:18.699502+07:00 portland kernel: [ 931.882152] zio_vdev_io_assess+0x52/0x2f0 [zfs]
2023-06-11T13:20:18.825539+07:00 portland kernel: [ 931.882282] zio_execute+0x92/0xf0 [zfs]
2023-06-11T13:20:18.825556+07:00 portland kernel: [ 931.882406] taskq_thread+0x229/0x400 [spl]
2023-06-11T13:20:18.825556+07:00 portland kernel: [ 931.882420] ? __pfx_default_wake_function+0x10/0x10
2023-06-11T13:20:18.825557+07:00 portland kernel: [ 931.882424] ? pfx_zio_execute+0x10/0x10 [zfs]
2023-06-11T13:20:18.825558+07:00 portland kernel: [ 931.882547] ? pfx_taskq_thread+0x10/0x10 [spl]
2023-06-11T13:20:18.825558+07:00 portland kernel: [ 931.882558] kthread+0xe9/0x110
2023-06-11T13:20:18.825559+07:00 portland kernel: [ 931.882562] ? __pfx_kthread+0x10/0x10
2023-06-11T13:20:18.825560+07:00 portland kernel: [ 931.882566] ret_from_fork+0x2c/0x50
2023-06-11T13:20:18.825560+07:00 portland kernel: [ 931.882570]
2023-06-11T13:20:18.825560+07:00 portland kernel: [ 931.882571] Modules linked in: tls vhost_net vhost vhost_iotlb tap bridge stp llc cfg80211 binfmt_misc nls_iso8859_1 snd_hda_codec_hdmi intel_rapl_msr snd_hda_intel zfs(PO) intel_rapl_common snd_intel_dspcfg snd_usb_audio snd_intel_sdw_acpi snd_usbmidi_lib zunicode(PO) edac_mce_amd snd_hda_codec snd_rawmidi zzstd(O) kvm_amd snd_hda_core snd_seq_device zlua(O) mc snd_hwdep zavl(PO) kvm snd_pcm icp(PO) snd_timer irqbypass zcommon(PO) snd rapl znvpair(PO) wmi_bmof k10temp ccp soundcore spl(O) joydev mac_hid nfsd auth_rpcgss nfs_acl lockd dm_multipath scsi_dh_rdac scsi_dh_emc grace scsi_dh_alua msr efi_pstore sunrpc dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nouveau mxm_wmi i2c_algo_bit drm_ttm_helper ttm drm_display_helper cec hid_generic crct10dif_pclmul rc_core crc32_pclmul drm_kms_helper polyval_clmulni syscopyarea polyval_generic sysfillrect
2023-06-11T13:20:18.825561+07:00 portland kernel: [ 931.882612] usbhid ghash_clmulni_intel sysimgblt hid sha512_ssse3 aesni_intel nvme drm crypto_simd r8169 ahci xhci_pci nvme_core video cryptd i2c_piix4 libahci xhci_pci_renesas realtek nvme_common wmi
2023-06-11T13:20:18.825562+07:00 portland kernel: [ 931.882639] CR2: 0000000000000004
2023-06-11T13:20:18.825562+07:00 portland kernel: [ 931.882642] ---[ end trace 0000000000000000 ]---
2023-06-11T13:20:18.825563+07:00 portland kernel: [ 932.008113] RIP: 0010:abd_is_gang+0x0/0x10 [zfs]
2023-06-11T13:20:18.825563+07:00 portland kernel: [ 932.008244] Code: 90 90 90 90 90 90 90 90 90 90 8b 07 c1 e8 05 83 e0 01 31 ff e9 91 25 e6 d1 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <8b> 07 c1 e8 06 83 e0 01 31 ff e9 71 25 e6 d1 90 90 90 90 90 90 90
2023-06-11T13:20:18.825564+07:00 portland kernel: [ 932.008249] RSP: 0018:ffffbf509558fd38 EFLAGS: 00010202
2023-06-11T13:20:18.825565+07:00 portland kernel: [ 932.008252] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
2023-06-11T13:20:18.825574+07:00 portland kernel: [ 932.008255] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
2023-06-11T13:20:18.825575+07:00 portland kernel: [ 932.008257] RBP: ffffbf509558fd48 R08: 0000000000000000 R09: 0000000000000000
2023-06-11T13:20:18.825576+07:00 portland kernel: [ 932.008260] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b30e5d11000
2023-06-11T13:20:18.825576+07:00 portland kernel: [ 932.008262] R13: 0000000000000000 R14: ffff9b3314401940 R15: ffff9b3315a33f60
2023-06-11T13:20:18.825577+07:00 portland kernel: [ 932.008265] FS: 0000000000000000(0000) GS:ffff9b37bee40000(0000) knlGS:0000000000000000
2023-06-11T13:20:18.825578+07:00 portland kernel: [ 932.008268] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-06-11T13:20:18.825579+07:00 portland kernel: [ 932.008270] CR2: 0000000000000004 CR3: 0000000194bf0000 CR4: 0000000000350ee0
2023-06-11T13:20:18.825579+07:00 portland kernel: [ 932.008273] note: z_rd_int_1[69723] exited with irqs disabled
2023-06-11T13:20:19.043091+07:00 portland zed: eid=20 class=checksum pool='Universe' vdev=sdd1 size=28672 offset=23001530368 priority=4 err=0 flags=0x1008b0 bookmark=387:128:0:55891
2023-06-11T13:20:19.518599+07:00 portland zed: eid=21 class=checksum pool='Universe' vdev=sdc1 size=28672 offset=23101927424 priority=4 err=0 flags=0x1008b0 bookmark=387:128:0:59567
2023-06-11T13:20:20.768537+07:00 portland zed: eid=22 class=checksum pool='Universe' vdev=sdc1 size=28672 offset=23366561792 priority=4 err=0 flags=0x1008b0 bookmark=387:128:0:69257

"""

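For anyone sifting through a similar log, here is a minimal sketch that tallies the zed `class=checksum` events per vdev. It assumes the `key=value` layout seen in the zed lines above and nothing more; the function name `tally_checksum_events` is made up for illustration and is not part of any ZFS tooling.

```python
from collections import Counter


def tally_checksum_events(log_text):
    """Count zed class=checksum events per vdev in a syslog excerpt."""
    counts = Counter()
    # Split on the "zed:" tag so it works whether or not each entry
    # sits on its own line.
    for chunk in log_text.split(" zed: ")[1:]:
        fields = {}
        for token in chunk.split():
            key, sep, value = token.partition("=")
            if not sep:
                break  # first token that is not key=value ends this event
            fields[key] = value.strip("'")
        if fields.get("class") == "checksum":
            counts[fields.get("vdev", "?")] += 1
    return counts
```

Run against the excerpt above, this counts one event for sdd1 and two for sdc1.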
almightiest commented 1 year ago

Sounds like a bad power supply or a power issue under load to me... If the system also freezes outside of a scrub operation (under heavy I/O, or randomly), then it doesn't seem like a ZFS-specific bug?

frawau commented 1 year ago

Thanks.

Good idea, I will replace the power supply. Just in case.

rincebrain commented 1 year ago

FWIW, if a raidz stripe fails its checksum and can't be reconstructed, the error gets counted against all the disks in the stripe, since, at least in the single-parity case, ZFS can't know which one is wrong.
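
A toy model of that behaviour, as a sketch only: it uses XOR single parity and SHA-256 as stand-ins, and it is not the OpenZFS code path. The names `xor_blocks` and `blame` are hypothetical.

```python
from functools import reduce
from hashlib import sha256


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (single parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def blame(columns, parity, expected):
    """Return the column indices to charge with a checksum error.

    columns:  data blocks as read from each disk (list of bytes)
    parity:   parity block as read
    expected: SHA-256 digest of the original, concatenated data
    The parity column is reported as index len(columns).
    """
    if sha256(b"".join(columns)).digest() == expected:
        return []  # stripe verifies, nobody is blamed
    # Assume each data column in turn is the bad one and rebuild it
    # from the remaining columns plus parity.
    for i in range(len(columns)):
        others = columns[:i] + columns[i + 1:]
        rebuilt = xor_blocks(others + [parity])
        candidate = columns[:i] + [rebuilt] + columns[i + 1:]
        if sha256(b"".join(candidate)).digest() == expected:
            return [i]  # repair worked, so this column was the culprit
    # No single-column repair verifies: no way to tell who is wrong,
    # so every column in the stripe is charged.
    return list(range(len(columns) + 1))
```

With one corrupted column the model singles it out. With two corrupted columns no rebuild verifies, and every disk in the stripe picks up the same error, which is how identical CKSUM counts across all devices can come about.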