openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

spl_dynamic_tas Kernel panic on Centos Stream #13371

Open serg-ku opened 2 years ago

serg-ku commented 2 years ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | CentOS |
| Distribution Version | Stream release 8 |
| Kernel Version | 4.18.0-373.el8.x86_64 |
| Architecture | x86_64 |
| OpenZFS Version | 2.0.7-1 |

Describe the problem you're observing

After upgrading from CentOS 8.2 with zfs 0.8.3 to CentOS Stream with zfs 2.0.7, a server that serves VM images from ZFS over NFS hit two kernel panics in succession. The host was up for ~4 days, then panicked and rebooted. About 2 hours later it panicked again, and it has stayed up since (~6 hours so far). According to monitoring, there was no unusual disk, CPU, or network activity before either panic.

Describe how to reproduce the problem

no idea

Include any warning/errors/backtraces from the system logs

[303900.253675] ------------[ cut here ]------------
[303901.386844] NMI watchdog: Watchdog detected hard LOCKUP on cpu 9Modules linked in: rpcsec_gss_krb5 team_mode_loadbalance 8021q garp mrp stp llc team nft_counter nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables_set nf_tables libcrc32c nfnetlink zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) intel_rapl_msr intel_rapl_common spl(OE) sb_edac iTCO_wdt x86_pkg_temp_thermal intel_powerclamp iTCO_vendor_support coretemp kvm_intel kvm irqbypass rapl intel_cstate intel_uncore pcspkr i2c_i801 mei_me ipmi_ssif lpc_ich joydev mei ioatdma acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace ext4 mbcache jbd2 raid1 sd_mod t10_pi sg ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper crct10dif_pclmul syscopyarea sysfillrect sysimgblt fb_sys_fops crc32_pclmul crc32c_intel drm ahci libahci ghash_clmulni_intel libata igb i40e dca i2c_algo_bit wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
[303901.386867] CPU: 9 PID: 2073 Comm: spl_dynamic_tas Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-373.el8.x86_64 #1
[303901.386868] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 3.1c 05/02/2019
[303901.386868] RIP: 0010:native_queued_spin_lock_slowpath+0x5d/0x1b0
[303901.386868] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 00 75
[303901.386869] RSP: 0018:ffffbaa4876035c8 EFLAGS: 00000002
[303901.386869] RAX: 0000000000000101 RBX: ffff9e95c02b8000 RCX: 0000000000000009
[303901.386870] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9eabdfc6ae40
[303901.386870] RBP: ffff9eabdfc6ae40 R08: ffff9eabdfc6a760 R09: ffff9e94c0400098
[303901.386871] R10: 0000000000000000 R11: ffffffff9565b548 R12: 0000000000000000
[303901.386871] R13: ffff9e95c02b8bbc R14: 0000000000000087 R15: 0000000000000009
[303901.386871] FS:  0000000000000000(0000) GS:ffff9eabdfc40000(0000) knlGS:0000000000000000
[303901.386872] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[303901.386872] CR2: 00007f0ea85e0000 CR3: 0000000c0fc10006 CR4: 00000000001706e0
[303901.386872] Call Trace:
[303901.386872]  _raw_spin_lock+0x1a/0x20
[303901.386873]  try_to_wake_up+0x15d/0x510
[303901.386873]  __queue_work+0x13d/0x3e0
[303901.386873]  queue_work_on+0x34/0x40
[303901.386873]  soft_cursor+0x194/0x220
[303901.386874]  bit_cursor+0x3d2/0x610
[303901.386874]  ? bit_putcs+0x550/0x550
[303901.386874]  ? fbcon_cursor+0xff/0x170
[303901.386874]  hide_cursor+0x2a/0xa0
[303901.386875]  vt_console_print+0x3bd/0x400
[303901.386875]  console_unlock+0x35f/0x4a0
[303901.386875]  vprintk_emit+0x14d/0x250
[303901.386876]  printk+0x58/0x6f
[303901.386876]  __warn_printk+0x46/0x87
[303901.386876]  ? irq_work_queue+0x16/0x20
[303901.386876]  update_blocked_averages+0x6af/0x6e0
[303901.386877]  newidle_balance+0xcb/0x3c0
[303901.386877]  ? __switch_to_asm+0x41/0x70
[303901.386877]  pick_next_task_fair+0x3e/0x3b0
[303901.386877]  __schedule+0x146/0x830
[303901.386878]  schedule+0x35/0xa0
[303901.386878]  schedule_timeout+0x274/0x300
[303901.386878]  ? remove_entity_load_avg+0x31/0x80
[303901.386878]  ? check_preempt_curr+0x7a/0x90
[303901.386879]  ? ttwu_do_wakeup+0x19/0x160
[303901.386879]  wait_for_completion_killable+0xb6/0x160
[303901.386879]  __kthread_create_on_node+0xf4/0x1b0
[303901.386879]  ? __switch_to_asm+0x35/0x70
[303901.386880]  ? __switch_to_asm+0x41/0x70
[303901.386880]  ? taskq_thread_spawn+0x50/0x50 [spl]
[303901.386880]  kthread_create_on_node+0x49/0x60
[303901.386881]  spl_kthread_create+0x82/0xd0 [spl]
[303901.386881]  taskq_thread_create+0x61/0xe0 [spl]
[303901.386881]  taskq_thread_spawn_task+0xe/0x30 [spl]
[303901.386881]  taskq_thread+0x2d8/0x510 [spl]
[303901.386882]  ? wake_up_q+0x70/0x70
[303901.386882]  ? taskq_thread_spawn+0x50/0x50 [spl]
[303901.386882]  kthread+0x10a/0x120
[303901.386882]  ? set_kthread_struct+0x40/0x40
[303901.386883]  ret_from_fork+0x35/0x40
[303901.386883] Kernel panic - not syncing: Hard LOCKUP
[303901.386883] CPU: 9 PID: 2073 Comm: spl_dynamic_tas Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-373.el8.x86_64 #1
[303901.386884] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 3.1c 05/02/2019
[303901.386884] Call Trace:
[303901.386884]  <NMI>
[303901.386885]  dump_stack+0x41/0x60
[303901.386885]  panic+0xe7/0x2ac
[303901.386885]  ? __switch_to_asm+0x51/0x70
[303901.386885]  nmi_panic.cold.11+0xc/0xc
[303901.386886]  watchdog_overflow_callback.cold.7+0x5c/0x70
[303901.386886]  __perf_event_overflow+0x52/0xf0
[303901.386886]  handle_pmi_common+0x1f7/0x2d0
[303901.386886]  ? __set_pte_vaddr+0x32/0x50
[303901.386887]  ? __native_set_fixmap+0x24/0x30
[303901.386887]  intel_pmu_handle_irq+0xeb/0x410
[303901.386887]  perf_event_nmi_handler+0x2d/0x50
[303901.386887]  nmi_handle+0x63/0x110
[303901.386888]  default_do_nmi+0x49/0x100
[303901.386888]  do_nmi+0x1af/0x220
[303901.386888]  end_repeat_nmi+0x16/0x6f
[303901.386888] RIP: 0010:native_queued_spin_lock_slowpath+0x5d/0x1b0
[303901.386889] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 00 75
[303901.386889] RSP: 0018:ffffbaa4876035c8 EFLAGS: 00000002
[303901.386890] RAX: 0000000000000101 RBX: ffff9e95c02b8000 RCX: 0000000000000009
[303901.386890] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9eabdfc6ae40
[303901.386891] RBP: ffff9eabdfc6ae40 R08: ffff9eabdfc6a760 R09: ffff9e94c0400098
[303901.386891] R10: 0000000000000000 R11: ffffffff9565b548 R12: 0000000000000000
[303901.386891] R13: ffff9e95c02b8bbc R14: 0000000000000087 R15: 0000000000000009
[303901.386892]  ? native_queued_spin_lock_slowpath+0x5d/0x1b0
[303901.386892]  ? native_queued_spin_lock_slowpath+0x5d/0x1b0
[303901.386892]  </NMI>
[303901.386893]  _raw_spin_lock+0x1a/0x20
[303901.386893]  try_to_wake_up+0x15d/0x510
[303901.386893]  __queue_work+0x13d/0x3e0
[303901.386893]  queue_work_on+0x34/0x40
[303901.386894]  soft_cursor+0x194/0x220
[303901.386894]  bit_cursor+0x3d2/0x610
[303901.386894]  ? bit_putcs+0x550/0x550
[303901.386894]  ? fbcon_cursor+0xff/0x170
[303901.386895]  hide_cursor+0x2a/0xa0
[303901.386895]  vt_console_print+0x3bd/0x400
[303901.386895]  console_unlock+0x35f/0x4a0
[303901.386895]  vprintk_emit+0x14d/0x250
[303901.386896]  printk+0x58/0x6f
[303901.386896]  __warn_printk+0x46/0x87
[303901.386896]  ? irq_work_queue+0x16/0x20
[303901.386896]  update_blocked_averages+0x6af/0x6e0
[303901.386897]  newidle_balance+0xcb/0x3c0
[303901.386897]  ? __switch_to_asm+0x41/0x70
[303901.386897]  pick_next_task_fair+0x3e/0x3b0
[303901.386897]  __schedule+0x146/0x830
[303901.386898]  schedule+0x35/0xa0
[303901.386898]  schedule_timeout+0x274/0x300
[303901.386898]  ? remove_entity_load_avg+0x31/0x80
[303901.386898]  ? check_preempt_curr+0x7a/0x90
[303901.386899]  ? ttwu_do_wakeup+0x19/0x160
[303901.386899]  wait_for_completion_killable+0xb6/0x160
[303901.386899]  __kthread_create_on_node+0xf4/0x1b0
[303901.386899]  ? __switch_to_asm+0x35/0x70
[303901.386900]  ? __switch_to_asm+0x41/0x70
[303901.386900]  ? taskq_thread_spawn+0x50/0x50 [spl]
[303901.386900]  kthread_create_on_node+0x49/0x60
[303901.386900]  spl_kthread_create+0x82/0xd0 [spl]
[303901.386901]  taskq_thread_create+0x61/0xe0 [spl]
[303901.386901]  taskq_thread_spawn_task+0xe/0x30 [spl]
[303901.386901]  taskq_thread+0x2d8/0x510 [spl]
[303901.386902]  ? wake_up_q+0x70/0x70
[303901.386902]  ? taskq_thread_spawn+0x50/0x50 [spl]
[303901.386902]  kthread+0x10a/0x120
[303901.386902]  ? set_kthread_struct+0x40/0x40
[303901.386903]  ret_from_fork+0x35/0x40
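
For anyone triaging a trace like the one above, here is a minimal Python sketch that pulls the stack frames out of dmesg-style text and lists those belonging to a given module such as `spl`. The names `FRAME_RE`, `parse_frames`, and `reliable_module_frames` are hypothetical helpers invented for this illustration and are not part of OpenZFS or any kernel tooling. The sketch assumes frames look like `symbol+0xoff/0xsize [module]`, as in the log above, and it treats frames prefixed with `?` as entries the unwinder could not confirm.

```python
import re

# Matches a dmesg stack frame such as
# "[303901.386880]  ? taskq_thread_spawn+0x50/0x50 [spl]".
# Assumption: timestamps are enabled and frames use the symbol+0xoff/0xsize form.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(log_text):
    """Return (symbol, module_or_None, is_reliable) for every frame line."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append(
                (m.group("symbol"), m.group("module"), m.group("unreliable") is None)
            )
    return frames


def reliable_module_frames(log_text, module):
    """Return symbols from `module`, skipping frames marked with '?'."""
    return [
        sym
        for sym, mod, reliable in parse_frames(log_text)
        if reliable and mod == module
    ]


# A few lines copied from the trace above.
SAMPLE = """\
[303901.386879]  __kthread_create_on_node+0xf4/0x1b0
[303901.386880]  ? taskq_thread_spawn+0x50/0x50 [spl]
[303901.386880]  kthread_create_on_node+0x49/0x60
[303901.386881]  spl_kthread_create+0x82/0xd0 [spl]
[303901.386881]  taskq_thread_create+0x61/0xe0 [spl]
[303901.386881]  taskq_thread_spawn_task+0xe/0x30 [spl]
[303901.386881]  taskq_thread+0x2d8/0x510 [spl]
"""

if __name__ == "__main__":
    for symbol in reliable_module_frames(SAMPLE, "spl"):
        print(symbol)
```

Run against the full log, this kind of filter makes it easier to see that the `spl` frames are all on the dynamic taskq thread-creation path, while the spinning lock itself is taken further up, in the scheduler and console printing code.
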
serg-ku commented 2 years ago

Looks like the same bug was hit in issue https://github.com/openzfs/zfs/issues/13201

serg-ku commented 2 years ago

No panics with kernel 4.18.0-383.el8 for almost a month; it looks like something was fixed.

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.