Open dhagberg opened 4 years ago
Could this be some kind of resource starvation issue where the kernel is unable to create a new kthread?
Should I be looking at the zfs ARC size vs available kernel memory?
Or does this look like a legit bug?
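For reference, a quick way to compare the ARC against overall memory on a running system (paths are the standard ZoL kstat and procfs interfaces; the awk field positions assume the usual three-column arcstats layout):
# awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3/1048576 " MiB"}' /proc/spl/kstat/zfs/arcstats
# grep -E 'MemFree|MemAvailable|SUnreclaim' /proc/meminfo
If the ARC size sits near c_max while MemFree is low, reclaim pressure rather than a leak is the more likely culprit.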
It looks like an issue we also hit while working on ZSTD. It seems RHEL (and RHEL-based distros) has serious issues with kmem allocations.
Thanks @Ornias1993 -- did you find a workaround? Is it more stable under heavy IO on the latest Ubuntu LTS release?
For zstd we simply opted to use a totally different system for memory allocation... which solved the problem there.
To be honest, I haven't seen this error in normal use/testing, even on RHEL-based systems...
Latest stack trace at 2019-12-09 19:09:22Z also has spl_kthread_create:
[ 8.709736] type=1305 audit(1575484473.397:3): audit_pid=2357 old=0 auid=4294967295 ses=4294967295 res=1
[ 9.192114] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[ 9.223197] NET: Registered protocol family 40
[ 9.600365] vmxnet3 0000:0b:00.0 eno16780032: intr type 3, mode 0, 5 vectors allocated
[ 9.601841] vmxnet3 0000:0b:00.0 eno16780032: NIC Link is Up 10000 Mbps
[434100.315676] INFO: task spl_dynamic_tas:1171 blocked for more than 160 seconds.
[434100.315706] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[434100.315731] spl_dynamic_tas D ffff8f1435cd3150 0 1171 2 0x00000000
[434100.315757] Call Trace:
[434100.315773] [<ffffffffa2f7fb09>] schedule+0x29/0x70
[434100.315816] [<ffffffffa2f7d491>] schedule_timeout+0x221/0x2d0
[434100.315840] [<ffffffffa28e10b6>] ? select_task_rq_fair+0x5a6/0x760
[434100.315861] [<ffffffffa2f7febd>] wait_for_completion+0xfd/0x140
[434100.315883] [<ffffffffa28db1d0>] ? wake_up_state+0x20/0x20
[434100.315910] [<ffffffffc061c540>] ? taskq_thread_spawn+0x60/0x60 [spl]
[434100.315933] [<ffffffffa28c604a>] kthread_create_on_node+0xaa/0x140
[434100.315955] [<ffffffffa2b8caeb>] ? string.isra.7+0x3b/0xf0
[434100.315977] [<ffffffffc061c540>] ? taskq_thread_spawn+0x60/0x60 [spl]
[434100.316001] [<ffffffffc061c540>] ? taskq_thread_spawn+0x60/0x60 [spl]
[434100.316025] [<ffffffffc061dbfc>] spl_kthread_create+0x9c/0xf0 [spl]
[434100.316049] [<ffffffffc061d39b>] taskq_thread_create+0x6b/0x110 [spl]
[434100.316072] [<ffffffffc061d452>] taskq_thread_spawn_task+0x12/0x40 [spl]
[434100.316096] [<ffffffffc061c7ec>] taskq_thread+0x2ac/0x4f0 [spl]
[434100.316116] [<ffffffffa28db1d0>] ? wake_up_state+0x20/0x20
[434100.316691] [<ffffffffc061c540>] ? taskq_thread_spawn+0x60/0x60 [spl]
[434100.317159] [<ffffffffa28c61f1>] kthread+0xd1/0xe0
[434100.317613] [<ffffffffa28c6120>] ? insert_kthread_work+0x40/0x40
[434100.318074] [<ffffffffa2f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
[434100.318535] [<ffffffffa28c6120>] ? insert_kthread_work+0x40/0x40
[434100.319014] sending NMI to all CPUs:
[434100.320586] NMI backtrace for cpu 0 skipped: idling at pc 0xffffffffa2f81beb
[434100.321078] NMI backtrace for cpu 1
[434100.321567] CPU: 1 PID: 40 Comm: khungtaskd Kdump: loaded Tainted: P OE ------------ 3.10.0-1062.4.3.el7.x86_64 #1
[434100.322095] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[434100.322633] task: ffff8f173fe820e0 ti: ffff8f173fbec000 task.ti: ffff8f173fbec000
[434100.323202] RIP: 0010:[<ffffffffa286d5ba>] [<ffffffffa286d5ba>] native_write_msr_safe+0xa/0x10
[434100.323790] RSP: 0018:ffff8f173fbefdb8 EFLAGS: 00000046
[434100.324362] RAX: 0000000000000400 RBX: 0000000000000001 RCX: 0000000000000830
[434100.324951] RDX: 0000000000000002 RSI: 0000000000000400 RDI: 0000000000000830
[434100.325530] RBP: ffff8f173fbefdb8 R08: ffffffffa35577a0 R09: ffff8f143547fac0
[434100.326135] R10: 0000000000000619 R11: ffffb4c8822a79d8 R12: ffffffffa35577a0
[434100.326737] R13: 0000000000000001 R14: 000000000000e026 R15: 0000000000000002
[434100.327328] FS: 0000000000000000(0000) GS:ffff8f173fc40000(0000) knlGS:0000000000000000
[434100.327943] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[434100.328560] CR2: 0000562e45714b10 CR3: 0000000332d0e000 CR4: 00000000001607e0
[434100.329199] Call Trace:
[434100.329825] [<ffffffffa28634f2>] __x2apic_send_IPI_mask+0xb2/0xe0
[434100.330462] [<ffffffffa2863593>] x2apic_send_IPI_mask+0x13/0x20
[434100.331111] [<ffffffffa285e923>] arch_trigger_all_cpu_backtrace+0x2c3/0x2d0
[434100.331792] [<ffffffffa294d990>] watchdog+0x260/0x2c0
[434100.332450] [<ffffffffa294d730>] ? reset_hung_task_detector+0x20/0x20
[434100.333118] [<ffffffffa28c61f1>] kthread+0xd1/0xe0
[434100.333795] [<ffffffffa28c6120>] ? insert_kthread_work+0x40/0x40
[434100.334446] [<ffffffffa2f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
[434100.335088] [<ffffffffa28c6120>] ? insert_kthread_work+0x40/0x40
[434100.335722] Code: 00 55 89 f9 48 89 e5 0f 32 31 c9 89 c0 48 c1 e2 20 89 0e 48 09 c2 48 89 d0 5d c3 66 0f 1f 44 00 00 55 89 f0 89 f9 48 89 e5 0f 30 <31> c0 5d c3 66 90 55 89 f9 48 89 e5 0f 33 89 c0 48 c1 e2 20 48
[434100.337060] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffffa2f81beb
[434100.337752] NMI backtrace for cpu 3 skipped: idling at pc 0xffffffffa2f81beb
[434100.338425] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffffa2f81beb
[434100.339100] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffffa2f81beb
[434100.339766] Kernel panic - not syncing: hung_task: blocked tasks
[434100.340430] CPU: 1 PID: 40 Comm: khungtaskd Kdump: loaded Tainted: P OE ------------ 3.10.0-1062.4.3.el7.x86_64 #1
[434100.341119] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[434100.341842] Call Trace:
[434100.342541] [<ffffffffa2f79ba4>] dump_stack+0x19/0x1b
[434100.343256] [<ffffffffa2f73947>] panic+0xe8/0x21f
[434100.343987] [<ffffffffa294d99e>] watchdog+0x26e/0x2c0
[434100.344696] [<ffffffffa294d730>] ? reset_hung_task_detector+0x20/0x20
[434100.345411] [<ffffffffa28c61f1>] kthread+0xd1/0xe0
[434100.346124] [<ffffffffa28c6120>] ? insert_kthread_work+0x40/0x40
[434100.346846] [<ffffffffa2f8cd37>] ret_from_fork_nospec_begin+0x21/0x21
[434100.347566] [<ffffffffa28c6120>] ? insert_kthread_work+0x40/0x40
Additional notes:
Prior to the latest panic, I had made the following changes to use the noop
scheduler on this device:
# cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
# echo noop > /sys/block/sdc/queue/scheduler
# cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
And a udev rule to ensure that change was applied on reboot:
# cat /etc/udev/rules.d/70-ioschedulers.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ENV{ID_FS_UUID}=="14120504590319066518", ATTR{queue/scheduler}="noop"
Those config changes had been made at 2019-12-04 19:37Z, approximately 5 days before the most recent stall/panic.
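For what it's worth, the rule can be exercised without a reboot to confirm it matches the device (udevadm options as in systemd 219, the CentOS 7 version):
# udevadm control --reload-rules
# udevadm trigger --action=change --sysname-match=sdc
# cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq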
I have hosts which run into this behavior; they are all CentOS 7 machines with current kernels on AWS or GCP. If there's anything I can do to gather information beyond what dhagberg did, feel free to reach out. We see this happen on hosts which are low on memory and under reasonable load, but not actually out of memory yet.
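If it helps, a minimal set of things worth capturing while a host is stalling (a suggestion, assuming sysrq is enabled and /proc/spl is available):
# cat /proc/spl/kstat/zfs/arcstats > /tmp/arcstats.txt   # ARC size/target at the time of the stall
# cat /proc/slabinfo > /tmp/slabinfo.txt                 # kernel slab usage, incl. SPL caches
# echo w > /proc/sysrq-trigger                           # dump stacks of blocked (D-state) tasks to dmesg
# dmesg > /tmp/dmesg-stall.txt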
I just had a very similar issue. The system is a CentOS Linux release 7.9.2009 (Core) hypervisor running the 3.10.0-1160.31.1.el7.x86_64 kernel and ZFS kmod 2.0.5-1, with 32 GB of RAM. This is the panic as recorded in syslog:
Oct 21 15:11:22 kvm kernel: INFO: task spl_dynamic_tas:684 blocked for more than 120 seconds.
Oct 21 15:11:22 kvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 21 15:11:22 kvm kernel: spl_dynamic_tas D ffff94514a1c9080 0 684 2 0x00000000
Oct 21 15:11:22 kvm kernel: Call Trace:
Oct 21 15:11:22 kvm kernel: [<ffffffff8ab891e9>] schedule+0x29/0x70
Oct 21 15:11:22 kvm kernel: [<ffffffff8ab86eb1>] schedule_timeout+0x221/0x2d0
Oct 21 15:11:22 kvm kernel: [<ffffffff8a4d73f2>] ? check_preempt_curr+0x92/0xa0
Oct 21 15:11:22 kvm kernel: [<ffffffff8a4d7419>] ? ttwu_do_wakeup+0x19/0xe0
Oct 21 15:11:22 kvm kernel: [<ffffffff8ab8959d>] wait_for_completion+0xfd/0x140
Oct 21 15:11:22 kvm kernel: [<ffffffff8a4dadc0>] ? wake_up_state+0x20/0x20
Oct 21 15:11:22 kvm kernel: [<ffffffffc0aa2230>] ? taskq_thread_spawn+0x60/0x60 [spl]
Oct 21 15:11:22 kvm kernel: [<ffffffff8a4c5c8a>] kthread_create_on_node+0xaa/0x140
Oct 21 15:11:22 kvm kernel: [<ffffffff8a79229b>] ? string.isra.7+0x3b/0xf0
Oct 21 15:11:22 kvm kernel: [<ffffffffc0aa2230>] ? taskq_thread_spawn+0x60/0x60 [spl]
Oct 21 15:11:22 kvm kernel: [<ffffffffc0aa2230>] ? taskq_thread_spawn+0x60/0x60 [spl]
Oct 21 15:11:22 kvm kernel: [<ffffffffc0aa391c>] spl_kthread_create+0x9c/0xf0 [spl]
Oct 21 15:11:22 kvm kernel: [<ffffffffc0aa30bb>] taskq_thread_create+0x6b/0x110 [spl]
Oct 21 15:11:22 kvm kernel: [<ffffffffc0aa3172>] taskq_thread_spawn_task+0x12/0x40 [spl]
Oct 21 15:11:22 kvm kernel: [<ffffffffc0aa24f6>] taskq_thread+0x2c6/0x520 [spl]
Oct 21 15:11:22 kvm kernel: [<ffffffff8a4dadc0>] ? wake_up_state+0x20/0x20
Oct 21 15:11:22 kvm kernel: [<ffffffffc0aa2230>] ? taskq_thread_spawn+0x60/0x60 [spl]
Oct 21 15:11:22 kvm kernel: [<ffffffff8a4c5e31>] kthread+0xd1/0xe0
Oct 21 15:11:22 kvm kernel: [<ffffffff8a4c5d60>] ? insert_kthread_work+0x40/0x40
Oct 21 15:11:22 kvm kernel: [<ffffffff8ab95ddd>] ret_from_fork_nospec_begin+0x7/0x21
Oct 21 15:11:22 kvm kernel: [<ffffffff8a4c5d60>] ? insert_kthread_work+0x40/0x40
All running VMs were stalled, not responding to pings. The host itself was stalled with regard to disk I/O (it was impossible to log in via SSH, even though the root partition is not on ZFS), but network I/O was working (an SSH tunnel to another machine could be established). The host had >8GB of free memory, but considerable swap (~8GB) was in use, so I tried to investigate. It looked as if the nightly backup to an NFS host was pushing used memory into the swap area due to memory pressure caused by the ARC (with its target at 98%, ~15GB).
I ran swapoff -a to page in all swapped memory, which left the system with a reduced ARC (~9 GB) and ~4GB of free memory. I then set vm.swappiness=0 and read a big file from the ZFS dataset. At this point the machine slowly froze to a complete halt (livelocked?). I had to reboot it via IPMI.
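One mitigation for the memory-pressure side (not a fix for the hang itself, and assuming ARC growth is what pushed the system into swap) is to cap the ARC target, e.g. at 8 GiB on this 32 GB host:
# echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max               # runtime, in bytes
# echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf   # persists across reboots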
As a side note, in the previous days I had been unable to start a block-trace debug (via blktrace) due to a very fragmented memory layout (see the note after the trace below):
Oct 18 22:39:35 kvm kernel: blktrace: page allocation failure: order:4, mode:0xc0d0
Oct 18 22:39:35 kvm kernel: CPU: 3 PID: 11719 Comm: blktrace Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.31.1.el7.x86_64 #1
Oct 18 22:39:35 kvm kernel: Hardware name: Supermicro SYS-5039A-IL/X11SAE, BIOS 2.3 06/21/2018
Oct 18 22:39:35 kvm kernel: Call Trace:
Oct 18 22:39:35 kvm kernel: [<ffffffff8ab835a9>] dump_stack+0x19/0x1b
Oct 18 22:39:35 kvm kernel: [<ffffffff8a5c46c0>] warn_alloc_failed+0x110/0x180
Oct 18 22:39:35 kvm kernel: [<ffffffff8a5c925f>] __alloc_pages_nodemask+0x9df/0xbe0
Oct 18 22:39:35 kvm kernel: [<ffffffff8a618ea8>] alloc_pages_current+0x98/0x110
Oct 18 22:39:35 kvm kernel: [<ffffffff8a5e5ad8>] kmalloc_order+0x18/0x40
Oct 18 22:39:35 kvm kernel: [<ffffffff8a624876>] kmalloc_order_trace+0x26/0xa0
Oct 18 22:39:35 kvm kernel: [<ffffffff8a55d6e3>] relay_open+0x63/0x2c0
Oct 18 22:39:35 kvm kernel: [<ffffffff8a57cc2e>] do_blk_trace_setup+0x18e/0x2e0
Oct 18 22:39:35 kvm kernel: [<ffffffff8a57cf4f>] __blk_trace_setup+0x6f/0xe0
Oct 18 22:39:35 kvm kernel: [<ffffffff8a57e2c4>] blk_trace_ioctl+0xe4/0x160
Oct 18 22:39:35 kvm kernel: [<ffffffff8a768393>] blkdev_ioctl+0x533/0xa20
Oct 18 22:39:35 kvm kernel: [<ffffffff8a68ec41>] block_ioctl+0x41/0x50
Oct 18 22:39:35 kvm kernel: [<ffffffff8a6635c0>] do_vfs_ioctl+0x3a0/0x5b0
Oct 18 22:39:35 kvm kernel: [<ffffffff8a64ac2a>] ? __check_object_size+0x1ca/0x250
Oct 18 22:39:35 kvm kernel: [<ffffffff8a663871>] SyS_ioctl+0xa1/0xc0
Oct 18 22:39:35 kvm kernel: [<ffffffff8ab95f92>] system_call_fastpath+0x25/0x2a
Oct 18 22:39:35 kvm kernel: Mem-Info:
Oct 18 22:39:35 kvm kernel: active_anon:1423968 inactive_anon:273337 isolated_anon:0#012 active_file:1875 inactive_file:3035 isolated_file:0#012 unevictable:97 dirty:0 writeback:0 unstable:0#012 slab_reclaimable:24695 slab_unreclaimable:209997#012 mapped:2039 shmem:2659 pagetables:8555 bounce:0#012 free:2091832 free_pcp:0 free_cma:0
Oct 18 22:39:35 kvm kernel: Node 0 DMA free:15892kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15892kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Oct 18 22:39:35 kvm kernel: lowmem_reserve[]: 0 1977 31885 31885
Oct 18 22:39:35 kvm kernel: Node 0 DMA32 free:122612kB min:4188kB low:5232kB high:6280kB active_anon:7312kB inactive_anon:7632kB active_file:0kB inactive_file:0kB unevictable:388kB isolated(anon):0kB isolated(file):0kB present:2256136kB managed:2024468kB mlocked:0kB dirty:0kB writeback:0kB mapped:16kB shmem:388kB slab_reclaimable:3824kB slab_unreclaimable:32432kB kernel_stack:256kB pagetables:3332kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 18 22:39:35 kvm kernel: lowmem_reserve[]: 0 0 29908 29908
Oct 18 22:39:35 kvm kernel: Node 0 Normal free:8228824kB min:63360kB low:79200kB high:95040kB active_anon:5688560kB inactive_anon:1085716kB active_file:7500kB inactive_file:12140kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:31170560kB managed:30628736kB mlocked:0kB dirty:0kB writeback:0kB mapped:8140kB shmem:10248kB slab_reclaimable:94956kB slab_unreclaimable:807556kB kernel_stack:4816kB pagetables:30888kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 18 22:39:35 kvm kernel: lowmem_reserve[]: 0 0 0 0
Oct 18 22:39:35 kvm kernel: Node 0 DMA: 1*4kB (U) 2*8kB (U) 2*16kB (U) 1*32kB (U) 3*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
Oct 18 22:39:35 kvm kernel: Node 0 DMA32: 3117*4kB (UEM) 3040*8kB (UEM) 1794*16kB (UEM) 9*32kB (UEM) 192*64kB (UM) 334*128kB (UM) 7*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 122612kB
Oct 18 22:39:35 kvm kernel: Node 0 Normal: 772778*4kB (UEM) 591957*8kB (UEM) 25004*16kB (UEM) 78*32kB (UEM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8229328kB
Oct 18 22:39:35 kvm kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 18 22:39:35 kvm kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 18 22:39:35 kvm kernel: 502660 total pagecache pages
Oct 18 22:39:35 kvm kernel: 495049 pages in swap cache
Oct 18 22:39:35 kvm kernel: Swap cache stats: add 1209489865, delete 1208826352, find 1779792608/2553325301
Oct 18 22:39:35 kvm kernel: Free swap = 8448360kB
Oct 18 22:39:35 kvm kernel: Total swap = 16777212kB
Oct 18 22:39:35 kvm kernel: 8360671 pages RAM
Oct 18 22:39:35 kvm kernel: 0 pages HighMem/MovableOnly
Oct 18 22:39:35 kvm kernel: 193397 pages reserved
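Regarding the order:4 allocation failure above: the Mem-Info dump confirms the fragmentation (the Normal zone shows 0 free blocks at 64kB and larger, and order:4 is exactly a 64kB block). The layout can be inspected, and sometimes relieved, with the standard procfs knobs (compaction assumes CONFIG_COMPACTION):
# cat /proc/buddyinfo                    # free block counts per order, per zone
# echo 1 > /proc/sys/vm/compact_memory   # ask the kernel to compact free memory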
Similar problem: 3x SAS3 15TB SSDs in raidz, stalling on multiple physical machines (5 out of 12) at roughly the same time under a similar load, CentOS 7.9. Possibly when memory gets close to the limit (but does not exceed it: 230GB used of 256GB RAM), the OS stops responding for hours and a reboot is needed.
Jan 5 02:46:13 gaiadb06 kernel: INFO: task spl_dynamic_tas:967 blocked for more than 120 seconds.
Jan 5 02:46:13 gaiadb06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 5 02:46:13 gaiadb06 kernel: spl_dynamic_tas D ffff99943fec3150 0 967 2 0x00000000
Jan 5 02:46:13 gaiadb06 kernel: Call Trace:
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28e0008>] ? __enqueue_entity+0x78/0x80
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28e6baf>] ? enqueue_entity+0x2ef/0xbe0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2f80a09>] schedule+0x29/0x70
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2f7e511>] schedule_timeout+0x221/0x2d0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb296e8ad>] ? tracing_record_cmdline+0x1d/0x120
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb297701b>] ? probe_sched_wakeup+0x2b/0xa0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28d7845>] ? ttwu_do_wakeup+0xb5/0xe0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2f80dbd>] wait_for_completion+0xfd/0x140
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28db4c0>] ? wake_up_state+0x20/0x20
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1ee80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28c604a>] kthread_create_on_node+0xaa/0x140
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2b8d3fb>] ? string.isra.7+0x3b/0xf0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1ee80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1ee80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jan 5 02:46:13 gaiadb06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 5 02:46:13 gaiadb06 kernel: spl_dynamic_tas D ffff99943fec3150 0 967 2 0x00000000
Jan 5 02:46:13 gaiadb06 kernel: Call Trace:
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28e0008>] ? __enqueue_entity+0x78/0x80
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28e6baf>] ? enqueue_entity+0x2ef/0xbe0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2f80a09>] schedule+0x29/0x70
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2f7e511>] schedule_timeout+0x221/0x2d0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb296e8ad>] ? tracing_record_cmdline+0x1d/0x120
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb297701b>] ? probe_sched_wakeup+0x2b/0xa0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28d7845>] ? ttwu_do_wakeup+0xb5/0xe0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2f80dbd>] wait_for_completion+0xfd/0x140
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28db4c0>] ? wake_up_state+0x20/0x20
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1ee80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28c604a>] kthread_create_on_node+0xaa/0x140
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2b8d3fb>] ? string.isra.7+0x3b/0xf0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1ee80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1ee80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b2065c>] spl_kthread_create+0x9c/0xf0 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1fd0b>] taskq_thread_create+0x6b/0x110 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1fdc2>] taskq_thread_spawn_task+0x12/0x40 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b2065c>] spl_kthread_create+0x9c/0xf0 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1fd0b>] taskq_thread_create+0x6b/0x110 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1fdc2>] taskq_thread_spawn_task+0x12/0x40 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1f146>] taskq_thread+0x2c6/0x520 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1f146>] taskq_thread+0x2c6/0x520 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28db4c0>] ? wake_up_state+0x20/0x20
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28db4c0>] ? wake_up_state+0x20/0x20
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1ee80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffc0b1ee80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28c61f1>] kthread+0xd1/0xe0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28c61f1>] kthread+0xd1/0xe0
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28c6120>] ? insert_kthread_work+0x40/0x40
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28c6120>] ? insert_kthread_work+0x40/0x40
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2f8dd37>] ret_from_fork_nospec_begin+0x21/0x21
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb2f8dd37>] ret_from_fork_nospec_begin+0x21/0x21
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28c6120>] ? insert_kthread_work+0x40/0x40
Jan 5 02:46:13 gaiadb06 kernel: [<ffffffffb28c6120>] ? insert_kthread_work+0x40/0x40
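When stalls correlate with nearly-full memory like this, one knob sometimes raised as a mitigation (an assumption that allocation stalls under pressure are the trigger, not a confirmed fix for this bug) is the kernel's free-memory watermark, so that reclaim starts earlier:
# sysctl vm.min_free_kbytes              # current watermark
# sysctl -w vm.min_free_kbytes=1048576   # e.g. reserve 1 GiB on a 256 GB host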
System information
Describe the problem you're observing
System degrades and becomes unresponsive, eventually displaying a hung kernel task message on the console similar to those below.
Describe how to reproduce the problem
Unfortunately I do not have a reproducible set of conditions other than medium/high load on a production Zimbra mail server with the mail store and MySQL on ZFS.
Include any warning/errors/backtraces from the system logs
Note: the system is running with the following kernel config to force automatic panics and reboots in this condition, in order to avoid manual intervention:
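(The config block itself did not survive in this thread. As an assumption, not the reporter's exact values, the standard sysctls for that behavior look like:)
# /etc/sysctl.d/99-hung-task.conf -- hypothetical example
kernel.hung_task_panic = 1    # panic instead of just logging when a task is blocked past the timeout
kernel.panic = 60             # reboot 60 seconds after a panic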
Most recent panic from 2019-12-04-18:34:14Z:
Prior panic from 2019-11-28-02:12:28Z:
Prior panic from 2019-11-24-06:37:14: