openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Sporadic system hang with unresponsive zfs #7473

Closed: gerardba closed this issue 4 years ago

gerardba commented 6 years ago

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  9
Linux Kernel          4.9.0-4
Architecture          amd64
ZFS Version           0.7.4-1~bpo9+1
SPL Version           0.7.4-1~bpo9+1

Describe the problem you're observing

The kernel triggers the hung-task timeout. ZFS suddenly stalled: the zpool appears to be OK, but an 'ls' on the dataset never returns, and dmesg shows nfsd threads stuck in the kernel. The call trace is attached below.

More than an hour after failing the zpool over to another server (the single-port NVMe disk mapping was changed), the original server still believes the zpool is healthy. 'zfs get all' prints the pool-level information but stalls when it reaches the dataset; output attached below.

Any I/O to the ZFS dataset stalls.

This issue may be related to https://github.com/zfsonlinux/zfs/issues/7425

Describe how to reproduce the problem

This happened while the dataset was exported over NFS and client machines were compiling code on it.
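Roughly, the setup looks like this (a hedged sketch reconstructed from the dataset properties reported further below; the client mount point and build tree are placeholders, and other local properties such as dedup=skein, sync=disabled and xattr=off are omitted for brevity):

# on the server: create the dataset and export it over NFS
zfs create -o recordsize=512K -o compression=lz4 -o checksum=skein -o atime=off zpool/buildcache
zfs set sharenfs='rw=@10.0.0.0/255.0.0.0,async,no_root_squash' zpool/buildcache

# on an NFS client: mount the export and drive it with a parallel build
mount -t nfs kea2:/zpool/buildcache /mnt/buildcache
cd /mnt/buildcache/some-project && make -j"$(nproc)"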

Running a zpool scrub after failing over to another server shows no errors on the pool.

I still have the 'stuck' server powered on, in case I can get any further data off it.

Include any warning/errors/backtraces from the system logs

Kernel call trace:

Apr 24 12:04:19 kea2 kernel: [935612.760756] INFO: task nfsd:2786 blocked for more than 120 seconds.
Apr 24 12:04:19 kea2 kernel: [935612.767120]       Tainted: P           O    4.9.0-4-amd64 #1 Debian 4.9.51-1
Apr 24 12:04:19 kea2 kernel: [935612.774270] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 24 12:04:19 kea2 kernel: [935612.782200] nfsd            D    0  2786      2 0x00000000
Apr 24 12:04:19 kea2 kernel: [935612.782201]  ffff8f58dcff2400 0000000000000000 ffff8f17f0db2080 ffff8f18fecd8240
Apr 24 12:04:19 kea2 kernel: [935612.782202]  ffff8f17f1a58000 ffffb49d74bc39c8 ffffffff876038e3 ffff8f18c854f540
Apr 24 12:04:19 kea2 kernel: [935612.782203]  00ff8f1849d83000 ffff8f18fecd8240 ffffffffc123dd30 ffff8f17f0db2080
Apr 24 12:04:19 kea2 kernel: [935612.782204] Call Trace:
Apr 24 12:04:19 kea2 kernel: [935612.782206]  [<ffffffff876038e3>] ? __schedule+0x233/0x6d0
Apr 24 12:04:19 kea2 kernel: [935612.782227]  [<ffffffffc123dd30>] ? zfs_znode_hold_cache_constructor+0x20/0x50 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.782228]  [<ffffffff87603db2>] ? schedule+0x32/0x80
Apr 24 12:04:19 kea2 kernel: [935612.782229]  [<ffffffff8760404a>] ? schedule_preempt_disabled+0xa/0x10
Apr 24 12:04:19 kea2 kernel: [935612.782230]  [<ffffffff87605a94>] ? __mutex_lock_slowpath+0xb4/0x130
Apr 24 12:04:19 kea2 kernel: [935612.782231]  [<ffffffff87605b2b>] ? mutex_lock+0x1b/0x30
Apr 24 12:04:19 kea2 kernel: [935612.782252]  [<ffffffffc123e17d>] ? zfs_znode_hold_enter+0x11d/0x170 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.782255]  [<ffffffff875044bd>] ? dev_hard_start_xmit+0x9d/0x1d0
Apr 24 12:04:19 kea2 kernel: [935612.782279]  [<ffffffffc1241e1b>] ? zfs_zget+0x10b/0x240 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.782304]  [<ffffffffc12314a0>] ? zfs_vget+0x2a0/0x490 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.782331]  [<ffffffffc1251dc9>] ? zpl_fh_to_dentry+0x79/0xc0 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.782335]  [<ffffffff8728915e>] ? exportfs_decode_fh+0x5e/0x310
Apr 24 12:04:19 kea2 kernel: [935612.782341]  [<ffffffffc0834d30>] ? nfsd_proc_create+0x5e0/0x5e0 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.782345]  [<ffffffffc083a4aa>] ? exp_find_key+0x7a/0xc0 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.782347]  [<ffffffff870abef1>] ? update_curr+0xe1/0x160
Apr 24 12:04:19 kea2 kernel: [935612.782349]  [<ffffffff871e0f6c>] ? kmem_cache_alloc+0x11c/0x520
Apr 24 12:04:19 kea2 kernel: [935612.782353]  [<ffffffffc083a671>] ? exp_find+0x61/0xd0 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.782359]  [<ffffffffc083598d>] ? fh_verify+0x2cd/0x650 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.782369]  [<ffffffffc06c5780>] ? unix_gid_show+0xe0/0xe0 [sunrpc]
Apr 24 12:04:19 kea2 kernel: [935612.782377]  [<ffffffffc083eed8>] ? nfsd3_proc_getattr+0x78/0x100 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.782380]  [<ffffffffc083227d>] ? nfsd_dispatch+0xbd/0x250 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.782386]  [<ffffffffc06c106a>] ? svc_process_common+0x45a/0x690 [sunrpc]
Apr 24 12:04:19 kea2 kernel: [935612.782389]  [<ffffffffc0831c30>] ? nfsd_destroy+0x60/0x60 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.782394]  [<ffffffffc06c2186>] ? svc_process+0xf6/0x1a0 [sunrpc]
Apr 24 12:04:19 kea2 kernel: [935612.782398]  [<ffffffffc0831d19>] ? nfsd+0xe9/0x150 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.782400]  [<ffffffff87096697>] ? kthread+0xd7/0xf0
Apr 24 12:04:19 kea2 kernel: [935612.782401]  [<ffffffff870965c0>] ? kthread_park+0x60/0x60
Apr 24 12:04:19 kea2 kernel: [935612.782402]  [<ffffffff87608835>] ? ret_from_fork+0x25/0x30
Apr 24 12:04:19 kea2 kernel: [935612.782404] INFO: task nfsd:2787 blocked for more than 120 seconds.
Apr 24 12:04:19 kea2 kernel: [935612.788770]       Tainted: P           O    4.9.0-4-amd64 #1 Debian 4.9.51-1
Apr 24 12:04:19 kea2 kernel: [935612.795918] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 24 12:04:19 kea2 kernel: [935612.803851] nfsd            D    0  2787      2 0x00000000
Apr 24 12:04:19 kea2 kernel: [935612.803853]  ffff8f18e0552c00 0000000000000000 ffff8f17f145a0c0 ffff8f58fd058240
Apr 24 12:04:19 kea2 kernel: [935612.803854]  ffff8f58e4168080 ffffb49d74bd3910 ffffffff876038e3 ffffffff875f2fb1
Apr 24 12:04:19 kea2 kernel: [935612.803855]  00ff8f58cb460000 ffff8f58fd058240 ffff8f36f35d1600 ffff8f17f145a0c0
Apr 24 12:04:19 kea2 kernel: [935612.803856] Call Trace:
Apr 24 12:04:19 kea2 kernel: [935612.803858]  [<ffffffff876038e3>] ? __schedule+0x233/0x6d0
Apr 24 12:04:19 kea2 kernel: [935612.803859]  [<ffffffff875f2fb1>] ? packet_rcv+0x41/0x420
Apr 24 12:04:19 kea2 kernel: [935612.803860]  [<ffffffff87603db2>] ? schedule+0x32/0x80
Apr 24 12:04:19 kea2 kernel: [935612.803861]  [<ffffffff87606700>] ? rwsem_down_read_failed+0xf0/0x150
Apr 24 12:04:19 kea2 kernel: [935612.803864]  [<ffffffff873380d4>] ? call_rwsem_down_read_failed+0x14/0x30
Apr 24 12:04:19 kea2 kernel: [935612.803865]  [<ffffffff87605fbc>] ? down_read+0x1c/0x30
Apr 24 12:04:19 kea2 kernel: [935612.803885]  [<ffffffffc11b2502>] ? dnode_hold_impl+0x72/0xc50 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.803887]  [<ffffffffc07386ab>] ? avl_add+0x4b/0x80 [zavl]
Apr 24 12:04:19 kea2 kernel: [935612.803888]  [<ffffffff87605b1e>] ? mutex_lock+0xe/0x30
Apr 24 12:04:19 kea2 kernel: [935612.803910]  [<ffffffffc123e17d>] ? zfs_znode_hold_enter+0x11d/0x170 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.803929]  [<ffffffffc119cf0f>] ? dmu_bonus_hold+0x2f/0x1b0 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.803954]  [<ffffffffc1241e31>] ? zfs_zget+0x121/0x240 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.803979]  [<ffffffffc12314a0>] ? zfs_vget+0x2a0/0x490 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.804006]  [<ffffffffc1251dc9>] ? zpl_fh_to_dentry+0x79/0xc0 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.804012]  [<ffffffff8728915e>] ? exportfs_decode_fh+0x5e/0x310
Apr 24 12:04:19 kea2 kernel: [935612.804019]  [<ffffffffc0834d30>] ? nfsd_proc_create+0x5e0/0x5e0 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.804023]  [<ffffffffc083a4aa>] ? exp_find_key+0x7a/0xc0 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.804024]  [<ffffffff87550af4>] ? tcp_recvmsg+0x4a4/0xac0
Apr 24 12:04:19 kea2 kernel: [935612.804025]  [<ffffffff871e0f6c>] ? kmem_cache_alloc+0x11c/0x520
Apr 24 12:04:19 kea2 kernel: [935612.804029]  [<ffffffffc083a671>] ? exp_find+0x61/0xd0 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.804032]  [<ffffffffc083598d>] ? fh_verify+0x2cd/0x650 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.804045]  [<ffffffffc06c5780>] ? unix_gid_show+0xe0/0xe0 [sunrpc]
Apr 24 12:04:19 kea2 kernel: [935612.804052]  [<ffffffffc083eed8>] ? nfsd3_proc_getattr+0x78/0x100 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.804059]  [<ffffffffc083227d>] ? nfsd_dispatch+0xbd/0x250 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.804065]  [<ffffffffc06c106a>] ? svc_process_common+0x45a/0x690 [sunrpc]
Apr 24 12:04:19 kea2 kernel: [935612.804068]  [<ffffffffc0831c30>] ? nfsd_destroy+0x60/0x60 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.804073]  [<ffffffffc06c2186>] ? svc_process+0xf6/0x1a0 [sunrpc]
Apr 24 12:04:19 kea2 kernel: [935612.804076]  [<ffffffffc0831d19>] ? nfsd+0xe9/0x150 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.804078]  [<ffffffff87096697>] ? kthread+0xd7/0xf0
Apr 24 12:04:19 kea2 kernel: [935612.804080]  [<ffffffff870965c0>] ? kthread_park+0x60/0x60
Apr 24 12:04:19 kea2 kernel: [935612.804081]  [<ffffffff87608835>] ? ret_from_fork+0x25/0x30
Apr 24 12:04:19 kea2 kernel: [935612.804083] INFO: task nfsd:2788 blocked for more than 120 seconds.
Apr 24 12:04:19 kea2 kernel: [935612.810448]       Tainted: P           O    4.9.0-4-amd64 #1 Debian 4.9.51-1
Apr 24 12:04:19 kea2 kernel: [935612.817597] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 24 12:04:19 kea2 kernel: [935612.825525] nfsd            D    0  2788      2 0x00000000
Apr 24 12:04:19 kea2 kernel: [935612.825527]  ffff8f58dcff2400 0000000000000000 ffff8f17f0f30100 ffff8f18ff398240
Apr 24 12:04:19 kea2 kernel: [935612.825529]  ffff8f18e5262000 ffffb49d74bdb9c8 ffffffff876038e3 ffff8f18c854f540
Apr 24 12:04:19 kea2 kernel: [935612.825535]  00ff8f1849d83000 ffff8f18ff398240 ffffffffc123dd30 ffff8f17f0f30100
Apr 24 12:04:19 kea2 kernel: [935612.825536] Call Trace:
Apr 24 12:04:19 kea2 kernel: [935612.825538]  [<ffffffff876038e3>] ? __schedule+0x233/0x6d0
Apr 24 12:04:19 kea2 kernel: [935612.825563]  [<ffffffffc123dd30>] ? zfs_znode_hold_cache_constructor+0x20/0x50 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.825568]  [<ffffffff87603db2>] ? schedule+0x32/0x80
Apr 24 12:04:19 kea2 kernel: [935612.825572]  [<ffffffff8760404a>] ? schedule_preempt_disabled+0xa/0x10
Apr 24 12:04:19 kea2 kernel: [935612.825575]  [<ffffffff87605a94>] ? __mutex_lock_slowpath+0xb4/0x130
Apr 24 12:04:19 kea2 kernel: [935612.825577]  [<ffffffff87605b2b>] ? mutex_lock+0x1b/0x30
Apr 24 12:04:19 kea2 kernel: [935612.825598]  [<ffffffffc123e17d>] ? zfs_znode_hold_enter+0x11d/0x170 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.825606]  [<ffffffff875479d3>] ? ip_finish_output2+0x183/0x370
Apr 24 12:04:19 kea2 kernel: [935612.825627]  [<ffffffffc1241e1b>] ? zfs_zget+0x10b/0x240 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.825632]  [<ffffffff87546010>] ? ip_forward_options+0x1b0/0x1b0
Apr 24 12:04:19 kea2 kernel: [935612.825633]  [<ffffffff87547f67>] ? ip_local_out+0x17/0x40
Apr 24 12:04:19 kea2 kernel: [935612.825654]  [<ffffffffc12314a0>] ? zfs_vget+0x2a0/0x490 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.825676]  [<ffffffffc1251dc9>] ? zpl_fh_to_dentry+0x79/0xc0 [zfs]
Apr 24 12:04:19 kea2 kernel: [935612.825677]  [<ffffffff8728915e>] ? exportfs_decode_fh+0x5e/0x310
Apr 24 12:04:19 kea2 kernel: [935612.825681]  [<ffffffffc0834d30>] ? nfsd_proc_create+0x5e0/0x5e0 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.825685]  [<ffffffffc083a4aa>] ? exp_find_key+0x7a/0xc0 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.825686]  [<ffffffff871e0f6c>] ? kmem_cache_alloc+0x11c/0x520
Apr 24 12:04:19 kea2 kernel: [935612.825689]  [<ffffffffc083a671>] ? exp_find+0x61/0xd0 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.825692]  [<ffffffffc083598d>] ? fh_verify+0x2cd/0x650 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.825700]  [<ffffffffc06c5780>] ? unix_gid_show+0xe0/0xe0 [sunrpc]
Apr 24 12:04:19 kea2 kernel: [935612.825705]  [<ffffffffc083eed8>] ? nfsd3_proc_getattr+0x78/0x100 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.825709]  [<ffffffffc083227d>] ? nfsd_dispatch+0xbd/0x250 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.825715]  [<ffffffffc06c106a>] ? svc_process_common+0x45a/0x690 [sunrpc]
Apr 24 12:04:19 kea2 kernel: [935612.825717]  [<ffffffffc0831c30>] ? nfsd_destroy+0x60/0x60 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.825723]  [<ffffffffc06c2186>] ? svc_process+0xf6/0x1a0 [sunrpc]
Apr 24 12:04:19 kea2 kernel: [935612.825726]  [<ffffffffc0831d19>] ? nfsd+0xe9/0x150 [nfsd]
Apr 24 12:04:19 kea2 kernel: [935612.825728]  [<ffffffff87096697>] ? kthread+0xd7/0xf0
Apr 24 12:04:19 kea2 kernel: [935612.825729]  [<ffffffff870965c0>] ? kthread_park+0x60/0x60
Apr 24 12:04:19 kea2 kernel: [935612.825731]  [<ffffffff87608835>] ? ret_from_fork+0x25/0x30

I tried getting a stack trace before failing over the zpool to another server:

root@kea2:~# echo l > /proc/sysrq-trigger
Apr 24 12:52:17 kea2 kernel: [938490.982312] sysrq: SysRq : Show backtrace of all active CPUs
Apr 24 12:52:17 kea2 kernel: [938490.988127] NMI backtrace for cpu 70
Apr 24 12:52:17 kea2 kernel: [938490.988129] CPU: 70 PID: 76016 Comm: bash Tainted: P           O    4.9.0-4-amd64 #1 Debian 4.9.51-1
Apr 24 12:52:17 kea2 kernel: [938490.988129] Hardware name: Supermicro SYS-6029U-TRT/X11DPU, BIOS 2.0b 02/09/2018
Apr 24 12:52:17 kea2 kernel: [938490.988130]  0000000000000000 ffffffff87329974 0000000000000000 0000000000000046
Apr 24 12:52:17 kea2 kernel: [938490.988132]  ffffffff8732dd79 ffffffff8704c5d0 0000000000000046 0000000000000004
Apr 24 12:52:17 kea2 kernel: [938490.988133]  0000000000000000 ffffffff8732de74 000000000000006c ffffffff87cb99e0
Apr 24 12:52:17 kea2 kernel: [938490.988134] Call Trace:
Apr 24 12:52:17 kea2 kernel: [938490.988139]  [<ffffffff87329974>] ? dump_stack+0x5c/0x78
Apr 24 12:52:17 kea2 kernel: [938490.988140]  [<ffffffff8732dd79>] ? nmi_cpu_backtrace+0x89/0x90
Apr 24 12:52:17 kea2 kernel: [938490.988144]  [<ffffffff8704c5d0>] ? irq_force_complete_move+0x140/0x140
Apr 24 12:52:17 kea2 kernel: [938490.988145]  [<ffffffff8732de74>] ? nmi_trigger_cpumask_backtrace+0xf4/0x120
Apr 24 12:52:17 kea2 kernel: [938490.988147]  [<ffffffff8741fc31>] ? __handle_sysrq+0xf1/0x140
Apr 24 12:52:17 kea2 kernel: [938490.988148]  [<ffffffff8742006b>] ? write_sysrq_trigger+0x2b/0x30
Apr 24 12:52:17 kea2 kernel: [938490.988150]  [<ffffffff872716dd>] ? proc_reg_write+0x3d/0x60
Apr 24 12:52:17 kea2 kernel: [938490.988152]  [<ffffffff87202cb0>] ? vfs_write+0xb0/0x190
Apr 24 12:52:17 kea2 kernel: [938490.988153]  [<ffffffff872040a2>] ? SyS_write+0x52/0xc0
Apr 24 12:52:17 kea2 kernel: [938490.988157]  [<ffffffff876085bb>] ? system_call_fast_compare_end+0xc/0x9b
Apr 24 12:52:17 kea2 kernel: [938490.988158] Sending NMI from CPU 70 to CPUs 0-69,71:
Apr 24 12:52:17 kea2 kernel: [938490.988220] NMI backtrace for cpu 0 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.988319] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.988403] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.988500] NMI backtrace for cpu 3 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.988598] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.988697] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.988762] NMI backtrace for cpu 6 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.988819] NMI backtrace for cpu 7 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.988862] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.988997] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989060] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989160] NMI backtrace for cpu 11 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989260] NMI backtrace for cpu 12 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989367] NMI backtrace for cpu 13
Apr 24 12:52:17 kea2 kernel: [938490.989368] CPU: 13 PID: 983 Comm: systemd-journal Tainted: P           O    4.9.0-4-amd64 #1 Debian 4.9.51-1
Apr 24 12:52:17 kea2 kernel: [938490.989369] Hardware name: Supermicro SYS-6029U-TRT/X11DPU, BIOS 2.0b 02/09/2018
Apr 24 12:52:17 kea2 kernel: [938490.989369] task: ffff8f58dc28e3c0 task.stack: ffffb49d73f28000
Apr 24 12:52:17 kea2 kernel: [938490.989369] RIP: 0033:[<00007f7aaf103a80>]  [<00007f7aaf103a80>] 0x7f7aaf103a80
Apr 24 12:52:17 kea2 kernel: [938490.989370] RSP: 002b:00007ffed6a89850  EFLAGS: 00000246
Apr 24 12:52:17 kea2 kernel: [938490.989370] RAX: 000055a7e192e010 RBX: 0000000001b351a0 RCX: 0000000001b351a0
Apr 24 12:52:17 kea2 kernel: [938490.989371] RDX: 0000000000000000 RSI: 000000000000001b RDI: 000055a7e1917440
Apr 24 12:52:17 kea2 kernel: [938490.989371] RBP: 000055a7e1913040 R08: 0000000001b351b0 R09: 000055a7e192e010
Apr 24 12:52:17 kea2 kernel: [938490.989371] R10: e7a10db12f0f5fb9 R11: b64aa90045d068c3 R12: 0000000000000006
Apr 24 12:52:17 kea2 kernel: [938490.989371] R13: 0000000000000003 R14: 000000000000001b R15: 000055a7e1917440
Apr 24 12:52:17 kea2 kernel: [938490.989372] FS:  00007f7aaf4c18c0(0000) GS:ffff8f18fef40000(0000) knlGS:0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.989372] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 24 12:52:17 kea2 kernel: [938490.989372] CR2: 0000000001bef000 CR3: 0000003f1715e000 CR4: 00000000007406e0
Apr 24 12:52:17 kea2 kernel: [938490.989372] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.989373] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 24 12:52:17 kea2 kernel: [938490.989373] PKRU: 55555554
Apr 24 12:52:17 kea2 kernel: [938490.989459] NMI backtrace for cpu 14 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989558] NMI backtrace for cpu 15 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989625] NMI backtrace for cpu 16 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989745] NMI backtrace for cpu 17 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989792] NMI backtrace for cpu 18 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989796] NMI backtrace for cpu 19 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989901] NMI backtrace for cpu 20 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.989993] NMI backtrace for cpu 21 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990092] NMI backtrace for cpu 22 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990155] NMI backtrace for cpu 23 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990288] NMI backtrace for cpu 24 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990388] NMI backtrace for cpu 25 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990487] NMI backtrace for cpu 26 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990587] NMI backtrace for cpu 27 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990686] NMI backtrace for cpu 28 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990786] NMI backtrace for cpu 29 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990888] NMI backtrace for cpu 30 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.990949] NMI backtrace for cpu 31 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991113] NMI backtrace for cpu 32 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991148] NMI backtrace for cpu 33 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991248] NMI backtrace for cpu 34 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991405] NMI backtrace for cpu 35 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991448] NMI backtrace for cpu 36 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991557] NMI backtrace for cpu 37
Apr 24 12:52:17 kea2 kernel: [938490.991558] CPU: 37 PID: 983 Comm: systemd-journal Tainted: P           O    4.9.0-4-amd64 #1 Debian 4.9.51-1
Apr 24 12:52:17 kea2 kernel: [938490.991559] Hardware name: Supermicro SYS-6029U-TRT/X11DPU, BIOS 2.0b 02/09/2018
Apr 24 12:52:17 kea2 kernel: [938490.991559] task: ffff8f58dc28e3c0 task.stack: ffffb49d73f28000
Apr 24 12:52:17 kea2 kernel: [938490.991559] RIP: 0033:[<00007f7aae6f6bb0>]  [<00007f7aae6f6bb0>] 0x7f7aae6f6bb0
Apr 24 12:52:17 kea2 kernel: [938490.991560] RSP: 002b:00007ffed6a89880  EFLAGS: 00000246
Apr 24 12:52:17 kea2 kernel: [938490.991560] RAX: 00007ffed6a89ae0 RBX: 0000000000000000 RCX: 0000000000000010
Apr 24 12:52:17 kea2 kernel: [938490.991561] RDX: a1fcdcf819d7e1e5 RSI: 00007ffed6a89ad0 RDI: 00007ffed6a89990
Apr 24 12:52:17 kea2 kernel: [938490.991561] RBP: 0000000000000001 R08: 0000000000000010 R09: 00007ffed6a89990
Apr 24 12:52:17 kea2 kernel: [938490.991561] R10: 00007ffed6a89b90 R11: 00007f7aae82c0e0 R12: 00007ffed6a89ab0
Apr 24 12:52:17 kea2 kernel: [938490.991561] R13: 00007f7aaf1ccd50 R14: 00007ffed6a89ae0 R15: 00007ffed6a89990
Apr 24 12:52:17 kea2 kernel: [938490.991562] FS:  00007f7aaf4c18c0(0000) GS:ffff8f18ff0c0000(0000) knlGS:0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.991562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 24 12:52:17 kea2 kernel: [938490.991562] CR2: 00007fc50be595d8 CR3: 0000003f1715e000 CR4: 00000000007406e0
Apr 24 12:52:17 kea2 kernel: [938490.991562] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.991563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 24 12:52:17 kea2 kernel: [938490.991563] PKRU: 55555554
Apr 24 12:52:17 kea2 kernel: [938490.991680] NMI backtrace for cpu 39 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991700] NMI backtrace for cpu 38
Apr 24 12:52:17 kea2 kernel: [938490.991701] CPU: 38 PID: 1696 Comm: rs:main Q:Reg Tainted: P           O    4.9.0-4-amd64 #1 Debian 4.9.51-1
Apr 24 12:52:17 kea2 kernel: [938490.991702] Hardware name: Supermicro SYS-6029U-TRT/X11DPU, BIOS 2.0b 02/09/2018
Apr 24 12:52:17 kea2 kernel: [938490.991702] task: ffff8f18df980900 task.stack: ffffb49d763f8000
Apr 24 12:52:17 kea2 kernel: [938490.991702] RIP: 0010:[<ffffffff8707c688>]  [<ffffffff8707c688>] __local_bh_enable_ip+0x18/0x80
Apr 24 12:52:17 kea2 kernel: [938490.991703] RSP: 0018:ffffb49d763fba80  EFLAGS: 00000246
Apr 24 12:52:17 kea2 kernel: [938490.991703] RAX: 0000000000000246 RBX: 0000000000000200 RCX: 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.991703] RDX: 000000000000008d RSI: 0000000000000200 RDI: ffffffffc0686e48
Apr 24 12:52:17 kea2 kernel: [938490.991704] RBP: ffff8f53ad50c1e0 R08: 0000000000000079 R09: 0000000000000014
Apr 24 12:52:17 kea2 kernel: [938490.991704] R10: 0000000000000024 R11: ffff8f58e088f000 R12: ffff8f13f9f00a00
Apr 24 12:52:17 kea2 kernel: [938490.991704] R13: ffff8f08c667fe10 R14: ffff8f58e088f000 R15: ffffffffc068a5d8
Apr 24 12:52:17 kea2 kernel: [938490.991704] FS:  00007f24f7fff700(0000) GS:ffff8f18ff100000(0000) knlGS:0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.991705] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 24 12:52:17 kea2 kernel: [938490.991705] CR2: 00007fdecc4cc160 CR3: 0000003f1fd9f000 CR4: 00000000007406e0
Apr 24 12:52:17 kea2 kernel: [938490.991705] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.991706] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 24 12:52:17 kea2 kernel: [938490.991706] PKRU: 55555554
Apr 24 12:52:17 kea2 kernel: [938490.991706] Stack:
Apr 24 12:52:17 kea2 kernel: [938490.991706]  ffff8f53ad50c170 ffffffffc0686e64 aee1f0cd62072fd1 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.991707]  0000000000000000 ffffb49d763fbb50 ffffb49d763fbc04 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.991707]  00000001a82997c0 ffff8f53ad50c000 ffff8f53ad50c040 ffff8edc9a12b620
Apr 24 12:52:17 kea2 kernel: [938490.991707] Call Trace:
Apr 24 12:52:17 kea2 kernel: [938490.991708]  [<ffffffffc0686e64>] ? ipt_do_table+0x334/0x620 [ip_tables]
Apr 24 12:52:17 kea2 kernel: [938490.991708]  [<ffffffff8733ef43>] ? copy_from_iter+0x93/0x370
Apr 24 12:52:17 kea2 kernel: [938490.991710]  [<ffffffff8753ba1e>] ? nf_iterate+0x4e/0x60
Apr 24 12:52:17 kea2 kernel: [938490.991713]  [<ffffffff8753ba9d>] ? nf_hook_slow+0x6d/0xc0
Apr 24 12:52:17 kea2 kernel: [938490.991716]  [<ffffffff87547ee3>] ? __ip_local_out+0xe3/0x150
Apr 24 12:52:17 kea2 kernel: [938490.991718]  [<ffffffff87546010>] ? ip_forward_options+0x1b0/0x1b0
Apr 24 12:52:17 kea2 kernel: [938490.991719]  [<ffffffff87547f67>] ? ip_local_out+0x17/0x40
Apr 24 12:52:17 kea2 kernel: [938490.991721]  [<ffffffff87549155>] ? ip_send_skb+0x15/0x40
Apr 24 12:52:17 kea2 kernel: [938490.991723]  [<ffffffff87570a6a>] ? udp_send_skb+0xaa/0x260
Apr 24 12:52:17 kea2 kernel: [938490.991725]  [<ffffffff87572063>] ? udp_sendmsg+0x2e3/0x8f0
Apr 24 12:52:17 kea2 kernel: [938490.991727]  [<ffffffff8717e6da>] ? __generic_file_write_iter+0x16a/0x1b0
Apr 24 12:52:17 kea2 kernel: [938490.991731]  [<ffffffff874e4880>] ? sock_sendmsg+0x30/0x40
Apr 24 12:52:17 kea2 kernel: [938490.991734]  [<ffffffff874e4e03>] ? SYSC_sendto+0xd3/0x150
Apr 24 12:52:17 kea2 kernel: [938490.991734]  [<ffffffff87202d54>] ? vfs_write+0x154/0x190
Apr 24 12:52:17 kea2 kernel: [938490.991734]  [<ffffffff872040c6>] ? SyS_write+0x76/0xc0
Apr 24 12:52:17 kea2 kernel: [938490.991735]  [<ffffffff876085bb>] ? system_call_fast_compare_end+0xc/0x9b
Apr 24 12:52:17 kea2 kernel: [938490.991735] Code: 48 89 df 57 9d 0f 1f 44 00 00 5b c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 53 89 f3 65 8b 05 91 0c f9 78 a9 00 00 0f 00 75 2f 9c 58 <0f> 1f 44 00 00 f6 c4 02 74 23 83 eb 01 f7 db 65 01 1d 72 0c f9
Apr 24 12:52:17 kea2 kernel: [938490.991795] NMI backtrace for cpu 40 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991803] NMI backtrace for cpu 41 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991844] NMI backtrace for cpu 42 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991891] NMI backtrace for cpu 43 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.991941] NMI backtrace for cpu 44 skipped: idling at pc 0xffffffff8760814a
Apr 24 12:52:17 kea2 kernel: [938490.991979] NMI backtrace for cpu 45 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992042] NMI backtrace for cpu 47 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992093] NMI backtrace for cpu 46 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992177] NMI backtrace for cpu 49 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992179] NMI backtrace for cpu 48 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992241] NMI backtrace for cpu 50 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992288] NMI backtrace for cpu 51 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992377] NMI backtrace for cpu 53 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992379] NMI backtrace for cpu 52 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992454] NMI backtrace for cpu 55 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992490] NMI backtrace for cpu 54 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992552] NMI backtrace for cpu 56
Apr 24 12:52:17 kea2 kernel: [938490.992553] CPU: 56 PID: 1696 Comm: rs:main Q:Reg Tainted: P           O    4.9.0-4-amd64 #1 Debian 4.9.51-1
Apr 24 12:52:17 kea2 kernel: [938490.992553] Hardware name: Supermicro SYS-6029U-TRT/X11DPU, BIOS 2.0b 02/09/2018
Apr 24 12:52:17 kea2 kernel: [938490.992553] task: ffff8f18df980900 task.stack: ffffb49d763f8000
Apr 24 12:52:17 kea2 kernel: [938490.992554] RIP: 0033:[<00007f2510698162>]  [<00007f2510698162>] 0x7f2510698162
Apr 24 12:52:17 kea2 kernel: [938490.992554] RSP: 002b:00007f24f7ffec70  EFLAGS: 00000246
Apr 24 12:52:17 kea2 kernel: [938490.992554] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f251069815f
Apr 24 12:52:17 kea2 kernel: [938490.992555] RDX: 0000000000165f11 RSI: 0000000000000080 RDI: 0000000000000001
Apr 24 12:52:17 kea2 kernel: [938490.992555] RBP: 00005602e0005858 R08: 00005602e0005600 R09: 00000000000b2f88
Apr 24 12:52:17 kea2 kernel: [938490.992555] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f24f7ffecd0
Apr 24 12:52:17 kea2 kernel: [938490.992555] R13: 0000000000000000 R14: 00005602dfb9c290 R15: 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.992556] FS:  00007f24f7fff700(0000) GS:ffff8f58fd500000(0000) knlGS:0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.992556] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 24 12:52:17 kea2 kernel: [938490.992556] CR2: 00007f9b1e74aab4 CR3: 0000003f1fd9f000 CR4: 00000000007406e0
Apr 24 12:52:17 kea2 kernel: [938490.992557] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.992557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 24 12:52:17 kea2 kernel: [938490.992557] PKRU: 55555554
Apr 24 12:52:17 kea2 kernel: [938490.992655] NMI backtrace for cpu 57 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992774] NMI backtrace for cpu 58 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992840] NMI backtrace for cpu 59 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.992974] NMI backtrace for cpu 60 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.993073] NMI backtrace for cpu 61 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.993139] NMI backtrace for cpu 62 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.993283] NMI backtrace for cpu 63 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.993373] NMI backtrace for cpu 64
Apr 24 12:52:17 kea2 kernel: [938490.993374] CPU: 64 PID: 983 Comm: systemd-journal Tainted: P           O    4.9.0-4-amd64 #1 Debian 4.9.51-1
Apr 24 12:52:17 kea2 kernel: [938490.993374] Hardware name: Supermicro SYS-6029U-TRT/X11DPU, BIOS 2.0b 02/09/2018
Apr 24 12:52:17 kea2 kernel: [938490.993375] task: ffff8f58dc28e3c0 task.stack: ffffb49d73f28000
Apr 24 12:52:17 kea2 kernel: [938490.993375] RIP: 0010:[<ffffffff870cebba>]  [<ffffffff870cebba>] devkmsg_poll+0x6a/0x80
Apr 24 12:52:17 kea2 kernel: [938490.993375] RSP: 0018:ffffb49d73f2bd88  EFLAGS: 00000006
Apr 24 12:52:17 kea2 kernel: [938490.993376] RAX: 00000000000013a3 RBX: ffff8f18d713c000 RCX: 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.993376] RDX: 0000000000000001 RSI: ffffb49d73f2bda8 RDI: ffffffff87eab428
Apr 24 12:52:17 kea2 kernel: [938490.993377] RBP: 0000000000000000 R08: ffff8f18e31c3218 R09: 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.993377] R10: 00000000000b6529 R11: 0000000000000cf5 R12: ffffb49d73f2be98
Apr 24 12:52:17 kea2 kernel: [938490.993377] R13: ffff8f18e3217d80 R14: 0000000000000000 R15: ffff8f18e31c3218
Apr 24 12:52:17 kea2 kernel: [938490.993377] FS:  00007f7aaf4c18c0(0000) GS:ffff8f58fd700000(0000) knlGS:0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.993378] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 24 12:52:17 kea2 kernel: [938490.993378] CR2: 00007f1d3af732c0 CR3: 0000003f1715e000 CR4: 00000000007406e0
Apr 24 12:52:17 kea2 kernel: [938490.993378] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 24 12:52:17 kea2 kernel: [938490.993379] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 24 12:52:17 kea2 kernel: [938490.993379] PKRU: 55555554
Apr 24 12:52:17 kea2 kernel: [938490.993379] Stack:
Apr 24 12:52:17 kea2 kernel: [938490.993379]  ffffb49d73f2be08 ffffffff8724ad23 00007ffed6a8c6b0 ffff8f18e3217de0
Apr 24 12:52:17 kea2 kernel: [938490.993380]  0000000000000000 0000000000000019 888105b9ca53147e ffffffff8724aca0
Apr 24 12:52:17 kea2 kernel: [938490.993380]  ffff8f18e3217de0 ffffb49d73f2be08 0000000000000000 ffffb49d73f2be98
Apr 24 12:52:17 kea2 kernel: [938490.993380] Call Trace:
Apr 24 12:52:17 kea2 kernel: [938490.993380]  [<ffffffff8724ad23>] ? ep_send_events_proc+0x83/0x190
Apr 24 12:52:17 kea2 kernel: [938490.993381]  [<ffffffff8724aca0>] ? ep_ptable_queue_proc+0x90/0x90
Apr 24 12:52:17 kea2 kernel: [938490.993381]  [<ffffffff8724b17e>] ? ep_scan_ready_list.constprop.12+0x9e/0x210
Apr 24 12:52:17 kea2 kernel: [938490.993381]  [<ffffffff8724b482>] ? ep_poll+0x192/0x350
Apr 24 12:52:17 kea2 kernel: [938490.993381]  [<ffffffff870a18c0>] ? wake_up_q+0x70/0x70
Apr 24 12:52:17 kea2 kernel: [938490.993382]  [<ffffffff8724ce11>] ? SyS_epoll_wait+0xc1/0xe0
Apr 24 12:52:17 kea2 kernel: [938490.993382]  [<ffffffff87003b1c>] ? do_syscall_64+0x7c/0xf0
Apr 24 12:52:17 kea2 kernel: [938490.993382]  [<ffffffff8760866f>] ? entry_SYSCALL64_slow_path+0x25/0x25
Apr 24 12:52:17 kea2 kernel: [938490.993383] Code: 31 c9 48 3b 05 a8 f5 df 00 72 19 48 c7 c7 28 b4 ea 87 c6 07 00 0f 1f 40 00 fb 66 0f 1f 44 00 00 89 c8 5b c3 48 3b 05 96 f5 df 00 <19> c9 83 e1 0a 83 c1 41 eb d6 66 90 66 2e 0f 1f 84 00 00 00 00
Apr 24 12:52:17 kea2 kernel: [938490.993470] NMI backtrace for cpu 65 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.993571] NMI backtrace for cpu 66 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.993686] NMI backtrace for cpu 67 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.993770] NMI backtrace for cpu 68 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.993885] NMI backtrace for cpu 69 skipped: idling at pc 0xffffffff87608045
Apr 24 12:52:17 kea2 kernel: [938490.993970] NMI backtrace for cpu 71 skipped: idling at pc 0xffffffff87608045
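Note that 'echo l' only back-traces CPUs that are actively running, so the blocked nfsd threads do not appear above. Dumping uninterruptible (D-state) tasks instead would likely be more informative (assuming sysrq is enabled):

# dump the stacks of all blocked (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200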

Stuck 'zfs get all':

NAME              PROPERTY              VALUE                                        SOURCE
zpool             type                  filesystem                                   -
zpool             creation              Fri Mar  2 16:12 2018                        -
zpool             used                  5.49T                                        -
zpool             available             41.7T                                        -
zpool             referenced            236K                                         -
zpool             compressratio         1.85x                                        -
zpool             mounted               yes                                          -
zpool             quota                 none                                         default
zpool             reservation           none                                         default
zpool             recordsize            128K                                         default
zpool             mountpoint            /zpool                                       default
zpool             sharenfs              off                                          default
zpool             checksum              on                                           default
zpool             compression           off                                          default
zpool             atime                 on                                           default
zpool             devices               on                                           default
zpool             exec                  on                                           default
zpool             setuid                on                                           default
zpool             readonly              off                                          default
zpool             zoned                 off                                          default
zpool             snapdir               hidden                                       default
zpool             aclinherit            restricted                                   default
zpool             createtxg             1                                            -
zpool             canmount              on                                           default
zpool             xattr                 on                                           default
zpool             copies                1                                            default
zpool             version               5                                            -
zpool             utf8only              off                                          -
zpool             normalization         none                                         -
zpool             casesensitivity       sensitive                                    -
zpool             vscan                 off                                          default
zpool             nbmand                off                                          default
zpool             sharesmb              off                                          default
zpool             refquota              none                                         default
zpool             refreservation        none                                         default
zpool             guid                  6671384642039161331                          -
zpool             primarycache          all                                          default
zpool             secondarycache        all                                          default
zpool             usedbysnapshots       0B                                           -
zpool             usedbydataset         236K                                         -
zpool             usedbychildren        5.49T                                        -
zpool             usedbyrefreservation  0B                                           -
zpool             logbias               latency                                      default
zpool             dedup                 off                                          default
zpool             mlslabel              none                                         default
zpool             sync                  standard                                     default
zpool             dnodesize             legacy                                       default
zpool             refcompressratio      1.00x                                        -
zpool             written               236K                                         -
zpool             logicalused           10.3T                                        -
zpool             logicalreferenced     40K                                          -
zpool             volmode               default                                      default
zpool             filesystem_limit      none                                         default
zpool             snapshot_limit        none                                         default
zpool             filesystem_count      none                                         default
zpool             snapshot_count        none                                         default
zpool             snapdev               hidden                                       default
zpool             acltype               off                                          default
zpool             context               none                                         default
zpool             fscontext             none                                         default
zpool             defcontext            none                                         default
zpool             rootcontext           none                                         default
zpool             relatime              off                                          default
zpool             redundant_metadata    all                                          default
zpool             overlay               off                                          default
zpool/buildcache  type                  filesystem                                   -
zpool/buildcache  creation              Fri Mar  2 16:14 2018                        -
zpool/buildcache  used                  5.46T                                        -
zpool/buildcache  available             41.7T                                        -
zpool/buildcache  referenced            5.46T                                        -
zpool/buildcache  compressratio         1.85x                                        -
zpool/buildcache  mounted               yes                                          -
zpool/buildcache  quota                 none                                         default
zpool/buildcache  reservation           none                                         default
zpool/buildcache  recordsize            512K                                         local
zpool/buildcache  mountpoint            /zpool/buildcache                            default
zpool/buildcache  sharenfs              rw=@10.0.0.0/255.0.0.0,async,no_root_squash  local
zpool/buildcache  checksum              skein                                        local
zpool/buildcache  compression           lz4                                          local
zpool/buildcache  atime                 off                                          local
zpool/buildcache  devices               on                                           default
zpool/buildcache  exec                  on                                           default
zpool/buildcache  setuid                on                                           default
zpool/buildcache  readonly              off                                          default
zpool/buildcache  zoned                 off                                          default
zpool/buildcache  snapdir               hidden                                       default
zpool/buildcache  aclinherit            restricted                                   default
zpool/buildcache  createtxg             25                                           -
zpool/buildcache  canmount              on                                           default
zpool/buildcache  xattr                 off                                          local
zpool/buildcache  copies                1                                            default
^C

Stracing the 'zfs get all' shows it is stuck in this ioctl:

root@kea2:~# strace -f zfs get all
...
write(1, "zpool/buildcache  copies        "..., 93zpool/buildcache  copies                1                                            default
) = 93
ioctl(3, _IOC(0, 0x5a, 0x13, 0x00)
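The type byte 0x5a is ASCII 'Z', the range used by /dev/zfs ioctls, so the command is blocked inside the ZFS module rather than in userspace. One way to see where it is blocked in the kernel (a sketch; <pid> stands for the PID of the stuck zfs process, and the stack file requires a kernel built with stack tracing):

# kernel-side stack of the stuck process
cat /proc/<pid>/stack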

Even after the disks are gone from the server, ZFS thinks the zpool is healthy:

root@kea2:~# zpool status
  pool: zpool
 state: ONLINE
  scan: scrub repaired 0B in 0h17m with 0 errors on Sun Apr  8 00:41:31 2018
config:

    NAME          STATE     READ WRITE CKSUM
    zpool         ONLINE       0     0     0
      raidz2-0    ONLINE       0     0     0
        nvme0n1   ONLINE       0     0     0
        nvme1n1   ONLINE       0     0     0
        nvme2n1   ONLINE       0     0     0
        nvme3n1   ONLINE       0     0     0
        nvme4n1   ONLINE       0     0     0
        nvme5n1   ONLINE       0     0     0
        nvme6n1   ONLINE       0     0     0
        nvme7n1   ONLINE       0     0     0
        nvme8n1   ONLINE       0     0     0
        nvme9n1   ONLINE       0     0     0
        nvme10n1  ONLINE       0     0     0
        nvme11n1  ONLINE       0     0     0
        nvme12n1  ONLINE       0     0     0
        nvme13n1  ONLINE       0     0     0
        nvme14n1  ONLINE       0     0     0
        nvme15n1  ONLINE       0     0     0

errors: No known data errors
root@kea2:~#

And here is a full 'zfs get all' from kea1 for comparison:

root@kea1:~# zfs get all
NAME              PROPERTY              VALUE                                        SOURCE
zpool             type                  filesystem                                   -
zpool             creation              Fri Mar  2 16:12 2018                        -
zpool             used                  5.58T                                        -
zpool             available             41.6T                                        -
zpool             referenced            236K                                         -
zpool             compressratio         1.86x                                        -
zpool             mounted               yes                                          -
zpool             quota                 none                                         default
zpool             reservation           none                                         default
zpool             recordsize            128K                                         default
zpool             mountpoint            /zpool                                       default
zpool             sharenfs              off                                          default
zpool             checksum              on                                           default
zpool             compression           off                                          default
zpool             atime                 on                                           default
zpool             devices               on                                           default
zpool             exec                  on                                           default
zpool             setuid                on                                           default
zpool             readonly              off                                          default
zpool             zoned                 off                                          default
zpool             snapdir               hidden                                       default
zpool             aclinherit            restricted                                   default
zpool             createtxg             1                                            -
zpool             canmount              on                                           default
zpool             xattr                 on                                           default
zpool             copies                1                                            default
zpool             version               5                                            -
zpool             utf8only              off                                          -
zpool             normalization         none                                         -
zpool             casesensitivity       sensitive                                    -
zpool             vscan                 off                                          default
zpool             nbmand                off                                          default
zpool             sharesmb              off                                          default
zpool             refquota              none                                         default
zpool             refreservation        none                                         default
zpool             guid                  6671384642039161331                          -
zpool             primarycache          all                                          default
zpool             secondarycache        all                                          default
zpool             usedbysnapshots       0B                                           -
zpool             usedbydataset         236K                                         -
zpool             usedbychildren        5.58T                                        -
zpool             usedbyrefreservation  0B                                           -
zpool             logbias               latency                                      default
zpool             dedup                 off                                          default
zpool             mlslabel              none                                         default
zpool             sync                  standard                                     default
zpool             dnodesize             legacy                                       default
zpool             refcompressratio      1.00x                                        -
zpool             written               236K                                         -
zpool             logicalused           10.5T                                        -
zpool             logicalreferenced     40K                                          -
zpool             volmode               default                                      default
zpool             filesystem_limit      none                                         default
zpool             snapshot_limit        none                                         default
zpool             filesystem_count      none                                         default
zpool             snapshot_count        none                                         default
zpool             snapdev               hidden                                       default
zpool             acltype               off                                          default
zpool             context               none                                         default
zpool             fscontext             none                                         default
zpool             defcontext            none                                         default
zpool             rootcontext           none                                         default
zpool             relatime              off                                          default
zpool             redundant_metadata    all                                          default
zpool             overlay               off                                          default
zpool/buildcache  type                  filesystem                                   -
zpool/buildcache  creation              Fri Mar  2 16:14 2018                        -
zpool/buildcache  used                  5.55T                                        -
zpool/buildcache  available             41.6T                                        -
zpool/buildcache  referenced            5.55T                                        -
zpool/buildcache  compressratio         1.86x                                        -
zpool/buildcache  mounted               yes                                          -
zpool/buildcache  quota                 none                                         default
zpool/buildcache  reservation           none                                         default
zpool/buildcache  recordsize            512K                                         local
zpool/buildcache  mountpoint            /zpool/buildcache                            default
zpool/buildcache  sharenfs              rw=@10.0.0.0/255.0.0.0,async,no_root_squash  local
zpool/buildcache  checksum              skein                                        local
zpool/buildcache  compression           lz4                                          local
zpool/buildcache  atime                 off                                          local
zpool/buildcache  devices               on                                           default
zpool/buildcache  exec                  on                                           default
zpool/buildcache  setuid                on                                           default
zpool/buildcache  readonly              off                                          default
zpool/buildcache  zoned                 off                                          default
zpool/buildcache  snapdir               hidden                                       default
zpool/buildcache  aclinherit            restricted                                   default
zpool/buildcache  createtxg             25                                           -
zpool/buildcache  canmount              on                                           default
zpool/buildcache  xattr                 off                                          local
zpool/buildcache  copies                1                                            default
zpool/buildcache  version               5                                            -
zpool/buildcache  utf8only              off                                          -
zpool/buildcache  normalization         none                                         -
zpool/buildcache  casesensitivity       sensitive                                    -
zpool/buildcache  vscan                 off                                          default
zpool/buildcache  nbmand                off                                          default
zpool/buildcache  sharesmb              off                                          default
zpool/buildcache  refquota              none                                         default
zpool/buildcache  refreservation        none                                         default
zpool/buildcache  guid                  581466235394039950                           -
zpool/buildcache  primarycache          all                                          default
zpool/buildcache  secondarycache        all                                          default
zpool/buildcache  usedbysnapshots       0B                                           -
zpool/buildcache  usedbydataset         5.55T                                        -
zpool/buildcache  usedbychildren        0B                                           -
zpool/buildcache  usedbyrefreservation  0B                                           -
zpool/buildcache  logbias               latency                                      default
zpool/buildcache  dedup                 skein                                        local
zpool/buildcache  mlslabel              none                                         default
zpool/buildcache  sync                  disabled                                     local
zpool/buildcache  dnodesize             legacy                                       default
zpool/buildcache  refcompressratio      1.86x                                        -
zpool/buildcache  written               5.55T                                        -
zpool/buildcache  logicalused           10.5T                                        -
zpool/buildcache  logicalreferenced     10.5T                                        -
zpool/buildcache  volmode               default                                      default
zpool/buildcache  filesystem_limit      none                                         default
zpool/buildcache  snapshot_limit        none                                         default
zpool/buildcache  filesystem_count      none                                         default
zpool/buildcache  snapshot_count        none                                         default
zpool/buildcache  snapdev               hidden                                       default
zpool/buildcache  acltype               off                                          default
zpool/buildcache  context               none                                         default
zpool/buildcache  fscontext             none                                         default
zpool/buildcache  defcontext            none                                         default
zpool/buildcache  rootcontext           none                                         default
zpool/buildcache  relatime              off                                          default
zpool/buildcache  redundant_metadata    all                                          default
zpool/buildcache  overlay               off                                          default
root@kea1:~#
gerardba commented 6 years ago

Attaching the dmesg output of 'echo t > /proc/sysrq-trigger', taken a few hours after the zpool's disks were removed; zpool status still reports everything as healthy. kea2.trace.gz
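For triaging hangs like this, a quick way to list just the uninterruptible tasks and the kernel function each one is waiting in (illustrative; not part of the original report):

# show D-state tasks with their wait channels
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'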

gerardba commented 6 years ago

After a few more occurrences and further investigation, I think this may be related to https://github.com/zfsonlinux/zfs/issues/7057, and I suspect it is triggered by a flaky cable between the server and the NVMe JBOF (single port).
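If a marginal link to the JBOF is the trigger, the kernel log and the drives' own error logs may show resets or transport errors shortly before ZFS hangs (a sketch; device names are illustrative and the nvme command comes from the nvme-cli package):

# look for controller resets or PCIe/NVMe transport errors
dmesg | grep -iE 'nvme|pcie bus error|aer'
# per-drive SMART and error-log counters
nvme smart-log /dev/nvme0
nvme error-log /dev/nvme0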

I see the following stuck for over 24 hours now:

crash> bt 20331
PID: 20331  TASK: ffffa10e5e6b81c0  CPU: 65  COMMAND: "z_wr_iss"
 #0 [ffffbd3ca42f7b58] __schedule at ffffffffa4683340
 #1 [ffffbd3ca42f7be8] schedule at ffffffffa46837a2
 #2 [ffffbd3ca42f7bf8] rwsem_down_write_failed at ffffffffa4686530
 #3 [ffffbd3ca42f7c98] call_rwsem_down_write_failed at ffffffffa4679723
 #4 [ffffbd3ca42f7cd8] down_write at ffffffffa4685d10
 #5 [ffffbd3ca42f7ce0] dbuf_write_ready at ffffffffc0acb1ff [zfs]
 #6 [ffffbd3ca42f7d18] arc_write_ready at ffffffffc0ac5efb [zfs]
 #7 [ffffbd3ca42f7d68] zio_ready at ffffffffc0b7c7a7 [zfs]
 #8 [ffffbd3ca42f7dc0] zio_execute at ffffffffc0b776e7 [zfs]
 #9 [ffffbd3ca42f7de8] taskq_thread at ffffffffc05f7209 [spl]
#10 [ffffbd3ca42f7f10] kthread at ffffffffa409394c
#11 [ffffbd3ca42f7f50] ret_from_fork at ffffffffa48001f5
crash> bt 20693
PID: 20693  TASK: ffffa14e3cf44900  CPU: 64  COMMAND: "txg_sync"
 #0 [ffffbd3ca56a3be0] __schedule at ffffffffa4683340
 #1 [ffffbd3ca56a3c70] schedule at ffffffffa46837a2
 #2 [ffffbd3ca56a3c80] io_schedule at ffffffffa409ee02
 #3 [ffffbd3ca56a3c90] cv_wait_common at ffffffffc05fa9dc [spl]
 #4 [ffffbd3ca56a3ce8] zio_wait at ffffffffc0b79e20 [zfs]
 #5 [ffffbd3ca56a3d18] dsl_pool_sync at ffffffffc0afc248 [zfs]
 #6 [ffffbd3ca56a3d80] spa_sync at ffffffffc0b18d37 [zfs]
 #7 [ffffbd3ca56a3e38] txg_sync_thread at ffffffffc0b2ac36 [zfs]
 #8 [ffffbd3ca56a3ee8] thread_generic_wrapper at ffffffffc05f61ff [spl]
 #9 [ffffbd3ca56a3f10] kthread at ffffffffa409394c
#10 [ffffbd3ca56a3f50] ret_from_fork at ffffffffa48001f5

I even removed the disks from this system and the zpool is now happily running on another server, yet this server remains stuck on the kernel tasks above and keeps reporting the zpool as available and healthy (per zpool list and zfs list).

It seems as if ZFS is stuck forever trying to complete txg_sync_thread. I would expect it to realize that the disks are gone/failed and eventually return an error (and perhaps retry?), but instead it hangs forever and takes the whole ZFS subsystem down with it.

bunder2015 commented 6 years ago
     failmode=wait|continue|panic
             Controls the system behavior in the event of catastrophic pool failure.  This condition is typically a result of a loss of connectivity to the underlying storage device(s) or a failure of all devices within the pool.
             The behavior of such an event is determined as follows:

             wait      Blocks all I/O access until the device connectivity is recovered and the errors are cleared.  This is the default behavior.

             continue  Returns EIO to any new write I/O requests but allows reads to any of the remaining healthy devices.  Any write requests that have yet to be committed to disk would be blocked.

             panic     Prints out a message to the console and generates a system crash dump.
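For example, the property can be inspected and changed on a live pool (a minimal sketch; 'continue' here is only an illustration, not a recommendation):

# show the current failure mode of the pool
zpool get failmode zpool
# switch from the default 'wait' to 'continue'
zpool set failmode=continue zpool
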
gerardba commented 6 years ago

My understanding is that for a different 'failmode' to help, I'd have to see errors in zpool status, right?

Right now status looks clean even though the disks have been gone for over 24h:

root@kea2:~# zpool status
  pool: zpool
 state: ONLINE
  scan: scrub repaired 0B in 2h43m with 0 errors on Sun Aug 12 03:07:52 2018
config:

    NAME          STATE     READ WRITE CKSUM
    zpool         ONLINE       0     0     0
      raidz2-0    ONLINE       0     0     0
        nvme12n1  ONLINE       0     0     0
        nvme13n1  ONLINE       0     0     0
        nvme14n1  ONLINE       0     0     0
        nvme15n1  ONLINE       0     0     0
        nvme0n1   ONLINE       0     0     0
        nvme1n1   ONLINE       0     0     0
        nvme2n1   ONLINE       0     0     0
        nvme3n1   ONLINE       0     0     0
        nvme4n1   ONLINE       0     0     0
        nvme5n1   ONLINE       0     0     0
        nvme6n1   ONLINE       0     0     0
        nvme7n1   ONLINE       0     0     0
        nvme8n1   ONLINE       0     0     0
        nvme9n1   ONLINE       0     0     0
        nvme10n1  ONLINE       0     0     0
        nvme11n1  ONLINE       0     0     0

errors: No known data errors
root@kea2:~# nvme list
root@kea2:~#
gerardba commented 5 years ago

After upgrading to zfs 0.8.0-rc3_126_g5dbf8b4ed (compiled from master), I'm still seeing the same or a very similar issue on kernel 4.14.97. A fresh stack trace:

[ 4669.240201] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4669.247068] nfsd            D    0  2742      2 0x80000000
[ 4669.247069] Call Trace:
[ 4669.247071]  ? __schedule+0x3b4/0x7e0
[ 4669.247072]  ? packet_rcv+0x41/0x430
[ 4669.247073]  schedule+0x32/0x80
[ 4669.247074]  schedule_preempt_disabled+0xa/0x10
[ 4669.247075]  __mutex_lock.isra.2+0x279/0x4a0
[ 4669.247077]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[ 4669.247079]  ? sch_direct_xmit+0x62/0x1a0
[ 4669.247105]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[ 4669.247132]  zfs_znode_hold_enter+0x107/0x150 [zfs]
[ 4669.247136]  ? bond_start_xmit+0x168/0x3d0 [bonding]
[ 4669.247139]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[ 4669.247162]  zfs_zget+0xfb/0x220 [zfs]
[ 4669.247164]  ? __dev_queue_xmit+0x283/0x630
[ 4669.247165]  ? getboottime64+0x22/0x30
[ 4669.247168]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[ 4669.247193]  zfs_vget+0x1fc/0x390 [zfs]
[ 4669.247197]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[ 4669.247222]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[ 4669.247226]  exportfs_decode_fh+0x49/0x290
[ 4669.247230]  ? exp_find_key+0x6a/0xa0 [nfsd]
[ 4669.247231]  ? kmem_cache_alloc+0xff/0x510
[ 4669.247235]  ? exp_find+0x5b/0xb0 [nfsd]
[ 4669.247238]  fh_verify+0x2a6/0x5e0 [nfsd]
[ 4669.247241]  nfsd_open+0x2c/0x180 [nfsd]
[ 4669.247244]  nfsd_write+0x4e/0x1e0 [nfsd]
[ 4669.247248]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[ 4669.247251]  nfsd_dispatch+0xb1/0x240 [nfsd]
[ 4669.247257]  svc_process_common+0x395/0x6e0 [sunrpc]
[ 4669.247263]  svc_process+0xf6/0x1a0 [sunrpc]
[ 4669.247266]  nfsd+0xe3/0x150 [nfsd]
[ 4669.247267]  kthread+0xff/0x130
[ 4669.247270]  ? nfsd_destroy+0x60/0x60 [nfsd]
[ 4669.247271]  ? kthread_create_on_node+0x40/0x40
[ 4669.247272]  ret_from_fork+0x35/0x40

In the process list we have z_wr_iss, txg_sync and nfsd in D state; here are some traces from them:

crash> bt 3002
PID: 3002   TASK: ffff9396f6d81840  CPU: 9   COMMAND: "nfsd"
 #0 [ffffac47e40df818] __schedule at ffffffff89694074
 #1 [ffffac47e40df8a8] schedule at ffffffff896944d2
 #2 [ffffac47e40df8b8] rwsem_down_read_failed at ffffffff8969772b
 #3 [ffffac47e40df938] call_rwsem_down_read_failed at ffffffff8968a314
 #4 [ffffac47e40df980] down_read at ffffffff89696a33
 #5 [ffffac47e40df988] dnode_hold_impl at ffffffffc70e71e9 [zfs]
 #6 [ffffac47e40dfa30] dmu_buf_hold_noread at ffffffffc70d1dd7 [zfs]
 #7 [ffffac47e40dfa60] dmu_buf_hold at ffffffffc70d1ec7 [zfs]
 #8 [ffffac47e40dfa90] zap_lockdir at ffffffffc715ef87 [zfs]
 #9 [ffffac47e40dfad0] zap_lookup_norm at ffffffffc715f3b0 [zfs]
#10 [ffffac47e40dfb20] zap_lookup at ffffffffc715f422 [zfs]
#11 [ffffac47e40dfb48] zfs_dirent_lock at ffffffffc716cb2b [zfs]
#12 [ffffac47e40dfbc0] zfs_dirlook at ffffffffc716cbc7 [zfs]
#13 [ffffac47e40dfc20] zfs_lookup at ffffffffc718583d [zfs]
#14 [ffffac47e40dfc78] zpl_lookup at ffffffffc71aa5b9 [zfs]
#15 [ffffac47e40dfcf8] __lookup_hash at ffffffff891fb762
#16 [ffffac47e40dfd28] lookup_one_len at ffffffff891fb876
#17 [ffffac47e40dfd60] nfsd_lookup_dentry at ffffffffc07dae5d [nfsd]
#18 [ffffac47e40dfdd0] nfsd_lookup at ffffffffc07db18b [nfsd]
#19 [ffffac47e40dfe18] nfsd3_proc_lookup at ffffffffc07e3220 [nfsd]
#20 [ffffac47e40dfe48] nfsd_dispatch at ffffffffc07d7151 [nfsd]
#21 [ffffac47e40dfe78] svc_process_common at ffffffffc0637525 [sunrpc]
#22 [ffffac47e40dfed0] svc_process at ffffffffc06386b6 [sunrpc]
#23 [ffffac47e40dfef0] nfsd at ffffffffc07d6c03 [nfsd]
#24 [ffffac47e40dff10] kthread at ffffffff8909680f
#25 [ffffac47e40dff50] ret_from_fork at ffffffff898001f5
crash> bt 3001
PID: 3001   TASK: ffff9396f6c81800  CPU: 51  COMMAND: "nfsd"
 #0 [ffffac47e40c7818] __schedule at ffffffff89694074
 #1 [ffffac47e40c78a8] schedule at ffffffff896944d2
 #2 [ffffac47e40c78b8] rwsem_down_read_failed at ffffffff8969772b
 #3 [ffffac47e40c7938] call_rwsem_down_read_failed at ffffffff8968a314
 #4 [ffffac47e40c7980] down_read at ffffffff89696a33
 #5 [ffffac47e40c7988] dnode_hold_impl at ffffffffc70e71e9 [zfs]
 #6 [ffffac47e40c7a30] dmu_object_alloc_impl at ffffffffc70d3df3 [zfs]
 #7 [ffffac47e40c7ad0] dmu_object_alloc_dnsize at ffffffffc70d401c [zfs]
 #8 [ffffac47e40c7af8] zfs_mknode at ffffffffc7191460 [zfs]
 #9 [ffffac47e40c7c28] zfs_create at ffffffffc7185edc [zfs]
#10 [ffffac47e40c7ce8] zpl_create at ffffffffc71aaed4 [zfs]
#11 [ffffac47e40c7d48] vfs_create at ffffffff891fc79c
#12 [ffffac47e40c7d80] do_nfsd_create at ffffffffc07dd817 [nfsd]
#13 [ffffac47e40c7df8] nfsd3_proc_create at ffffffffc07e3978 [nfsd]
#14 [ffffac47e40c7e48] nfsd_dispatch at ffffffffc07d7151 [nfsd]
#15 [ffffac47e40c7e78] svc_process_common at ffffffffc0637525 [sunrpc]
#16 [ffffac47e40c7ed0] svc_process at ffffffffc06386b6 [sunrpc]
#17 [ffffac47e40c7ef0] nfsd at ffffffffc07d6c03 [nfsd]
#18 [ffffac47e40c7f10] kthread at ffffffff8909680f
#19 [ffffac47e40c7f50] ret_from_fork at ffffffff898001f5
crash> bt 3000
PID: 3000   TASK: ffff9396f73457c0  CPU: 43  COMMAND: "nfsd"
 #0 [ffffac47e409f888] __schedule at ffffffff89694074
 #1 [ffffac47e409f918] schedule at ffffffff896944d2
 #2 [ffffac47e409f928] rwsem_down_read_failed at ffffffff8969772b
 #3 [ffffac47e409f9a8] call_rwsem_down_read_failed at ffffffff8968a314
 #4 [ffffac47e409f9f0] down_read at ffffffff89696a33
 #5 [ffffac47e409f9f8] dbuf_read at ffffffffc70c6486 [zfs]
 #6 [ffffac47e409fa60] dnode_hold_impl at ffffffffc70e7241 [zfs]
 #7 [ffffac47e409fb08] dmu_bonus_hold at ffffffffc70cf812 [zfs]
 #8 [ffffac47e409fb30] zfs_zget at ffffffffc7193991 [zfs]
 #9 [ffffac47e409fbc0] zfs_vget at ffffffffc718236c [zfs]
#10 [ffffac47e409fc08] zpl_fh_to_dentry at ffffffffc71a8fa8 [zfs]
#11 [ffffac47e409fc28] exportfs_decode_fh at ffffffff892cfa99
#12 [ffffac47e409fd70] fh_verify at ffffffffc07da536 [nfsd]
#13 [ffffac47e409fdd0] nfsd_lookup at ffffffffc07db15f [nfsd]
#14 [ffffac47e409fe18] nfsd3_proc_lookup at ffffffffc07e3220 [nfsd]
#15 [ffffac47e409fe48] nfsd_dispatch at ffffffffc07d7151 [nfsd]
#16 [ffffac47e409fe78] svc_process_common at ffffffffc0637525 [sunrpc]
#17 [ffffac47e409fed0] svc_process at ffffffffc06386b6 [sunrpc]
#18 [ffffac47e409fef0] nfsd at ffffffffc07d6c03 [nfsd]
#19 [ffffac47e409ff10] kthread at ffffffff8909680f
#20 [ffffac47e409ff50] ret_from_fork at ffffffff898001f5
crash>
crash> bt 2971
PID: 2971   TASK: ffff9396f8db90c0  CPU: 61  COMMAND: "nfsd"
 #0 [ffffac47e02778a8] __schedule at ffffffff89694074
 #1 [ffffac47e0277938] schedule at ffffffff896944d2
 #2 [ffffac47e0277948] rwsem_down_read_failed at ffffffff8969772b
 #3 [ffffac47e02779c0] call_rwsem_down_read_failed at ffffffff8968a314
 #4 [ffffac47e0277a08] down_read at ffffffff89696a33
 #5 [ffffac47e0277a10] dnode_hold_impl at ffffffffc70e71e9 [zfs]
 #6 [ffffac47e0277ab8] dmu_bonus_hold at ffffffffc70cf812 [zfs]
 #7 [ffffac47e0277ae0] zfs_zget at ffffffffc7193991 [zfs]
 #8 [ffffac47e0277b70] zfs_vget at ffffffffc718236c [zfs]
 #9 [ffffac47e0277bb8] zpl_fh_to_dentry at ffffffffc71a8fa8 [zfs]
#10 [ffffac47e0277bd8] exportfs_decode_fh at ffffffff892cfa99
#11 [ffffac47e0277d20] fh_verify at ffffffffc07da536 [nfsd]
#12 [ffffac47e0277d80] do_nfsd_create at ffffffffc07dd4e5 [nfsd]
#13 [ffffac47e0277df8] nfsd3_proc_create at ffffffffc07e3978 [nfsd]
#14 [ffffac47e0277e48] nfsd_dispatch at ffffffffc07d7151 [nfsd]
#15 [ffffac47e0277e78] svc_process_common at ffffffffc0637525 [sunrpc]
#16 [ffffac47e0277ed0] svc_process at ffffffffc06386b6 [sunrpc]
#17 [ffffac47e0277ef0] nfsd at ffffffffc07d6c03 [nfsd]
#18 [ffffac47e0277f10] kthread at ffffffff8909680f
#19 [ffffac47e0277f50] ret_from_fork at ffffffff898001f5
crash> bt 2893
PID: 2893   TASK: ffff9396fde49a40  CPU: 4   COMMAND: "nfsd"
 #0 [ffffac47cfa3b8a8] __schedule at ffffffff89694074
 #1 [ffffac47cfa3b938] schedule at ffffffff896944d2
 #2 [ffffac47cfa3b948] rwsem_down_read_failed at ffffffff8969772b
 #3 [ffffac47cfa3b9c0] call_rwsem_down_read_failed at ffffffff8968a314
 #4 [ffffac47cfa3ba08] down_read at ffffffff89696a33
 #5 [ffffac47cfa3ba10] dnode_hold_impl at ffffffffc70e71e9 [zfs]
 #6 [ffffac47cfa3bab8] dmu_bonus_hold at ffffffffc70cf812 [zfs]
 #7 [ffffac47cfa3bae0] zfs_zget at ffffffffc7193991 [zfs]
 #8 [ffffac47cfa3bb70] zfs_vget at ffffffffc718236c [zfs]
 #9 [ffffac47cfa3bbb8] zpl_fh_to_dentry at ffffffffc71a8fa8 [zfs]
#10 [ffffac47cfa3bbd8] exportfs_decode_fh at ffffffff892cfa99
#11 [ffffac47cfa3bd20] fh_verify at ffffffffc07da536 [nfsd]
#12 [ffffac47cfa3bd80] do_nfsd_create at ffffffffc07dd4e5 [nfsd]
#13 [ffffac47cfa3bdf8] nfsd3_proc_create at ffffffffc07e3978 [nfsd]
#14 [ffffac47cfa3be48] nfsd_dispatch at ffffffffc07d7151 [nfsd]
#15 [ffffac47cfa3be78] svc_process_common at ffffffffc0637525 [sunrpc]
#16 [ffffac47cfa3bed0] svc_process at ffffffffc06386b6 [sunrpc]
#17 [ffffac47cfa3bef0] nfsd at ffffffffc07d6c03 [nfsd]
#18 [ffffac47cfa3bf10] kthread at ffffffff8909680f
#19 [ffffac47cfa3bf50] ret_from_fork at ffffffff898001f5
crash> bt 10127
PID: 10127  TASK: ffff9396dd2cc740  CPU: 21  COMMAND: "txg_sync"
 #0 [ffffac47e976fb38] __schedule at ffffffff89694074
 #1 [ffffac47e976fbc8] schedule at ffffffff896944d2
 #2 [ffffac47e976fbd8] schedule_timeout at ffffffff89697d12
 #3 [ffffac47e976fc60] io_schedule_timeout at ffffffff89694879
 #4 [ffffac47e976fc78] __cv_timedwait_common at ffffffffc09d53d2 [spl]
 #5 [ffffac47e976fcd8] zio_wait at ffffffffc719f684 [zfs]
 #6 [ffffac47e976fd28] dsl_pool_sync at ffffffffc7102c59 [zfs]
 #7 [ffffac47e976fda0] spa_sync at ffffffffc7126a20 [zfs]
 #8 [ffffac47e976fe40] txg_sync_thread at ffffffffc713b4e1 [zfs]
 #9 [ffffac47e976fee8] thread_generic_wrapper at ffffffffc09dc1af [spl]
#10 [ffffac47e976ff10] kthread at ffffffff8909680f
#11 [ffffac47e976ff50] ret_from_fork at ffffffff898001f5
crash> bt 9335
PID: 9335   TASK: ffff9356ae530880  CPU: 19  COMMAND: "z_wr_iss"
 #0 [ffffac47e70cfb48] __schedule at ffffffff89694074
 #1 [ffffac47e70cfbd8] schedule at ffffffff896944d2
 #2 [ffffac47e70cfbe8] rwsem_down_write_failed at ffffffff896972a0
 #3 [ffffac47e70cfc88] call_rwsem_down_write_failed at ffffffff8968a343
 #4 [ffffac47e70cfcc8] down_write at ffffffff89696a70
 #5 [ffffac47e70cfcd0] dbuf_write_ready at ffffffffc70c4c81 [zfs]
 #6 [ffffac47e70cfd08] arc_write_ready at ffffffffc70bdfc4 [zfs]
 #7 [ffffac47e70cfd60] zio_ready at ffffffffc71a17b5 [zfs]
 #8 [ffffac47e70cfdb8] zio_execute at ffffffffc719c53d [zfs]
 #9 [ffffac47e70cfde8] taskq_thread at ffffffffc09db479 [spl]
#10 [ffffac47e70cff10] kthread at ffffffff8909680f
#11 [ffffac47e70cff50] ret_from_fork at ffffffff898001f5
crash> bt 9331
PID: 9331   TASK: ffff9356ae520780  CPU: 60  COMMAND: "z_wr_iss"
 #0 [ffffac47e70afb48] __schedule at ffffffff89694074
 #1 [ffffac47e70afbd8] schedule at ffffffff896944d2
 #2 [ffffac47e70afbe8] rwsem_down_write_failed at ffffffff896972a0
 #3 [ffffac47e70afc88] call_rwsem_down_write_failed at ffffffff8968a343
 #4 [ffffac47e70afcc8] down_write at ffffffff89696a70
 #5 [ffffac47e70afcd0] dbuf_write_ready at ffffffffc70c4c81 [zfs]
 #6 [ffffac47e70afd08] arc_write_ready at ffffffffc70bdfc4 [zfs]
 #7 [ffffac47e70afd60] zio_ready at ffffffffc71a17b5 [zfs]
 #8 [ffffac47e70afdb8] zio_execute at ffffffffc719c53d [zfs]
 #9 [ffffac47e70afde8] taskq_thread at ffffffffc09db479 [spl]
#10 [ffffac47e70aff10] kthread at ffffffff8909680f
#11 [ffffac47e70aff50] ret_from_fork at ffffffff898001f5
crash> bt 9293
PID: 9293   TASK: ffff9356aea35b80  CPU: 22  COMMAND: "z_wr_iss"
 #0 [ffffac47e50ffb48] __schedule at ffffffff89694074
 #1 [ffffac47e50ffbd8] schedule at ffffffff896944d2
 #2 [ffffac47e50ffbe8] rwsem_down_write_failed at ffffffff896972a0
 #3 [ffffac47e50ffc88] call_rwsem_down_write_failed at ffffffff8968a343
 #4 [ffffac47e50ffcc8] down_write at ffffffff89696a70
 #5 [ffffac47e50ffcd0] dbuf_write_ready at ffffffffc70c4c81 [zfs]
 #6 [ffffac47e50ffd08] arc_write_ready at ffffffffc70bdfc4 [zfs]
 #7 [ffffac47e50ffd60] zio_ready at ffffffffc71a17b5 [zfs]
 #8 [ffffac47e50ffdb8] zio_execute at ffffffffc719c53d [zfs]
 #9 [ffffac47e50ffde8] taskq_thread at ffffffffc09db479 [spl]
#10 [ffffac47e50fff10] kthread at ffffffff8909680f
#11 [ffffac47e50fff50] ret_from_fork at ffffffff898001f5

Now I also know how to reproduce it: run parallel mdtest from a few nodes at the same time, e.g. from 5 nodes each running:

time mpirun -np 32 mdtest -d /mnt/zpool/mdtest-run-$HOSTNAME -i 2 -I 1024 -z 1 -b 512 -L -u -F

Where /mnt/zpool is a ZFS dataset exported with sharenfs=rw=@10.0.0.0/255.0.0.0,async,no_root_squash
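For completeness, this is roughly how such an export is set up (a sketch; the dataset name below is just a placeholder for whatever is mounted at /mnt/zpool):

# set and verify the NFS export on the dataset
zfs set sharenfs='rw=@10.0.0.0/255.0.0.0,async,no_root_squash' zpool/DATASET
zfs get sharenfs zpool/DATASET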

With 0.7.13 it would get stuck (with a different error) early on in the parallel file-creation phase. With 0.8 it looks like it gets stuck once it starts removing files.

Is there any other useful information I can provide to help debug this issue?

gerardba commented 5 years ago

Now, with 0.8 and the deadman enabled, there are some zpool events before the hang, and it took much longer (many hours) for ZFS to get stuck while running parallel mdtest on top of regular, concurrent read/write activity.
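For reference, these are the deadman knobs involved, settable at runtime through the module parameters (a minimal sketch; the values simply mirror the tunables dump further down, not a recommendation):

# enable the deadman
echo 1        > /sys/module/zfs/parameters/zfs_deadman_enabled
# deadman failure mode: wait, continue, or panic (this system uses continue)
echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode
# deadlines, in milliseconds, for a syncing txg and for an individual zio
echo 600000   > /sys/module/zfs/parameters/zfs_deadman_synctime_ms
echo 300000   > /sys/module/zfs/parameters/zfs_deadman_ziotime_ms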

Here's the dmesg trace:

[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2845 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2845      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  ? packet_rcv+0x41/0x430
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  schedule_preempt_disabled+0xa/0x10
[Sun Mar 31 02:45:50 2019]  __mutex_lock.isra.2+0x279/0x4a0
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  ? sch_direct_xmit+0x62/0x1a0
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  ? bond_start_xmit+0x168/0x3d0 [bonding]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0xfb/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? __dev_queue_xmit+0x283/0x630
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40
[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2846 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2846      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  ? dev_hard_start_xmit+0xa0/0x1e0
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  rwsem_down_read_failed+0xfb/0x140
[Sun Mar 31 02:45:50 2019]  ? call_rwsem_down_read_failed+0x14/0x30
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  call_rwsem_down_read_failed+0x14/0x30
[Sun Mar 31 02:45:50 2019]  ? zpl_encode_fh+0xa0/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  down_read+0x13/0x30
[Sun Mar 31 02:45:50 2019]  dnode_hold_impl+0xc9/0xc50 [zfs]
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  dmu_bonus_hold+0x22/0x60 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0x111/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? sch_direct_xmit+0x62/0x1a0
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40
[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2847 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2847      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  ? packet_rcv+0x41/0x430
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  schedule_preempt_disabled+0xa/0x10
[Sun Mar 31 02:45:50 2019]  __mutex_lock.isra.2+0x279/0x4a0
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  ? sch_direct_xmit+0x62/0x1a0
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  ? bond_start_xmit+0x168/0x3d0 [bonding]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0xfb/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? __dev_queue_xmit+0x283/0x630
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? tcp_v4_do_rcv+0x114/0x1e0
[Sun Mar 31 02:45:50 2019]  ? __release_sock+0x91/0xd0
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40
[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2848 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2848      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  ? packet_rcv+0x41/0x430
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  schedule_preempt_disabled+0xa/0x10
[Sun Mar 31 02:45:50 2019]  __mutex_lock.isra.2+0x279/0x4a0
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  ? sch_direct_xmit+0x62/0x1a0
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  ? bond_start_xmit+0x168/0x3d0 [bonding]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0xfb/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? __dev_queue_xmit+0x283/0x630
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40
[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2849 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2849      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  ? packet_rcv+0x41/0x430
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  schedule_preempt_disabled+0xa/0x10
[Sun Mar 31 02:45:50 2019]  __mutex_lock.isra.2+0x279/0x4a0
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  ? sch_direct_xmit+0x62/0x1a0
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  ? bond_start_xmit+0x168/0x3d0 [bonding]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0xfb/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? __dev_queue_xmit+0x283/0x630
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? tcp_v4_do_rcv+0x114/0x1e0
[Sun Mar 31 02:45:50 2019]  ? __release_sock+0x91/0xd0
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40
[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2850 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2850      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  ? packet_rcv+0x41/0x430
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  schedule_preempt_disabled+0xa/0x10
[Sun Mar 31 02:45:50 2019]  __mutex_lock.isra.2+0x279/0x4a0
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  ? sch_direct_xmit+0x62/0x1a0
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  ? bond_start_xmit+0x168/0x3d0 [bonding]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0xfb/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? __dev_queue_xmit+0x283/0x630
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? tcp_v4_do_rcv+0x114/0x1e0
[Sun Mar 31 02:45:50 2019]  ? __release_sock+0x91/0xd0
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40
[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2851 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2851      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  ? packet_rcv+0x41/0x430
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  schedule_preempt_disabled+0xa/0x10
[Sun Mar 31 02:45:50 2019]  __mutex_lock.isra.2+0x279/0x4a0
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  ? sch_direct_xmit+0x62/0x1a0
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  ? bond_start_xmit+0x168/0x3d0 [bonding]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0xfb/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? __dev_queue_xmit+0x283/0x630
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? tcp_v4_do_rcv+0x114/0x1e0
[Sun Mar 31 02:45:50 2019]  ? __release_sock+0x91/0xd0
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40
[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2852 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2852      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  ? packet_rcv+0x41/0x430
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  schedule_preempt_disabled+0xa/0x10
[Sun Mar 31 02:45:50 2019]  __mutex_lock.isra.2+0x279/0x4a0
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  ? sch_direct_xmit+0x62/0x1a0
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  ? bond_start_xmit+0x168/0x3d0 [bonding]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0xfb/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? __dev_queue_xmit+0x283/0x630
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40
[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2853 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2853      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  ? packet_rcv+0x41/0x430
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  schedule_preempt_disabled+0xa/0x10
[Sun Mar 31 02:45:50 2019]  __mutex_lock.isra.2+0x279/0x4a0
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  ? sch_direct_xmit+0x62/0x1a0
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  ? bond_start_xmit+0x168/0x3d0 [bonding]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0xfb/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? __dev_queue_xmit+0x283/0x630
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? tcp_v4_do_rcv+0x114/0x1e0
[Sun Mar 31 02:45:50 2019]  ? __release_sock+0x91/0xd0
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40
[Sun Mar 31 02:45:50 2019] INFO: task nfsd:2854 blocked for more than 120 seconds.
[Sun Mar 31 02:45:50 2019]       Tainted: P           O    4.14.97.hrtdev #1
[Sun Mar 31 02:45:50 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 31 02:45:50 2019] nfsd            D    0  2854      2 0x80000000
[Sun Mar 31 02:45:50 2019] Call Trace:
[Sun Mar 31 02:45:50 2019]  ? __schedule+0x3b4/0x7e0
[Sun Mar 31 02:45:50 2019]  schedule+0x32/0x80
[Sun Mar 31 02:45:50 2019]  rwsem_down_read_failed+0xfb/0x140
[Sun Mar 31 02:45:50 2019]  ? call_rwsem_down_read_failed+0x14/0x30
[Sun Mar 31 02:45:50 2019]  ? spl_kmem_cache_alloc+0xfe/0x860 [spl]
[Sun Mar 31 02:45:50 2019]  call_rwsem_down_read_failed+0x14/0x30
[Sun Mar 31 02:45:50 2019]  down_read+0x13/0x30
[Sun Mar 31 02:45:50 2019]  dnode_hold_impl+0xc9/0xc50 [zfs]
[Sun Mar 31 02:45:50 2019]  ? zfs_znode_hold_enter+0x107/0x150 [zfs]
[Sun Mar 31 02:45:50 2019]  dmu_bonus_hold+0x22/0x60 [zfs]
[Sun Mar 31 02:45:50 2019]  zfs_zget+0x111/0x220 [zfs]
[Sun Mar 31 02:45:50 2019]  ? __dev_queue_xmit+0x283/0x630
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zfs_vget+0x1fc/0x390 [zfs]
[Sun Mar 31 02:45:50 2019]  ? nfsd_proc_setattr+0x180/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  zpl_fh_to_dentry+0x58/0xa0 [zfs]
[Sun Mar 31 02:45:50 2019]  exportfs_decode_fh+0x49/0x290
[Sun Mar 31 02:45:50 2019]  ? exp_find_key+0x6a/0xa0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kmem_cache_alloc+0xff/0x510
[Sun Mar 31 02:45:50 2019]  ? exp_find+0x5b/0xb0 [nfsd]
[Sun Mar 31 02:45:50 2019]  fh_verify+0x2a6/0x5e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? getboottime64+0x22/0x30
[Sun Mar 31 02:45:50 2019]  nfsd_open+0x2c/0x180 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_write+0x4e/0x1e0 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd3_proc_write+0x8d/0x110 [nfsd]
[Sun Mar 31 02:45:50 2019]  nfsd_dispatch+0xb1/0x240 [nfsd]
[Sun Mar 31 02:45:50 2019]  svc_process_common+0x395/0x6e0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  svc_process+0xf6/0x1a0 [sunrpc]
[Sun Mar 31 02:45:50 2019]  nfsd+0xe3/0x150 [nfsd]
[Sun Mar 31 02:45:50 2019]  kthread+0xff/0x130
[Sun Mar 31 02:45:50 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sun Mar 31 02:45:50 2019]  ? kthread_create_on_node+0x40/0x40
[Sun Mar 31 02:45:50 2019]  ret_from_fork+0x35/0x40

And some of the zpool events:

Mar 31 2019 09:01:53.420546561 ereport.fs.zfs.deadman
        class = "ereport.fs.zfs.deadman"
        ena = 0xd2ea468c63300001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x4ddb8a5d6c54d028
        (end detector)
        pool = "zpool"
        pool_guid = 0x4ddb8a5d6c54d028
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        zio_err = 0x0
        zio_flags = 0x0
        zio_stage = 0x10
        zio_pipeline = 0x17980f4
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_objset = 0x108
        zio_object = 0x0
        zio_level = 0x1
        zio_blkid = 0x39b
        time = 0x5ca0ba41 0x19110801
        eid = 0xe5a

Mar 31 2019 09:01:53.420546561 ereport.fs.zfs.deadman
        class = "ereport.fs.zfs.deadman"
        ena = 0xd2ea46963c200001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x4ddb8a5d6c54d028
        (end detector)
        pool = "zpool"
        pool_guid = 0x4ddb8a5d6c54d028
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        zio_err = 0x0
        zio_flags = 0x0
        zio_stage = 0x10
        zio_pipeline = 0x17980f4
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_objset = 0x108
        zio_object = 0x0
        zio_level = 0x1
        zio_blkid = 0x39a
        time = 0x5ca0ba41 0x19110801
        eid = 0xe5b

Mar 31 2019 09:01:53.420546561 ereport.fs.zfs.deadman
        class = "ereport.fs.zfs.deadman"
        ena = 0xd2ea46a263100001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x4ddb8a5d6c54d028
        (end detector)
        pool = "zpool"
        pool_guid = 0x4ddb8a5d6c54d028
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        zio_err = 0x0
        zio_flags = 0x0
        zio_stage = 0x10
        zio_pipeline = 0x17980f4
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_objset = 0x108
        zio_object = 0x0
        zio_level = 0x1
        zio_blkid = 0x399
        time = 0x5ca0ba41 0x19110801
        eid = 0xe5c

Mar 31 2019 09:01:53.420546561 ereport.fs.zfs.deadman
        class = "ereport.fs.zfs.deadman"
        ena = 0xd2ea46b2e6f00001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x4ddb8a5d6c54d028
        (end detector)
        pool = "zpool"
        pool_guid = 0x4ddb8a5d6c54d028
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        zio_err = 0x0
        zio_flags = 0x0
        zio_stage = 0x10
        zio_pipeline = 0x17980f4
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_objset = 0x108
        zio_object = 0x0
        zio_level = 0x1
        zio_blkid = 0x398
        time = 0x5ca0ba41 0x19110801
        eid = 0xe5d

Mar 31 2019 09:01:53.420546561 ereport.fs.zfs.deadman
        class = "ereport.fs.zfs.deadman"
        ena = 0xd2ea46ba21900001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x4ddb8a5d6c54d028
        (end detector)
        pool = "zpool"
        pool_guid = 0x4ddb8a5d6c54d028
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        zio_err = 0x0
        zio_flags = 0x0
        zio_stage = 0x10
        zio_pipeline = 0x17980f4
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_objset = 0x108
        zio_object = 0x0
        zio_level = 0x1
        zio_blkid = 0x347
        time = 0x5ca0ba41 0x19110801
        eid = 0xe5e

I suspect there have been more zpool events since 02:48, when ZFS got stuck, but 'zpool events' seems to only show the last few.
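If it helps for next time, the event queue length can be raised and events streamed live (a sketch, assuming the zfs_zevent_len_max module parameter is available in this build):

# keep more events in the in-kernel queue instead of only the most recent ones
echo 4096 > /sys/module/zfs/parameters/zfs_zevent_len_max
# print events verbosely and follow new ones as they arrive
zpool events -v -f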

txg_sync, a few z_wr_iss threads and all of the nfsd threads are the processes in D state.
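(A quick way to enumerate them, nothing ZFS-specific:)

# list every task currently in uninterruptible sleep (D state)
ps -eo pid,state,comm | awk '$2 == "D"'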

Here's all the ZFS tuning:

# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Sun Mar 31 09:05:32 2019
Linux 4.14.97.hrtdev                            0.8.0-rc3_126_g5dbf8b4ed
Machine: kea2 (x86_64)                          0.8.0-rc3_126_g5dbf8b4ed

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    91.1 %  411.5 GiB
        Target size (adaptive):                        91.1 %  411.5 GiB
        Min size (hard limit):                         55.6 %  251.0 GiB
        Max size (high water):                            1:1  451.9 GiB
        Most Frequently Used (MFU) cache size:         28.5 %  116.9 GiB
        Most Recently Used (MRU) cache size:           71.5 %  292.5 GiB
        Metadata cache size (hard limit):              75.0 %  338.9 GiB
        Metadata cache size (current):                  1.7 %    5.9 GiB
        Dnode cache size (hard limit):                 10.0 %   33.9 GiB
        Dnode cache size (current):                     1.9 %  662.6 MiB

ARC hash breakdown:
        Elements max:                                               7.1M
        Elements current:                              47.5 %       3.4M
        Collisions:                                                22.1M
        Chain max:                                                     5
        Chains:                                                    81.5k

ARC misc:
        Deleted:                                                   74.5M
        Mutex misses:                                               9.5M
        Eviction skips:                                             1.9M

ARC total accesses (hits + misses):                                70.3G
        Cache hit ratio:                               99.8 %      70.2G
        Cache miss ratio:                               0.2 %     129.8M
        Actual hit ratio (MFU + MRU hits):             99.5 %      70.0G
        Data demand efficiency:                        99.1 %       1.3G
        Data prefetch efficiency:                         n/a          0

Cache hits by cache type:
        Most frequently used (MFU):                    94.8 %      66.6G
        Most recently used (MRU):                       4.9 %       3.4G
        Most frequently used (MFU) ghost:               0.1 %      55.9M
        Most recently used (MRU) ghost:                 0.1 %      62.8M
        Anonymously used:                               0.1 %      71.4M

Cache hits by data type:
        Demand data:                                    1.9 %       1.3G
        Demand prefetch data:                           0.0 %          0
        Demand metadata:                               97.8 %      68.6G
        Demand prefetch metadata:                       0.4 %     265.6M

Cache misses by data type:
        Demand data:                                    9.2 %      11.9M
        Demand prefetch data:                           0.0 %          0
        Demand metadata:                               79.4 %     103.0M
        Demand prefetch metadata:                      11.4 %      14.8M

DMU prefetch efficiency:                                               0
        Hit ratio:                                        n/a          0
        Miss ratio:                                       n/a          0

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
        spl_hostid                                                     0
        spl_hostid_path                                      /etc/hostid
        spl_kmem_alloc_max                                       1048576
        spl_kmem_alloc_warn                                        65536
        spl_kmem_cache_expire                                          2
        spl_kmem_cache_kmem_limit                                   2048
        spl_kmem_cache_kmem_threads                                    4
        spl_kmem_cache_magazine_size                                   0
        spl_kmem_cache_max_size                                       32
        spl_kmem_cache_obj_per_slab                                    8
        spl_kmem_cache_obj_per_slab_min                                1
        spl_kmem_cache_reclaim                                         0
        spl_kmem_cache_slab_limit                                  16384
        spl_max_show_tasks                                           512
        spl_panic_halt                                                 0
        spl_taskq_kick                                                 0
        spl_taskq_thread_bind                                          0
        spl_taskq_thread_dynamic                                       1
        spl_taskq_thread_priority                                      1
        spl_taskq_thread_sequential                                    4

Tunables:
        dbuf_cache_hiwater_pct                                        10
        dbuf_cache_lowater_pct                                        10
        dbuf_cache_max_bytes                                  8423723520
        dbuf_cache_shift                                               5
        dbuf_metadata_cache_max_bytes                         4211861760
        dbuf_metadata_cache_shift                                      6
        dmu_object_alloc_chunk_shift                                   7
        ignore_hole_birth                                              1
        l2arc_feed_again                                               1
        l2arc_feed_min_ms                                            200
        l2arc_feed_secs                                                1
        l2arc_headroom                                                 2
        l2arc_headroom_boost                                         200
        l2arc_noprefetch                                               1
        l2arc_norw                                                     0
        l2arc_write_boost                                        8388608
        l2arc_write_max                                          8388608
        metaslab_aliquot                                          524288
        metaslab_bias_enabled                                          1
        metaslab_debug_load                                            0
        metaslab_debug_unload                                          0
        metaslab_force_ganging                                  16777217
        metaslab_fragmentation_factor_enabled                          1
        metaslab_lba_weighting_enabled                                 1
        metaslab_preload_enabled                                       1
        send_holes_without_birth_time                                  1
        spa_asize_inflation                                           24
        spa_config_path                             /etc/zfs/zpool.cache
        spa_load_print_vdev_tree                                       0
        spa_load_verify_data                                           1
        spa_load_verify_maxinflight                                10000
        spa_load_verify_metadata                                       1
        spa_slop_shift                                                 5
        vdev_removal_max_span                                      32768
        vdev_validate_skip                                             0
        zfetch_array_rd_sz                                       1048576
        zfetch_max_distance                                      8388608
        zfetch_max_streams                                             8
        zfetch_min_sec_reap                                            2
        zfs_abd_scatter_enabled                                        1
        zfs_abd_scatter_max_order                                     10
        zfs_abd_scatter_min_size                                    1536
        zfs_admin_snapshot                                             0
        zfs_arc_average_blocksize                                   8192
        zfs_arc_dnode_limit                                            0
        zfs_arc_dnode_limit_percent                                   10
        zfs_arc_dnode_reduce_percent                                  10
        zfs_arc_grow_retry                                             0
        zfs_arc_lotsfree_percent                                      10
        zfs_arc_max                                         485206474752
        zfs_arc_meta_adjust_restarts                                4096
        zfs_arc_meta_limit                                             0
        zfs_arc_meta_limit_percent                                    75
        zfs_arc_meta_min                                               0
        zfs_arc_meta_prune                                         10000
        zfs_arc_meta_strategy                                          1
        zfs_arc_min                                         269559152640
        zfs_arc_min_prefetch_ms                                        0
        zfs_arc_min_prescient_prefetch_ms                              0
        zfs_arc_p_dampener_disable                                     1
        zfs_arc_p_min_shift                                            0
        zfs_arc_pc_percent                                             0
        zfs_arc_shrink_shift                                           0
        zfs_arc_sys_free                                               0
        zfs_async_block_max_blocks                                100000
        zfs_autoimport_disable                                         1
        zfs_checksum_events_per_second                                20
        zfs_commit_timeout_pct                                         5
        zfs_compressed_arc_enabled                                     1
        zfs_condense_indirect_commit_entry_delay_ms                    0
        zfs_condense_indirect_vdevs_enable                             1
        zfs_condense_max_obsolete_bytes                       1073741824
        zfs_condense_min_mapping_bytes                            131072
        zfs_dbgmsg_enable                                              1
        zfs_dbgmsg_maxsize                                       4194304
        zfs_dbuf_state_index                                           0
        zfs_ddt_data_is_special                                        1
        zfs_deadman_checktime_ms                                   60000
        zfs_deadman_enabled                                            1
        zfs_deadman_failmode                                    continue
        zfs_deadman_synctime_ms                                   600000
        zfs_deadman_ziotime_ms                                    300000
        zfs_dedup_prefetch                                             0
        zfs_delay_min_dirty_percent                                   60
        zfs_delay_scale                                           500000
        zfs_delete_blocks                                          20480
        zfs_dirty_data_max                                    4294967296
        zfs_dirty_data_max_max                                4294967296
        zfs_dirty_data_max_max_percent                                25
        zfs_dirty_data_max_percent                                    10
        zfs_dirty_data_sync_percent                                   20
        zfs_disable_ivset_guid_check                                   0
        zfs_dmu_offset_next_sync                                       0
        zfs_expire_snapshot                                          300
        zfs_flags                                                      0
        zfs_free_bpobj_enabled                                         1
        zfs_free_leak_on_eio                                           0
        zfs_free_min_time_ms                                        1000
        zfs_immediate_write_sz                                     32768
        zfs_initialize_value                        16045690984833335022
        zfs_key_max_salt_uses                                  400000000
        zfs_lua_max_instrlimit                                 100000000
        zfs_lua_max_memlimit                                   104857600
        zfs_max_missing_tvds                                           0
        zfs_max_recordsize                                       1048576
        zfs_metaslab_fragmentation_threshold                          70
        zfs_metaslab_segment_weight_enabled                            1
        zfs_metaslab_switch_threshold                                  2
        zfs_mg_fragmentation_threshold                                85
        zfs_mg_noalloc_threshold                                       0
        zfs_multihost_fail_intervals                                  10
        zfs_multihost_history                                          0
        zfs_multihost_import_intervals                                20
        zfs_multihost_interval                                      2000
        zfs_multilist_num_sublists                                     0
        zfs_no_scrub_io                                                0
        zfs_no_scrub_prefetch                                          0
        zfs_nocacheflush                                               0
        zfs_nopwrite_enabled                                           1
        zfs_object_mutex_size                                         64
        zfs_obsolete_min_time_ms                                     500
        zfs_override_estimate_recordsize                               0
        zfs_pd_bytes_max                                        52428800
        zfs_per_txg_dirty_frees_percent                                5
        zfs_prefetch_disable                                           1
        zfs_read_chunk_size                                      1048576
        zfs_read_history                                               0
        zfs_read_history_hits                                          0
        zfs_reconstruct_indirect_combinations_max                   4096
        zfs_recover                                                    0
        zfs_recv_queue_length                                   16777216
        zfs_removal_ignore_errors                                      0
        zfs_removal_suspend_progress                                   0
        zfs_remove_max_segment                                  16777216
        zfs_resilver_disable_defer                                     0
        zfs_resilver_min_time_ms                                    3000
        zfs_scan_checkpoint_intval                                  7200
        zfs_scan_fill_weight                                           3
        zfs_scan_ignore_errors                                         0
        zfs_scan_issue_strategy                                        0
        zfs_scan_legacy                                                0
        zfs_scan_max_ext_gap                                     2097152
        zfs_scan_mem_lim_fact                                         20
        zfs_scan_mem_lim_soft_fact                                    20
        zfs_scan_strict_mem_lim                                        0
        zfs_scan_suspend_progress                                      0
        zfs_scan_vdev_limit                                      4194304
        zfs_scrub_min_time_ms                                       1000
        zfs_send_corrupt_data                                          0
        zfs_send_queue_length                                   16777216
        zfs_slow_io_events_per_second                                 20
        zfs_spa_discard_memory_limit                            16777216
        zfs_sync_pass_deferred_free                                    2
        zfs_sync_pass_dont_compress                                    5
        zfs_sync_pass_rewrite                                          2
        zfs_sync_taskq_batch_pct                                      75
        zfs_txg_history                                              100
        zfs_txg_timeout                                                5
        zfs_unlink_suspend_progress                                    0
        zfs_user_indirect_is_special                                   1
        zfs_vdev_aggregation_limit                              16777216
        zfs_vdev_aggregation_limit_non_rotating                   131072
        zfs_vdev_async_read_max_active                                 3
        zfs_vdev_async_read_min_active                                 1
        zfs_vdev_async_write_active_max_dirty_percent                 60
        zfs_vdev_async_write_active_min_dirty_percent                 30
        zfs_vdev_async_write_max_active                               10
        zfs_vdev_async_write_min_active                                2
        zfs_vdev_cache_bshift                                         16
        zfs_vdev_cache_max                                         16384
        zfs_vdev_cache_size                                            0
        zfs_vdev_default_ms_count                                    200
        zfs_vdev_initializing_max_active                               1
        zfs_vdev_initializing_min_active                               1
        zfs_vdev_max_active                                         1000
        zfs_vdev_min_ms_count                                         16
        zfs_vdev_mirror_non_rotating_inc                               0
        zfs_vdev_mirror_non_rotating_seek_inc                          1
        zfs_vdev_mirror_rotating_inc                                   0
        zfs_vdev_mirror_rotating_seek_inc                              5
        zfs_vdev_mirror_rotating_seek_offset                     1048576
        zfs_vdev_ms_count_limit                                   131072
        zfs_vdev_queue_depth_pct                                    1000
        zfs_vdev_raidz_impl[fastest] original scalar sse2 ssse3 avx2 avx512f avx512bw
        zfs_vdev_read_gap_limit                                    32768
        zfs_vdev_removal_max_active                                    2
        zfs_vdev_removal_min_active                                    1
        zfs_vdev_scheduler                                      deadline
        zfs_vdev_scrub_max_active                                      2
        zfs_vdev_scrub_min_active                                      1
        zfs_vdev_sync_read_max_active                                 10
        zfs_vdev_sync_read_min_active                                 10
        zfs_vdev_sync_write_max_active                                10
        zfs_vdev_sync_write_min_active                                10
        zfs_vdev_write_gap_limit                                    4096
        zfs_zevent_cols                                               80
        zfs_zevent_console                                             0
        zfs_zevent_len_max                                          1152
        zfs_zil_clean_taskq_maxalloc                             1048576
        zfs_zil_clean_taskq_minalloc                                1024
        zfs_zil_clean_taskq_nthr_pct                                 100
        zil_nocacheflush                                               0
        zil_replay_disable                                             0
        zil_slog_bulk                                             786432
        zio_deadman_log_all                                            1
        zio_dva_throttle_enabled                                       1
        zio_requeue_io_start_cut_in_line                               1
        zio_slow_io_ms                                             30000
        zio_taskq_batch_pct                                           75
        zvol_inhibit_dev                                               0
        zvol_major                                                   230
        zvol_max_discard_blocks                                    16384
        zvol_prefetch_bytes                                       131072
        zvol_request_sync                                              0
        zvol_threads                                                  32
        zvol_volmode                                                   1

VDEV cache disabled, skipping section

ZIL committed transactions:                                            0
        Commit requests:                                               0
        Flushes to stable storage:                                     0
        Transactions to SLOG storage pool:            0 Bytes          0
        Transactions to non-SLOG storage pool:        0 Bytes          0
dweeezil commented 5 years ago

@gerardba Please post the contents of /proc/spl/kstat/zfs/dbgmsg, which will show a more detailed dump of the entire zio tree. The lines of interest begin with "slow zio[...". Also please check the output of "ps ax" to see whether any ZFS-related tasks are stuck in "R" state. Finally, post the contents of /proc/spl/taskq if it is not empty.
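
For reference, a minimal sketch of how that data could be collected (the paths are the standard SPL/ZFS procfs locations mentioned above; the grep pattern and thread-name filter are illustrative, not exhaustive):

    # Dump the ZFS debug log and pull out the slow-zio entries
    cat /proc/spl/kstat/zfs/dbgmsg > dbgmsg.txt
    grep 'slow zio' dbgmsg.txt

    # Show any pending taskq entries (this file may be empty)
    cat /proc/spl/taskq

    # List ZFS-related kernel threads and their scheduler state
    ps axo pid,stat,comm | grep -E 'txg_|zvol|z_[a-z]+_|spl_|arc_'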

gerardba commented 5 years ago

The only thing I can confirm at this time is that there were no ZFS-related tasks stuck in "R" state; the only process in "R" state on the system was [rcu_sched]. I had already rebooted by the time I saw your message. I've restarted the workload and will provide that data the next time it gets stuck.
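
(As a side note, a quick sketch for enumerating everything in "R" or "D" state, using standard procps columns; the wchan field shows where a blocked task is sleeping in the kernel:)

    # Show all tasks currently running (R) or in uninterruptible sleep (D)
    ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^[RD]/'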

gerardba commented 5 years ago

This issue happened again and this time I could collect all the data, @dweeezil. I'm attaching a compressed text file with the output of the following commands:

Running (0) - cat /sys/module/zfs/version
Running (1) - dstat 1 2
Running (2) - free
Running (3) - uptime
Running (4) - uname -a
Running (5) - cat /etc/issue
Running (6) - zpool status
Running (7) - zfs list
Running (10) - cat /proc/spl/kstat/zfs/dbgmsg
Running (11) - cat /proc/spl/taskq
Running (12) - ps faux
Running (13) - iostat -mx 1 2

debuginfo.txt.gz

Hopefully this helps find the culprit. For now I've updated the system to 0.8.1, so if this happens again it will be on a newer version of ZFS.
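
(Not part of the original report, but a sketch of how such a debug bundle might be assembled in one pass; the command list and output filename are illustrative:)

    #!/bin/sh
    # Collect the diagnostics into one compressed file for attaching to the issue
    out=debuginfo.txt
    for cmd in \
        'cat /sys/module/zfs/version' \
        'zpool status' \
        'zfs list' \
        'cat /proc/spl/kstat/zfs/dbgmsg' \
        'cat /proc/spl/taskq' \
        'ps faux' \
        'iostat -mx 1 2'
    do
        echo "== $cmd ==" >> "$out"
        sh -c "$cmd" >> "$out" 2>&1
    done
    gzip -f "$out"    # produces debuginfo.txt.gz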

swirly commented 4 years ago

I think we are hit by the same kind of problem. Brand new Dell R740xd server with an H740P PERC card in eHBA mode, 10 disks on ZFS. It hangs sometimes: IO stalls appear in the virtual machines, along with a message about a zvol task being stuck for more than 120s (kernel tainted).

We thought it was a zfsutils version problem, but we are on the latest version (0.8.1).

Updated our Proxmox kernel from 5.0.16 to 5.0.21, hoping it was a driver problem :(

camaer commented 4 years ago

We thought it was a zfsutils version problem, but we are on the latest version (0.8.1).

The latest version on Proxmox is 0.8.2, so I recommend updating.
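
(A quick way to confirm which userland and kernel-module versions are actually in use on a Debian/Proxmox box; package names below are the stock Debian ones and may differ on other distributions:)

    # Kernel module version currently loaded
    cat /sys/module/zfs/version

    # Installed userland packages
    dpkg -l | grep -E 'zfsutils|libzfs|zfs-initramfs'

    # On 0.8.x and later this prints both userland and module versions
    zfs version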

gerardba commented 4 years ago

This issue has not happened again. What I believe made the difference was upgrading the NVMe drives' firmware to the latest available; since then the system has been stable for over 10 months. The Intel devices I am using had a bug, fixed in a firmware update, that made the drive 'eat' some commands and never return a reply for them.
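
(For anyone chasing a similar NVMe firmware issue, a minimal sketch for checking the running firmware revision; it assumes the nvme-cli and smartmontools packages and a /dev/nvme0 device, so adjust for your hardware:)

    # Firmware revision as reported by the controller ("fr" field)
    nvme id-ctrl /dev/nvme0 | grep -i '^fr '

    # Firmware slot log (shows which slot and revision are active)
    nvme fw-log /dev/nvme0

    # smartmontools reports the firmware version as well
    smartctl -i /dev/nvme0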