Closed: Harvie closed this issue 5 years ago.
@Harvie are you able to consistently reproduce this issue or is it intermittent? When it next occurs you're going to want to log in to the system and run dmesg to dump the console logs. If you can attach them to this issue we can take a look and may be able to determine why the zfs recv is hung.
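For reference, a minimal way to capture those logs for attachment, assuming the box is still reachable at that point (file names are illustrative):

dmesg -T > zfs-recv-hang-dmesg.txt             # kernel ring buffer with human-readable timestamps
journalctl -k -b > zfs-recv-hang-journal.txt   # full kernel log for the current boot, on systemd systems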
A similar hang was reported in #8174 which ended up being caused by an issue with jumbo frames. You may want to double check everything is in order with your network.
It's intermittent.
When it next occurs you're going to want to log in to the system and run dmesg to dump the console logs.
I cannot log in to the system, since it's completely frozen. It does not respond to the keyboard and does not ping.
You may want to double check everything is in order with your network.
Are you telling me that when packets get corrupted (or probably just lost) on their way through the network, this can lead to a kernel lockup/panic by design? What? Why? Can't it just gracefully time out or something?
Having exactly the same issue. We just upgraded from Ubuntu 16.04 to 18.04 and the problem started: the host is stable until we do a zfs recv, then the box effectively gets a BSOD; one time we got the same message as above. It is very reproducible - the host won't stay up longer than 10 minutes if we run zfs receive. Everything works fine on the 16.04 nodes we have, and we are not using jumbo frames. We ran a long memtest on the server and it was OK, zpool scrub was fine, and the smartctl long test was OK too. There are no core dumps or anything in the messages, but we do see memory use shooting up prior to the BSOD. Any ideas gratefully received.
I've upgraded to zfs 0.8 from git master and so far no crash. But I can't really say yet; before that it was running with 0.7.13 for a few hours as well...
I've also upgraded a production server, which is under heavier I/O load, to 0.8.0-rc3_126_g5dbf8b4ed and I get a crash every hour or two... similar to what I was getting with 0.7.13...
I've found an enormous number of reported problems about ZFS locking up under I/O load all across the internet over the last 10 years. It seems to affect a wide spectrum of ZFS versions, including old 0.6.x, stable 0.7.x and the 0.8.x RCs. Some of the posts suggested that I should use a kernel with CONFIG_PREEMPT_VOLUNTARY=y, but I've checked and I do use it.
It's strange that this happens suddenly, without any performance degradation. Everything runs smoothly and then it suddenly goes into a kernel lockup and panic because of ZFS. I have 70G of RAM, use a 32G zfs_arc_max, and have 24 CPU threads.
The more I/O load there is, the more likely it is to crash (during peak hours the server crashes more frequently, during the night only once or twice). Unfortunately the error messages are different each time.
I have been pulling my hair out over this one for months...
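For context, an ARC cap like the 32G mentioned above is normally applied through the zfs_arc_max module parameter; a minimal sketch, assuming the value is set via a modprobe.d file (path is illustrative, the value is in bytes):

# /etc/modprobe.d/zfs.conf -- cap the ARC at 32 GiB (34359738368 bytes)
options zfs zfs_arc_max=34359738368

# or change it on a running system without a reboot
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max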
Another crash:
[21087.917164] INFO: rcu_sched self-detected stall on CPU
[21087.917167] INFO: rcu_sched self-detected stall on CPU
[21087.917177] 11-....: (14999 ticks this GP) idle=72a/140000000000001/0 softirq=1937756/1937756 fqs=7429
[21087.917182] 16-....: (14999 ticks this GP) idle=662/140000000000001/0 softirq=1793101/1793101 fqs=7429
[21087.917183]
[21087.917185] (t=15000 jiffies g=1338404 c=1338403 q=333215)
[21087.917191] (t=15000 jiffies g=1338404 c=1338403 q=333215)
[21087.917194] Sending NMI from CPU 16 to CPUs 2:
[21087.917201] NMI backtrace for cpu 2
[21087.917204] CPU: 2 PID: 808997 Comm: fail2ban-server Tainted: P IO 4.15.18-11-pve #1
[21087.917205] Hardware name: HP ProLiant DL160 G6 , BIOS O33 07/01/2013
[21087.917211] RIP: 0010:smp_call_function_many+0x1f9/0x260
[21087.917211] RSP: 0000:ffffa1a10b117b18 EFLAGS: 00000202
[21087.917213] RAX: 0000000000000009 RBX: ffff8bbd1baa3900 RCX: ffff8bc61b2e89c0
[21087.917214] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8bbd1b444828
[21087.917215] RBP: ffffa1a10b117b50 R08: fffffffffffffe00 R09: 0000000000fffffb
[21087.917215] R10: ffffedf71b4c5440 R11: ffff8bbb3d049ff4 R12: ffffffffb767b810
[21087.917216] R13: ffffa1a10b117b88 R14: 0000000000000001 R15: 0000000000000018
[21087.917218] FS: 00007f31d80d7700(0000) GS:ffff8bbd1ba80000(0000) knlGS:0000000000000000
[21087.917218] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21087.917219] CR2: 00007f31d6d52d34 CR3: 0000000f1d5a6006 CR4: 00000000000206e0
[21087.917220] Call Trace:
[21087.917225] ? x86_configure_nx+0x50/0x50
[21087.917227] on_each_cpu+0x2d/0x60
[21087.917229] flush_tlb_kernel_range+0x79/0x80
[21087.917231] unmap_kernel_range+0x26/0x30
[21087.917235] zs_unmap_object+0x9d/0xd0
[21087.917240] zram_bvec_rw.constprop.32+0x4f0/0x930 [zram]
[21087.917243] zram_rw_page+0x9e/0xf3 [zram]
[21087.917245] bdev_read_page+0x8f/0xc0
[21087.917247] swap_readpage+0x54/0x1e0
[21087.917249] do_swap_page+0x599/0x960
[21087.917250] ? do_wp_page+0x10b/0x4f0
[21087.917251] __handle_mm_fault+0x847/0x11c0
[21087.917253] handle_mm_fault+0xce/0x1b0
[21087.917256] __do_page_fault+0x25e/0x500
[21087.917258] ? syscall_trace_enter+0xca/0x2e0
[21087.917260] do_page_fault+0x2e/0xe0
[21087.917263] ? page_fault+0x2f/0x50
[21087.917264] page_fault+0x45/0x50
[21087.917266] RIP: 0033:0x55af839a005d
[21087.917266] RSP: 002b:00007ffef2cb2d40 EFLAGS: 00010202
[21087.917268] RAX: 000055af83ebfd50 RBX: 00007f31d6d1e2f0 RCX: 00007f31d434c0d0
[21087.917268] RDX: 0000000000000000 RSI: 00007f31d6d256c8 RDI: 00007f31d6d52d20
[21087.917269] RBP: 00007f31d434c0b8 R08: 0000000000000001 R09: 0000000000000000
[21087.917270] R10: 000055af83ea3ca0 R11: 000055af83ea3c40 R12: 000055af84a29600
[21087.917271] R13: 0000000000000001 R14: 0000000000000000 R15: 00007f31d6d1e2f0
[21087.917272] Code: 89 c7 e8 fb 6e 88 00 3b 05 49 f1 53 01 0f 83 8f fe ff ff 48 63 d0 48 8b 0b 48 03 0c d5 c0 86 7a b8 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4d cc 4c 89 ea 4c 89 e6 44
[21087.918197] Sending NMI from CPU 16 to CPUs 9:
[21087.918203] NMI backtrace for cpu 9
[21087.918207] CPU: 9 PID: 807556 Comm: opendkim Tainted: P IO 4.15.18-11-pve #1
[21087.918208] Hardware name: HP ProLiant DL160 G6 , BIOS O33 07/01/2013
[21087.918215] RIP: 0010:zram_slot_free_notify+0x54/0x70 [zram]
[21087.918216] RSP: 0000:ffffa1a109503c58 EFLAGS: 00000006
[21087.918218] RAX: ffffa1a13efc1cb0 RBX: 0000000006758cb0 RCX: ffffa1a13efc1cb8
[21087.918219] RDX: 0000000002000438 RSI: 00000000006758cb RDI: ffff8bbcf393b740
[21087.918220] RBP: ffffa1a109503c68 R08: ffff8bbc7128a480 R09: 0000000000000000
[21087.918221] R10: 0000000000000040 R11: ffff8bbc7128a4a8 R12: ffff8bbcf43a5a00
[21087.918222] R13: ffffffffc04aea70 R14: 00000000006758cb R15: ffff8bbcea040000
[21087.918224] FS: 00007fc510581700(0000) GS:ffff8bc61b2c0000(0000) knlGS:0000000000000000
[21087.918226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21087.918227] CR2: 00007fc51783f205 CR3: 000000113aed4006 CR4: 00000000000206e0
[21087.918228] Call Trace:
[21087.918235] swap_range_free+0xa1/0xf0
[21087.918238] swapcache_free_entries+0xf7/0x1f0
[21087.918243] free_swap_slot+0xd6/0xf0
[21087.918245] swap_free+0x36/0x40
[21087.918247] do_swap_page+0x71e/0x960
[21087.918250] __handle_mm_fault+0x847/0x11c0
[21087.918253] ? do_futex+0x2db/0xb70
[21087.918256] handle_mm_fault+0xce/0x1b0
[21087.918259] __do_page_fault+0x25e/0x500
[21087.918261] ? SyS_futex+0x83/0x180
[21087.918263] do_page_fault+0x2e/0xe0
[21087.918266] ? page_fault+0x2f/0x50
[21087.918268] page_fault+0x45/0x50
[21087.918270] RIP: 0033:0x7fc5176343a3
[21087.918271] RSP: 002b:00007fc510580aa0 EFLAGS: 00010246
[21087.918272] RAX: 00007fc51783eef0 RBX: 00007fc51783eef0 RCX: 0000000000000073
[21087.918273] RDX: 0000000000000000 RSI: 00007fc51783ee88 RDI: 00007fc514b7d268
[21087.918275] RBP: 0000000000000004 R08: 00007fc510580bf0 R09: 00007fc517850428
[21087.918276] R10: 00007fc51783b890 R11: 0000000000000206 R12: 0000000000000023
[21087.918277] R13: 0000000000000000 R14: 00007fc517835c10 R15: 0000000000000000
[21087.918278] Code: ba 68 08 19 72 1f 4c 89 e7 e8 f9 f4 ff ff 49 03 1c 24 f0 80 63 0b fd f0 49 ff 84 24 80 00 00 00 5b 41 5c 5d c3 48 8d 48 08 f3 90 <48> 8b 50 08 f7 c2 00 00 00 02 75 f2 f0 0f ba 29 19 73 c8 eb e9
[21087.919198] Sending NMI from CPU 16 to CPUs 11:
[21087.919205] NMI backtrace for cpu 11
[21087.919208] CPU: 11 PID: 807555 Comm: opendkim Tainted: P IO 4.15.18-11-pve #1
[21087.919209] Hardware name: HP ProLiant DL160 G6 , BIOS O33 07/01/2013
[21087.919214] RIP: 0010:native_queued_spin_lock_slowpath+0x177/0x1a0
[21087.919215] RSP: 0000:ffff8bc61b343df8 EFLAGS: 00000002
[21087.919217] RAX: 0000000000200101 RBX: 0000000000000083 RCX: 0000000000000001
[21087.919218] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffffffb8aa9400
[21087.919219] RBP: ffff8bc61b343df8 R08: 0000000000000101 R09: 00000000000030b8
[21087.919220] R10: 0140000000000001 R11: 00000000000030b8 R12: ffffffffb8c5fee0
[21087.919222] R13: ffffffffb8aa9240 R14: 0000000000000000 R15: ffffffffb8c5fee0
[21087.919223] FS: 00007fc510d82700(0000) GS:ffff8bc61b340000(0000) knlGS:0000000000000000
[21087.919225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21087.919226] CR2: 00007fc517835c10 CR3: 000000113aed4004 CR4: 00000000000206e0
[21087.919227] Call Trace:
[21087.919228] <IRQ>
[21087.919231] _raw_spin_lock_irqsave+0x37/0x40
[21087.919235] rcu_dump_cpu_stacks+0x50/0xd6
[21087.919237] rcu_check_callbacks+0x7c4/0x920
[21087.919239] ? tick_sched_do_timer+0x50/0x50
[21087.919243] update_process_times+0x2f/0x60
[21087.919245] tick_sched_handle+0x2a/0x60
[21087.919247] tick_sched_timer+0x39/0x80
[21087.919248] __hrtimer_run_queues+0xe7/0x220
[21087.919250] hrtimer_interrupt+0xa3/0x1e0
[21087.919253] smp_apic_timer_interrupt+0x6f/0x130
[21087.919255] apic_timer_interrupt+0x84/0x90
[21087.919255] </IRQ>
[21087.919258] RIP: 0010:smp_call_function_many+0x1f9/0x260
[21087.919259] RSP: 0000:ffffa1a10844fb18 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[21087.919260] RAX: 0000000000000009 RBX: ffff8bc61b363900 RCX: ffff8bc61b2e8d60
[21087.919261] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8bc61ac01cc0
[21087.919262] RBP: ffffa1a10844fb50 R08: fffffffffffffe00 R09: 0000000000fff7ff
[21087.919264] R10: ffffedf71b276780 R11: ffff8bc5af8dbff4 R12: ffffffffb767b810
[21087.919265] R13: ffffa1a10844fb88 R14: 0000000000000001 R15: 0000000000000018
[21087.919267] ? x86_configure_nx+0x50/0x50
[21087.919270] ? smp_call_function_many+0x1d5/0x260
[21087.919272] ? x86_configure_nx+0x50/0x50
[21087.919274] on_each_cpu+0x2d/0x60
[21087.919276] flush_tlb_kernel_range+0x79/0x80
[21087.919278] unmap_kernel_range+0x26/0x30
[21087.919282] zs_unmap_object+0x9d/0xd0
[21087.919286] zram_bvec_rw.constprop.32+0x4f0/0x930 [zram]
[21087.919290] zram_rw_page+0x9e/0xf3 [zram]
[21087.919292] bdev_read_page+0x8f/0xc0
[21087.919294] swap_readpage+0x54/0x1e0
[21087.919296] do_swap_page+0x599/0x960
[21087.919298] __handle_mm_fault+0x847/0x11c0
[21087.919300] ? do_futex+0x934/0xb70
[21087.919303] handle_mm_fault+0xce/0x1b0
[21087.919305] __do_page_fault+0x25e/0x500
[21087.919307] ? SyS_futex+0x83/0x180
[21087.919309] do_page_fault+0x2e/0xe0
[21087.919311] ? page_fault+0x2f/0x50
[21087.919313] page_fault+0x45/0x50
[21087.919314] RIP: 0033:0x7fc517634380
[21087.919315] RSP: 002b:00007fc510d81aa0 EFLAGS: 00010246
[21087.919317] RAX: 00007fc510d81b18 RBX: 00007fc51783bbe8 RCX: 00007fc514b7a218
[21087.919318] RDX: 00007fc510d81be0 RSI: 0000000031125a5c RDI: 00007fc514b7c0b9
[21087.919319] RBP: 0000000000000000 R08: 00007fc510d81bf0 R09: 00007fc517850428
[21087.919320] R10: 00007fc51783b890 R11: 0000000000000283 R12: 0000000000000023
[21087.919321] R13: 0000000000000000 R14: 00007fc517835c10 R15: 0000000000000000
[21087.919323] Code: 31 c0 41 39 c0 74 e6 4d 85 c9 c6 07 01 74 30 41 c7 41 08 01 00 00 00 e9 52 ff ff ff 83 fa 01 0f 84 b0 fe ff ff 8b 07 84 c0 74 08 <f3> 90 8b 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 f3 90 4c
[21087.920200] NMI backtrace for cpu 16
[21087.920206] CPU: 16 PID: 809401 Comm: apache2 Tainted: P IO 4.15.18-11-pve #1
[21087.920207] Hardware name: HP ProLiant DL160 G6 , BIOS O33 07/01/2013
[21087.920208] Call Trace:
[21087.920209] <IRQ>
[21087.920213] dump_stack+0x63/0x8b
[21087.920217] nmi_cpu_backtrace+0xca/0xd0
[21087.920222] ? lapic_can_unplug_cpu+0xb0/0xb0
[21087.920224] nmi_trigger_cpumask_backtrace+0xe6/0x130
[21087.920227] arch_trigger_cpumask_backtrace+0x19/0x20
[21087.920230] rcu_dump_cpu_stacks+0xa3/0xd6
[21087.920233] rcu_check_callbacks+0x7c4/0x920
[21087.920236] ? tick_sched_do_timer+0x50/0x50
[21087.920240] update_process_times+0x2f/0x60
[21087.920242] tick_sched_handle+0x2a/0x60
[21087.920245] tick_sched_timer+0x39/0x80
[21087.920247] __hrtimer_run_queues+0xe7/0x220
[21087.920249] hrtimer_interrupt+0xa3/0x1e0
[21087.920252] smp_apic_timer_interrupt+0x6f/0x130
[21087.920254] apic_timer_interrupt+0x84/0x90
[21087.920256] </IRQ>
[21087.920258] RIP: 0010:smp_call_function_many+0x1f9/0x260
[21087.920260] RSP: 0000:ffffa1a10b3bba58 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[21087.920262] RAX: 0000000000000009 RBX: ffff8bbd1bca3900 RCX: ffff8bc61b2e8e00
[21087.920263] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8bbd1b444c28
[21087.920265] RBP: ffffa1a10b3bba90 R08: fffffffffffffe00 R09: 0000000000feffff
[21087.920266] R10: ffffedf742c3af80 R11: ffff8bbb84d74ff4 R12: ffffffffb767b810
[21087.920267] R13: ffffa1a10b3bbac8 R14: 0000000000000001 R15: 0000000000000018
[21087.920270] ? x86_configure_nx+0x50/0x50
[21087.920274] ? smp_call_function_many+0x1d5/0x260
[21087.920276] ? x86_configure_nx+0x50/0x50
[21087.920278] on_each_cpu+0x2d/0x60
[21087.920281] flush_tlb_kernel_range+0x79/0x80
[21087.920283] unmap_kernel_range+0x26/0x30
[21087.920286] zs_unmap_object+0x9d/0xd0
[21087.920290] zram_bvec_rw.constprop.32+0x4f0/0x930 [zram]
[21087.920294] zram_rw_page+0x9e/0xf3 [zram]
[21087.920296] bdev_read_page+0x8f/0xc0
[21087.920298] swap_readpage+0x54/0x1e0
[21087.920301] swapin_readahead+0x198/0x2b0
[21087.920304] do_swap_page+0x4f9/0x960
[21087.920306] ? do_swap_page+0x4f9/0x960
[21087.920308] ? do_wp_page+0x10b/0x4f0
[21087.920310] __handle_mm_fault+0x847/0x11c0
[21087.920313] handle_mm_fault+0xce/0x1b0
[21087.920316] __do_page_fault+0x25e/0x500
[21087.920319] do_page_fault+0x2e/0xe0
[21087.920322] ? page_fault+0x2f/0x50
[21087.920324] page_fault+0x45/0x50
[21087.920326] RIP: 0033:0x7f1560821bca
[21087.920327] RSP: 002b:00007ffc0e476230 EFLAGS: 00010206
[21087.920329] RAX: 00007f15610f1030 RBX: 00007f15610f50a0 RCX: 0000000000000000
[21087.920330] RDX: 00007f15610bd028 RSI: 0000000000000000 RDI: 0000000000000001
[21087.920332] RBP: 00007f1560f62028 R08: 00007f15610f50a8 R09: 0000000000000000
[21087.920333] R10: 00007f1561063000 R11: 0000000000000000 R12: 0000564ae84540b3
[21087.920335] R13: 00007f15610ef028 R14: 0000000000000000 R15: 00000000fffffffe
[21087.920342] Sending NMI from CPU 11 to CPUs 16:
[21087.920347] NMI backtrace for cpu 16
[21087.920349] CPU: 16 PID: 809401 Comm: apache2 Tainted: P IO 4.15.18-11-pve #1
[21087.920350] Hardware name: HP ProLiant DL160 G6 , BIOS O33 07/01/2013
[21087.920352] RIP: 0010:cpumask_next_wrap+0x47/0x70
[21087.920353] RSP: 0000:ffff8bbd1bc83b88 EFLAGS: 00000016
[21087.920355] RAX: 000000000000000e RBX: 0000000000000001 RCX: 000000000000000e
[21087.920356] RDX: 000000000000000e RSI: 0000000000000000 RDI: ffff8bbd17812b38
[21087.920358] RBP: ffff8bbd1bc83bb0 R08: ffffffffffffc000 R09: 000000000003f03f
[21087.920359] R10: ffff8bbd1bc83b90 R11: 0000000000000092 R12: 0000000000000004
[21087.920360] R13: ffff8bbd17812b38 R14: 00000000ffffffff R15: 000000000000000d
[21087.920361] FS: 00007f15610e4040(0000) GS:ffff8bbd1bc80000(0000) knlGS:0000000000000000
[21087.920363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21087.920364] CR2: 00007f15610bd040 CR3: 0000000740792002 CR4: 00000000000206e0
[21087.920364] Call Trace:
[21087.920365] <IRQ>
[21087.920369] select_idle_sibling+0x35c/0x420
[21087.920372] select_task_rq_fair+0x8c5/0xb20
[21087.920374] ? __module_text_address+0x12/0x60
[21087.920378] ? kernel_text_address+0xdc/0xf0
[21087.920380] ? page_fault+0x45/0x50
[21087.920382] ? __kernel_text_address+0x12/0x40
[21087.920384] ? page_fault+0x45/0x50
[21087.920388] try_to_wake_up+0x10c/0x4c0
[21087.920391] default_wake_function+0x12/0x20
[21087.920394] autoremove_wake_function+0x12/0x40
[21087.920396] __wake_up_common+0x8d/0x140
[21087.920398] __wake_up_common_lock+0x80/0xc0
[21087.920401] __wake_up+0x13/0x20
[21087.920403] wake_up_klogd_work_func+0x40/0x60
[21087.920406] irq_work_run_list+0x55/0x80
[21087.920408] ? tick_sched_do_timer+0x50/0x50
[21087.920409] irq_work_tick+0x40/0x50
[21087.920412] update_process_times+0x42/0x60
[21087.920414] tick_sched_handle+0x2a/0x60
[21087.920416] tick_sched_timer+0x39/0x80
[21087.920418] __hrtimer_run_queues+0xe7/0x220
[21087.920420] hrtimer_interrupt+0xa3/0x1e0
[21087.920422] smp_apic_timer_interrupt+0x6f/0x130
[21087.920424] apic_timer_interrupt+0x84/0x90
[21087.920425] </IRQ>
[21087.920428] RIP: 0010:smp_call_function_many+0x1f9/0x260
[21087.920429] RSP: 0000:ffffa1a10b3bba58 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[21087.920431] RAX: 0000000000000009 RBX: ffff8bbd1bca3900 RCX: ffff8bc61b2e8e00
[21087.920432] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8bbd1b444c28
[21087.920433] RBP: ffffa1a10b3bba90 R08: fffffffffffffe00 R09: 0000000000feffff
[21087.920434] R10: ffffedf742c3af80 R11: ffff8bbb84d74ff4 R12: ffffffffb767b810
[21087.920435] R13: ffffa1a10b3bbac8 R14: 0000000000000001 R15: 0000000000000018
[21087.920438] ? x86_configure_nx+0x50/0x50
[21087.920441] ? smp_call_function_many+0x1d5/0x260
[21087.920442] ? x86_configure_nx+0x50/0x50
[21087.920445] on_each_cpu+0x2d/0x60
[21087.920447] flush_tlb_kernel_range+0x79/0x80
[21087.920449] unmap_kernel_range+0x26/0x30
[21087.920452] zs_unmap_object+0x9d/0xd0
[21087.920455] zram_bvec_rw.constprop.32+0x4f0/0x930 [zram]
[21087.920459] zram_rw_page+0x9e/0xf3 [zram]
[21087.920461] bdev_read_page+0x8f/0xc0
[21087.920464] swap_readpage+0x54/0x1e0
[21087.920466] swapin_readahead+0x198/0x2b0
[21087.920469] do_swap_page+0x4f9/0x960
[21087.920470] ? do_swap_page+0x4f9/0x960
[21087.920472] ? do_wp_page+0x10b/0x4f0
[21087.920474] __handle_mm_fault+0x847/0x11c0
[21087.920477] handle_mm_fault+0xce/0x1b0
[21087.920479] __do_page_fault+0x25e/0x500
[21087.920482] do_page_fault+0x2e/0xe0
[21087.920484] ? page_fault+0x2f/0x50
[21087.920486] page_fault+0x45/0x50
[21087.920488] RIP: 0033:0x7f1560821bca
[21087.920489] RSP: 002b:00007ffc0e476230 EFLAGS: 00010206
[21087.920491] RAX: 00007f15610f1030 RBX: 00007f15610f50a0 RCX: 0000000000000000
[21087.920492] RDX: 00007f15610bd028 RSI: 0000000000000000 RDI: 0000000000000001
[21087.920493] RBP: 00007f1560f62028 R08: 00007f15610f50a8 R09: 0000000000000000
[21087.920494] R10: 00007f1561063000 R11: 0000000000000000 R12: 0000564ae84540b3
[21087.920495] R13: 00007f15610ef028 R14: 0000000000000000 R15: 00000000fffffffe
[21087.920497] Code: cb 00 53 41 89 d4 89 cb eb 12 8b 35 78 81 cb 00 45 89 f7 bb 01 00 00 00 39 c6 77 26 41 8d 57 01 4c 89 ef 48 63 d2 e8 29 4c af ff <45> 39 e7 89 c1 0f 9c c2 84 da 74 d3 41 39 c4 7f ce 8b 0d 46 81
@Harvie thanks for posting the additional console logs. Were you able to log in to the machine after hitting the problem this time with the 0.8 version of ZFS? If so, are these the complete logs starting from the initial failure? I ask because nothing in the most recent logs you posted is related to ZFS. They do indicate that there may be a problem with the kernel-provided zram driver.
OMG. You are right. I forgot to disable zram some time ago. Thanks for pointing this out. I was convinced that zram is part of the ZFS infrastructure (because of the "z" letter :-) and didn't realize it. However, I've experienced crashes on machines not running zram. But this might be a big part of the problem. I will disable zram and do some testing...
Crashed again (0.8.0-rc3_126_g5dbf8b4ed)
I don't think this is caused by a jumbo frame problem, since I am using zfs send/recv over plain old SSH, and I've never had SSH corrupt my data.
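For context, the replication here is the usual send-over-SSH pipeline; a sketch of the kind of command involved, with illustrative snapshot and host names (the recv -F matches the command that shows up later in the dbgmsg output):

# incremental replication of one dataset to the backup host over ssh
zfs send -I tank/vps/subvol-212-disk-1@2019-04-03-114500 \
            tank/vps/subvol-212-disk-1@2019-04-03-120000 \
  | ssh backup-host zfs recv -F tank/backup/pve1/vps/subvol-212-disk-1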
What really drives me nuts is the fact that the machine does not reboot after freezing. I don't know why. I have it set to panic on hung tasks and reboot on panic, but it still does not reboot:
kernel.panic = 5
kernel.hung_task_panic = 1
kernel.hung_task_timeout_secs = 300
kernel.watchdog_thresh = 60
kernel.softlockup_panic = 1
kernel.hardlockup_panic = 1
#kernel.panic_on_warn = 0
#kernel.panic_on_oops = 0
I've just returned from the datacenter and I found it had crashed again...
Because you didn't paste logs from any panic, it is just a blocked task; not much we can do without any logs...
Well, the photo is all I have, because the machine stops responding completely. No keyboard, no networking... I would happily provide you with some logs, but I am not very experienced with retrieving logs from crashed systems, so for now the photos are all I can get. Is there some easy way to get the logs?
What will happen if I set kernel.hung_task_timeout_secs = 0? I guess the hung zfs recv tasks will accumulate until something goes wrong. Also, I will probably lose the ability to auto-reboot in case some non-ZFS task gets stuck...
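One option for getting kernel messages off a machine that freezes this hard is netconsole, which streams printk output over UDP to a second box as it is produced; a rough sketch, with illustrative addresses, interface and port (the exact netcat flags depend on the netcat implementation):

# on the crashing machine: forward kernel messages out of eth0 to 192.168.1.50:6666
modprobe netconsole netconsole=@/eth0,6666@192.168.1.50/
dmesg -n 8        # raise the console log level so everything is forwarded

# on the receiving machine: capture the stream to a file
nc -u -l -p 6666 | tee netconsole.log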
@Harvie thanks for the updated stack trace, this one definitely shows a panic in the zfs send/recv code which needs to be investigated. You mentioned this was with 0.8.0-rc3_126_g5dbf8b4ed, which is good to know. Also, when building, did you by chance enable debugging by passing --enable-debug to configure? If not, that would be the next step, since it will enable additional correctness checks which may detect the core issue.
You can force the system to panic, rather than hang this one thread, by setting the spl.ko module option spl_panic_halt=1. The default behavior is to suspend the one thread since this normally will allow you to log in and inspect the system, though that seems not to be the case on your system.
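Concretely, the two suggestions above amount to something like this (a sketch only; the source directory is illustrative):

# rebuild with assertions and extra correctness checks enabled
cd /usr/src/zfs
./autogen.sh
./configure --enable-debug
make -j"$(nproc)" && make install

# make a failed assertion panic the box instead of suspending the thread
echo 1 > /sys/module/spl/parameters/spl_panic_halt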
Had two more crashes:
I've found that it does not even allow me to use SysRq when it freezes. However, I managed to get dmesg output before it froze completely:
[ 8.623049] ZFS: Loaded module v0.8.0-rc3_126_g5dbf8b4ed, ZFS pool version 5000, ZFS filesystem version 5
[ 8.647634] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 9.151052] random: crng init done
[ 9.154558] random: 7 urandom warning(s) missed due to ratelimiting
[ 26.683471] zd16: p1 p2
[ 33.544883] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 33.544888] 8021q: adding VLAN 0 to HW filter on device eth0
[ 33.937170] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 34.265830] Bluetooth: Core ver 2.22
[ 34.265861] NET: Registered protocol family 31
[ 34.265861] Bluetooth: HCI device and connection manager initialized
[ 34.265869] Bluetooth: HCI socket layer initialized
[ 34.265873] Bluetooth: L2CAP socket layer initialized
[ 34.265884] Bluetooth: SCO socket layer initialized
[ 34.269490] Netfilter messages via NETLINK v0.30.
[ 36.396990] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 36.422043] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 54.654126] ip_set: protocol 6
[ 688.535774] perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[ 1076.082743] perf: interrupt took too long (3150 > 3141), lowering kernel.perf_event_max_sample_rate to 63500
[ 1096.157111] VERIFY3(0 == dnode_set_blksz(dn, blocksize, 0, tx)) failed (0 == 95)
[ 1096.172219] PANIC at dnode.c:715:dnode_reallocate()
[ 1096.180809] Showing stack for process 803
[ 1096.188362] CPU: 3 PID: 803 Comm: receive_writer Tainted: P O 4.15.18-12-pve #1
[ 1096.188362] Hardware name: /S3000AH, BIOS S3000.86B.02.00.0044.071120071047 07/11/2007
[ 1096.188364] Call Trace:
[ 1096.188377] dump_stack+0x63/0x8b
[ 1096.188402] spl_dumpstack+0x29/0x30 [spl]
[ 1096.188407] spl_panic+0xc8/0x110 [spl]
[ 1096.188410] ? mutex_lock+0x12/0x40
[ 1096.188412] ? _cond_resched+0x1a/0x50
[ 1096.188621] ? zfs_space_delta_cb+0xeb/0x230 [zfs]
[ 1096.188624] ? __kmalloc_node+0x206/0x2b0
[ 1096.188626] ? _cond_resched+0x1a/0x50
[ 1096.188627] ? mutex_lock+0x12/0x40
[ 1096.188665] ? dmu_objset_userquota_get_ids+0x26a/0x420 [zfs]
[ 1096.188667] ? _cond_resched+0x1a/0x50
[ 1096.188668] ? _cond_resched+0x1a/0x50
[ 1096.188669] ? mutex_lock+0x12/0x40
[ 1096.188710] ? dnode_set_blksz+0xb7/0x310 [zfs]
[ 1096.188714] ? spl_kmem_free+0x33/0x40 [spl]
[ 1096.188755] dnode_reallocate+0x24c/0x270 [zfs]
[ 1096.188793] dmu_object_reclaim_dnsize+0xca/0xf0 [zfs]
[ 1096.188831] receive_object+0x450/0x9f0 [zfs]
[ 1096.188836] ? __wake_up_common_lock+0x8e/0xc0
[ 1096.188838] ? __wake_up+0x13/0x20
[ 1096.188842] ? __cv_signal+0x2d/0x40 [spl]
[ 1096.188881] receive_writer_thread+0x420/0xaa0 [zfs]
[ 1096.188884] ? set_curr_task_fair+0x2b/0x60
[ 1096.188889] ? spl_kmem_free+0x33/0x40 [spl]
[ 1096.188928] ? receive_free.isra.9+0xc0/0xc0 [zfs]
[ 1096.188934] thread_generic_wrapper+0x74/0x90 [spl]
[ 1096.188973] ? receive_free.isra.9+0xc0/0xc0 [zfs]
[ 1096.188978] ? thread_generic_wrapper+0x74/0x90 [spl]
[ 1096.188982] kthread+0x105/0x140
[ 1096.188988] ? __thread_exit+0x20/0x20 [spl]
[ 1096.188989] ? kthread_create_worker_on_cpu+0x70/0x70
[ 1096.188992] ret_from_fork+0x35/0x40
[ 1278.557483] perf: interrupt took too long (3945 > 3937), lowering kernel.perf_event_max_sample_rate to 50500
Did you by chance enable debugging by passing --enable-debug to configure?
Unfortunately not, but these two lines seem to be quite relevant:
[ 1096.157111] VERIFY3(0 == dnode_set_blksz(dn, blocksize, 0, tx)) failed (0 == 95)
[ 1096.172219] PANIC at dnode.c:715:dnode_reallocate()
ps aux shows the zfs recv process in Ds state (uninterruptible sleep, usually IO).
I think that in all such cases the ZFS module should abort the operation and zfs recv should return a non-zero exit code.
I have a bit more insight into what happens. I've set kernel.hung_task_timeout_secs = 0, so it did not reboot after the task hung. But when I later tried to reboot the machine manually, it froze completely instead of rebooting. So there's some secondary bug causing the kernel to freeze completely if I try to reboot after this happens. Unfortunately this also defeats the automatic panic reboot provided by the combination of kernel.hung_task_panic, kernel.hung_task_timeout_secs and kernel.panic, which means the machine will not recover without manual intervention.
When it fails to reboot itself, it repeatedly writes the following to dmesg instead of rebooting:
...
[42587.314026] systemd-journald[282]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[42707.315498] systemd-journald[282]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[42767.316223] systemd-journald[282]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[42887.317665] systemd-journald[282]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[42947.318420] systemd-journald[282]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[43067.319910] systemd-journald[282]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
spl_panic_halt=1 seems to at least cause the machine to reboot on a crash. That seems to be a good idea on production servers...
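To make that behaviour survive reboots, the module option can be set persistently; a minimal sketch (file name is illustrative):

# /etc/modprobe.d/spl.conf -- panic (and, with kernel.panic set, reboot) on an SPL assertion
options spl spl_panic_halt=1
# on Debian/Proxmox the module may be loaded from the initramfs, so this may also be needed:
update-initramfs -u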
When I enable the panic halt and keep dmesg -w running over SSH, it at least manages to give me this before the server reboots:
[ 4778.548762] VERIFY3(0 == dnode_set_blksz(dn, blocksize, 0, tx)) failed (0 == 95)
[ 4778.565400] PANIC at dnode.c:715:dnode_reallocate()
[ 4778.574797] Kernel panic - not syncing: VERIFY3(0 == dnode_set_blksz(dn, blocksize, 0, tx)) failed (0 == 95)
I've noticed this problem only happens when replicating one of my 46 datasets, in the middle of the replication. It's always the same dataset. Unfortunately I can't share it, since it contains a customer's data. I've tried to run a scrub in order to fix it, but it didn't help...
I guess it might be possible to fix the issue by recreating the dataset on the sender side and rsyncing the data to the new one. But that way we won't be able to debug the receiver issue.
When I think about it... the dataset which is causing the crash is different from the others because I've recently been increasing the quota on it (after upgrading to 0.8.0), while on the others I have not changed anything since the upgrade...
Are you able to get any output after the crash? If so, you can run these commands before doing the replication:
echo 4294966270 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 4294966270 > /sys/module/zfs/parameters/zfs_flags
and then after the panic run:
cat /proc/spl/kstat/zfs/dbgmsg
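A small wrapper along those lines helps make sure the buffer is captured right away; a sketch (the output path is illustrative, and it assumes writing 0 to the dbgmsg kstat clears the buffer):

# enable verbose internal logging and start from an empty buffer
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 4294966270 > /sys/module/zfs/parameters/zfs_flags
echo 0 > /proc/spl/kstat/zfs/dbgmsg

# ... reproduce the zfs recv crash here ...

# save the debug log as soon as possible after the panic
cat /proc/spl/kstat/zfs/dbgmsg > /root/zfs-dbgmsg-$(date +%s).txt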
I have exactly the same problem under Gentoo with the latest ZoL git version: the assert at dnode.c line 715 ("dnode_set_blksz") triggers a panic. I get this when doing a recursive zfs send of the pool - ALWAYS on the same dataset. Scrub doesn't show any errors. As the code says, this is only successful if the dnode doesn't have blocks allocated or dirty blocks beyond the first block. But the error handling is not very graceful.
echo 0 > /sys/module/spl/parameters/spl_panic_halt
echo 4294966270 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 4294966270 > /sys/module/zfs/parameters/zfs_flags
These are just some of the errors from dbgmsg. However, they were not captured while receiving the dataset that crashes.
1554246004 zap_micro.c:985:zap_lookup_impl(): error 2
1554246004 dsl_prop.c:147:dsl_prop_get_dd(): error 2
1554246004 zap_micro.c:985:zap_lookup_impl(): error 2
1554246004 zap_micro.c:985:zap_lookup_impl(): error 2
1554246004 zap_micro.c:985:zap_lookup_impl(): error 2
1554246004 dsl_prop.c:147:dsl_prop_get_dd(): error 2
1554246004 spa_history.c:314:spa_history_log_sync(): txg 297878 receive tank/backup/pve1/%recv (id 266085)
1554246004 zap.c:740:fzap_checksize(): error 22
1554246004 zfeature.c:239:feature_get_refcount(): error 95
1554246004 zfeature.c:239:feature_get_refcount(): error 95
1554246068 zap_micro.c:988:zap_lookup_impl(): error 75
1554246159 dmu.c:453:dmu_spill_hold_existing(): error 2
1554246159 sa.c:370:sa_attr_op(): error 2
1554246159 dbuf.c:2702:dbuf_findbp(): error 2
1554246159 dnode.c:1423:dnode_hold_impl(): error 2
1554246159 dbuf.c:3159:dbuf_hold_impl_arg(): error 2
I've randomly picked lines from the log so they don't all repeat the same errors. I can't do a proper debug of the crashing dataset right now...
@Harvie can you repeat the test and share the entire file after the crash? Most errors reported via this mechanism are normal and expected, by the way.
@tcaputi These are the lines that showed up moments before and during the crash:
It crashes on tank/backup/pve1/vps/subvol-212-disk-1 (full log here: dbgmsg.txt):
Also note there was a scrub running at the time (but the crash is reproducible whether it runs or not)...
1554285854 spa_history.c:314:spa_history_log_sync(): txg 315926 destroy tank/backup/pve1/vps/subvol-210-disk-1/%recv (id 332567)
1554285854 bptree.c:232:bptree_iterate(): bptree index 0: traversing from min_txg=315916 bookmark 0/0/0/0
1554285854 dsl_scan.c:3281:dsl_process_async_destroys(): freed 12 blocks in 1ms from free_bpobj/bptree txg 315926; err=0
1554285855 dsl_scan.c:3612:dsl_scan_sync(): scan issued 740 blocks (1 segs) in 1041ms (avg_block_size = 126899, avg_seg_size = 95583744)
1554285857 dsl_scan.c:3612:dsl_scan_sync(): scan issued 754 blocks (1 segs) in 1044ms (avg_block_size = 123324, avg_seg_size = 93637120)
1554285858 spa_history.c:314:spa_history_log_sync(): txg 315928 receive tank/backup/pve1/vps/subvol-210-disk-1/%recv (id 332435)
1554285859 dsl_scan.c:3612:dsl_scan_sync(): scan issued 761 blocks (1 segs) in 1046ms (avg_block_size = 129905, avg_seg_size = 102903296)
1554285859 metaslab.c:2642:metaslab_condense(): condensing: txg 315928, msp[83] 000000007099e18c, vdev id 0, spa tank, smp size 20312, segments 758, forcing condense=FALSE
1554285861 dsl_scan.c:3612:dsl_scan_sync(): scan issued 6105 blocks (1 segs) in 1034ms (avg_block_size = 14670, avg_seg_size = 90400256)
1554285863 dsl_scan.c:3612:dsl_scan_sync(): scan issued 757 blocks (1 segs) in 1040ms (avg_block_size = 120745, avg_seg_size = 91611136)
1554285864 dsl_scan.c:3612:dsl_scan_sync(): scan issued 767 blocks (1 segs) in 1044ms (avg_block_size = 130029, avg_seg_size = 103000064)
1554285866 dsl_scan.c:3612:dsl_scan_sync(): scan issued 4145 blocks (1 segs) in 1040ms (avg_block_size = 24107, avg_seg_size = 100842496)
1554285867 dsl_scan.c:3612:dsl_scan_sync(): scan issued 810 blocks (1 segs) in 1044ms (avg_block_size = 123954, avg_seg_size = 101988864)
1554285869 dsl_scan.c:3612:dsl_scan_sync(): scan issued 777 blocks (1 segs) in 1044ms (avg_block_size = 122507, avg_seg_size = 97101312)
1554285869 metaslab.c:2642:metaslab_condense(): condensing: txg 315934, msp[84] 000000006602e64c, vdev id 0, spa tank, smp size 24960, segments 968, forcing condense=FALSE
1554285870 dsl_scan.c:3612:dsl_scan_sync(): scan issued 776 blocks (1 segs) in 1042ms (avg_block_size = 124644, avg_seg_size = 98706944)
1554285871 spa_history.c:314:spa_history_log_sync(): txg 315936 finish receiving tank/backup/pve1/vps/subvol-210-disk-1/%recv (id 332435) snap=2019-04-03-114500
1554285871 spa_history.c:314:spa_history_log_sync(): txg 315936 clone swap tank/backup/pve1/vps/subvol-210-disk-1/%recv (id 332435) parent=subvol-210-disk-1
1554285871 spa_history.c:314:spa_history_log_sync(): txg 315936 snapshot tank/backup/pve1/vps/subvol-210-disk-1@2019-04-03-114500 (id 332571)
1554285871 spa_history.c:314:spa_history_log_sync(): txg 315936 destroy tank/backup/pve1/vps/subvol-210-disk-1/%recv (id 332435)
1554285871 bptree.c:232:bptree_iterate(): bptree index 0: traversing from min_txg=315926 bookmark 0/0/0/0
1554285871 dsl_scan.c:3281:dsl_process_async_destroys(): freed 12 blocks in 1ms from free_bpobj/bptree txg 315936; err=0
1554285872 dsl_scan.c:3612:dsl_scan_sync(): scan issued 1695 blocks (1 segs) in 1038ms (avg_block_size = 58576, avg_seg_size = 101062656)
1554285874 dsl_scan.c:3612:dsl_scan_sync(): scan issued 797 blocks (1 segs) in 1034ms (avg_block_size = 121509, avg_seg_size = 99873792)
1554285874 spa_history.c:314:spa_history_log_sync(): txg 315938 receive tank/backup/pve1/vps/subvol-210-disk-1/%recv (id 332579)
1554285875 dsl_scan.c:3612:dsl_scan_sync(): scan issued 760 blocks (1 segs) in 1035ms (avg_block_size = 124114, avg_seg_size = 98268672)
1554285877 dsl_scan.c:3612:dsl_scan_sync(): scan issued 553 blocks (1 segs) in 1036ms (avg_block_size = 127394, avg_seg_size = 71434240)
1554285879 dsl_scan.c:3612:dsl_scan_sync(): scan issued 804 blocks (1 segs) in 1042ms (avg_block_size = 123357, avg_seg_size = 101583360)
1554285880 dsl_scan.c:3612:dsl_scan_sync(): scan issued 1628 blocks (1 segs) in 1036ms (avg_block_size = 58890, avg_seg_size = 98772992)
1554285881 dsl_scan.c:3612:dsl_scan_sync(): scan issued 741 blocks (1 segs) in 1039ms (avg_block_size = 125372, avg_seg_size = 95204864)
1554285883 dsl_scan.c:3612:dsl_scan_sync(): scan issued 787 blocks (1 segs) in 1043ms (avg_block_size = 126392, avg_seg_size = 100123136)
1554285885 dsl_scan.c:3612:dsl_scan_sync(): scan issued 1296 blocks (1 segs) in 1039ms (avg_block_size = 73784, avg_seg_size = 98725888)
1554285886 dsl_scan.c:3612:dsl_scan_sync(): scan issued 779 blocks (1 segs) in 1042ms (avg_block_size = 124187, avg_seg_size = 98285568)
1554285887 spa_history.c:314:spa_history_log_sync(): txg 315946 finish receiving tank/backup/pve1/vps/subvol-210-disk-1/%recv (id 332579) snap=2019-04-03-120000
1554285887 spa_history.c:314:spa_history_log_sync(): txg 315946 clone swap tank/backup/pve1/vps/subvol-210-disk-1/%recv (id 332579) parent=subvol-210-disk-1
1554285887 spa_history.c:314:spa_history_log_sync(): txg 315946 snapshot tank/backup/pve1/vps/subvol-210-disk-1@2019-04-03-120000 (id 331640)
1554285887 spa_history.c:314:spa_history_log_sync(): txg 315946 destroy tank/backup/pve1/vps/subvol-210-disk-1/%recv (id 332579)
1554285887 bptree.c:232:bptree_iterate(): bptree index 0: traversing from min_txg=315936 bookmark 0/0/0/0
1554285887 dsl_scan.c:3281:dsl_process_async_destroys(): freed 12 blocks in 1ms from free_bpobj/bptree txg 315946; err=0
1554285888 dsl_scan.c:3612:dsl_scan_sync(): scan issued 722 blocks (1 segs) in 1040ms (avg_block_size = 127121, avg_seg_size = 92277760)
1554285894 dsl_scan.c:3612:dsl_scan_sync(): scan issued 2033 blocks (1 segs) in 4501ms (avg_block_size = 118550, avg_seg_size = 242679808)
1554285894 spa_history.c:306:spa_history_log_sync(): command: zfs recv -F tank/backup/pve1/vps/subvol-210-disk-1
1554285894 spa_history.c:314:spa_history_log_sync(): txg 315948 receive tank/backup/pve1/vps/subvol-212-disk-1/%recv (id 331646)
1554285895 dsl_scan.c:3612:dsl_scan_sync(): scan issued 1756 blocks (1 segs) in 1041ms (avg_block_size = 41981, avg_seg_size = 76745216)
1554285901 dsl_scan.c:3612:dsl_scan_sync(): scan issued 1830 blocks (1 segs) in 5133ms (avg_block_size = 126237, avg_seg_size = 233222656)
1554285905 dsl_scan.c:3612:dsl_scan_sync(): scan issued 662 blocks (1 segs) in 2456ms (avg_block_size = 124505, avg_seg_size = 86342656)
1554285905 metaslab.c:2642:metaslab_condense(): condensing: txg 315950, msp[83] 000000007099e18c, vdev id 0, spa tank, smp size 20432, segments 755, forcing condense=FALSE
1554285911 dsl_scan.c:3612:dsl_scan_sync(): scan issued 298 blocks (1 segs) in 1134ms (avg_block_size = 111342, avg_seg_size = 36106752)
1554285920 dsl_scan.c:3612:dsl_scan_sync(): scan issued 254 blocks (1 segs) in 1114ms (avg_block_size = 122918, avg_seg_size = 32504832)
1554285926 dsl_scan.c:3612:dsl_scan_sync(): scan issued 1646 blocks (1 segs) in 2379ms (avg_block_size = 122091, avg_seg_size = 201345536)
This is the dmesg:
[ 3899.363064] VERIFY3(0 == dnode_set_blksz(dn, blocksize, 0, tx)) failed (0 == 95)
[ 3899.381400] PANIC at dnode.c:715:dnode_reallocate()
[ 3899.397196] Showing stack for process 19347
[ 3899.413695] CPU: 3 PID: 19347 Comm: receive_writer Tainted: P O 4.15.18-12-pve #1
[ 3899.434953] Hardware name: /S3000AH, BIOS S3000.86B.02.00.0044.071120071047 07/11/2007
[ 3899.454548] Call Trace:
[ 3899.468022] dump_stack+0x63/0x8b
[ 3899.482296] spl_dumpstack+0x29/0x30 [spl]
[ 3899.497294] spl_panic+0xc8/0x110 [spl]
[ 3899.511958] ? mutex_lock+0x12/0x40
[ 3899.526180] ? _cond_resched+0x1a/0x50
[ 3899.540715] ? zfs_space_delta_cb+0xeb/0x230 [zfs]
[ 3899.556166] ? __kmalloc_node+0x206/0x2b0
[ 3899.570767] ? _cond_resched+0x1a/0x50
[ 3899.585029] ? mutex_lock+0x12/0x40
[ 3899.599023] ? dmu_objset_userquota_get_ids+0x26a/0x420 [zfs]
[ 3899.615262] ? _cond_resched+0x1a/0x50
[ 3899.629446] ? _cond_resched+0x1a/0x50
[ 3899.643495] ? mutex_lock+0x12/0x40
[ 3899.657254] ? dnode_set_blksz+0xb7/0x310 [zfs]
[ 3899.671998] ? spl_kmem_free+0x33/0x40 [spl]
[ 3899.686465] dnode_reallocate+0x24c/0x270 [zfs]
[ 3899.701138] dmu_object_reclaim_dnsize+0xca/0xf0 [zfs]
[ 3899.716418] receive_object+0x450/0x9f0 [zfs]
[ 3899.730826] ? __wake_up_common_lock+0x8e/0xc0
[ 3899.745274] ? __wake_up+0x13/0x20
[ 3899.758576] ? __cv_signal+0x2d/0x40 [spl]
[ 3899.772573] receive_writer_thread+0x420/0xaa0 [zfs]
[ 3899.787419] ? set_curr_task_fair+0x2b/0x60
[ 3899.801453] ? set_user_nice.part.68+0x103/0x200
[ 3899.815944] ? receive_free.isra.9+0xc0/0xc0 [zfs]
[ 3899.830513] thread_generic_wrapper+0x74/0x90 [spl]
[ 3899.845151] ? receive_free.isra.9+0xc0/0xc0 [zfs]
[ 3899.859635] ? thread_generic_wrapper+0x74/0x90 [spl]
[ 3899.874374] kthread+0x105/0x140
[ 3899.887211] ? __thread_exit+0x20/0x20 [spl]
[ 3899.901059] ? kthread_create_worker_on_cpu+0x70/0x70
[ 3899.915654] ret_from_fork+0x35/0x40
I got the same with the latest git:
[48816.810205] VERIFY3(0 == dnode_set_blksz(dn, blocksize, 0, tx)) failed (0 == 95)
[48816.810212] PANIC at dnode.c:715:dnode_reallocate()
[48816.810214] Showing stack for process 100084
[48816.810218] CPU: 2 PID: 100084 Comm: receive_writer Tainted: P O 4.15.18-12-pve #1
[48816.810221] Hardware name: GIGABYTE GB-BXi5-5200/MQLP5AP-00, BIOS F5 11/23/2015
[48816.810224] Call Trace:
[48816.810231] dump_stack+0x63/0x8b
[48816.810239] spl_dumpstack+0x29/0x30 [spl]
[48816.810244] spl_panic+0xc8/0x110 [spl]
[48816.810248] ? mutex_lock+0x12/0x40
[48816.810251] ? _cond_resched+0x1a/0x50
[48816.810309] ? zfs_space_delta_cb+0xeb/0x230 [zfs]
[48816.810316] ? __kmalloc_node+0x1f4/0x2b0
[48816.810321] ? _cond_resched+0x1a/0x50
[48816.810326] ? mutex_lock+0x12/0x40
[48816.810371] ? dmu_objset_userquota_get_ids+0x26a/0x420 [zfs]
[48816.810375] ? _cond_resched+0x1a/0x50
[48816.810377] ? _cond_resched+0x1a/0x50
[48816.810380] ? mutex_lock+0x12/0x40
[48816.810426] ? dnode_set_blksz+0xb7/0x310 [zfs]
[48816.810433] ? spl_kmem_free+0x33/0x40 [spl]
[48816.810468] dnode_reallocate+0x24c/0x270 [zfs]
[48816.810501] dmu_object_reclaim_dnsize+0xca/0xf0 [zfs]
[48816.810533] receive_object+0x450/0x9f0 [zfs]
[48816.810538] ? __wake_up_common_lock+0x8e/0xc0
[48816.810541] ? __wake_up+0x13/0x20
[48816.810547] ? __cv_signal+0x2d/0x40 [spl]
[48816.810578] receive_writer_thread+0x420/0xaa0 [zfs]
[48816.810582] ? set_curr_task_fair+0x2b/0x60
[48816.810587] ? set_user_nice.part.68+0x103/0x200
[48816.810618] ? receive_free.isra.9+0xc0/0xc0 [zfs]
[48816.810625] thread_generic_wrapper+0x74/0x90 [spl]
[48816.810655] ? receive_free.isra.9+0xc0/0xc0 [zfs]
[48816.810662] ? thread_generic_wrapper+0x74/0x90 [spl]
[48816.810666] kthread+0x105/0x140
[48816.810671] ? __thread_exit+0x20/0x20 [spl]
[48816.810675] ? kthread_create_worker_on_cpu+0x70/0x70
[48816.810679] ret_from_fork+0x35/0x40
@Harvie unfortunately, it looks like the log you provided was not taken after running echo 4294966270 > /sys/module/zfs/parameters/zfs_flags. That command enables printing error messages to dbgmsg (like the ones from the snippet you originally posted), and the output you gave doesn't have any of those statements. Could you try it one more time?
it looks like the log you provided was not taken after running echo 4294966270 > /sys/module/zfs/parameters/zfs_flags
I am pretty sure I did that. By the way, zfs_dbgmsg_enable can only be set to 0 or 1:
# cat /sys/module/zfs/parameters/zfs_flags
4294966270
# cat /sys/module/zfs/parameters/zfs_dbgmsg_enable
1
I can try one more time...
[ 1328.892247] VERIFY3(0 == dnode_set_blksz(dn, blocksize, 0, tx)) failed (0 == 95)
[ 1328.910565] PANIC at dnode.c:715:dnode_reallocate()
[ 1328.926314] Showing stack for process 13233
[ 1328.941377] CPU: 3 PID: 13233 Comm: receive_writer Tainted: P O 4.15.18-12-pve #1
[ 1328.961142] Hardware name: /S3000AH, BIOS S3000.86B.02.00.0044.071120071047 07/11/2007
[ 1328.980333] Call Trace:
[ 1328.993919] dump_stack+0x63/0x8b
[ 1329.008327] spl_dumpstack+0x29/0x30 [spl]
[ 1329.023464] spl_panic+0xc8/0x110 [spl]
[ 1329.038269] ? spl_kmem_free+0x33/0x40 [spl]
[ 1329.053549] ? __dprintf+0xf6/0x160 [zfs]
[ 1329.068405] ? _cond_resched+0x1a/0x50
[ 1329.082925] ? mutex_lock+0x12/0x40
[ 1329.097152] ? __set_error+0x2c/0x30 [zfs]
[ 1329.111926] ? dnode_set_blksz+0xb7/0x310 [zfs]
[ 1329.127076] ? spl_kmem_free+0x33/0x40 [spl]
[ 1329.141928] dnode_reallocate+0x24c/0x270 [zfs]
[ 1329.156992] dmu_object_reclaim_dnsize+0xca/0xf0 [zfs]
[ 1329.172653] receive_object+0x450/0x9f0 [zfs]
[ 1329.187444] ? __wake_up_common_lock+0x8e/0xc0
[ 1329.202269] ? __wake_up+0x13/0x20
[ 1329.215953] ? __cv_signal+0x2d/0x40 [spl]
[ 1329.230335] receive_writer_thread+0x420/0xaa0 [zfs]
[ 1329.245549] ? set_curr_task_fair+0x2b/0x60
[ 1329.259945] ? set_user_nice.part.68+0x103/0x200
[ 1329.274793] ? receive_free.isra.9+0xc0/0xc0 [zfs]
[ 1329.289744] thread_generic_wrapper+0x74/0x90 [spl]
[ 1329.304771] ? receive_free.isra.9+0xc0/0xc0 [zfs]
[ 1329.319636] ? thread_generic_wrapper+0x74/0x90 [spl]
[ 1329.334759] kthread+0x105/0x140
[ 1329.347993] ? __thread_exit+0x20/0x20 [spl]
[ 1329.362252] ? kthread_create_worker_on_cpu+0x70/0x70
[ 1329.377258] ret_from_fork+0x35/0x40
There were lots of repeated lines, so I've used uniq -c to compress it. It removed the duplicates (only those immediately following each other with identical timestamps) and added a count of how many times each line was repeated:
# cat /sys/module/zfs/parameters/zfs_flags
4294966270
# cat /sys/module/zfs/parameters/zfs_dbgmsg_enable
1
# cat /proc/spl/kstat/zfs/dbgmsg | uniq -c
1 timestamp message
108 1554375109 dbuf.c:2702:dbuf_findbp(): error 2
1 1554375109 dnode.c:1423:dnode_hold_impl(): error 2
1 1554375109 dbuf.c:3159:dbuf_hold_impl_arg(): error 2
1 1554375109 dnode.c:2310:dnode_next_offset_level(): error 3
21 1554375109 dbuf.c:2702:dbuf_findbp(): error 2
1 1554375109 dnode.c:2343:dnode_next_offset_level(): error 3
6 1554375109 dbuf.c:2702:dbuf_findbp(): error 2
313 1554375109 dnode.c:1423:dnode_hold_impl(): error 2
4 1554375109 dbuf.c:2702:dbuf_findbp(): error 2
9 1554375109 dnode.c:1423:dnode_hold_impl(): error 2
12 1554375109 dbuf.c:2702:dbuf_findbp(): error 2
11 1554375109 dnode.c:1423:dnode_hold_impl(): error 2
11 1554375109 dbuf.c:2702:dbuf_findbp(): error 2
7 1554375109 dnode.c:1423:dnode_hold_impl(): error 2
7 1554375109 dbuf.c:2702:dbuf_findbp(): error 2
3 1554375109 dnode.c:1423:dnode_hold_impl(): error 2
2 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
2 1554375110 dbuf.c:3159:dbuf_hold_impl_arg(): error 2
5 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
12 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
28 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
28 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
14 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
14 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
2 1554375110 dnode.c:1423:dnode_hold_impl(): error 2
4 1554375110 dbuf.c:2702:dbuf_findbp(): error 2
29201 1554375112 zfeature.c:239:feature_get_refcount(): error 95
11 1554375113 zfeature.c:239:feature_get_refcount(): error 95
1 1554375113 zap.c:740:fzap_checksize(): error 22
368 1554375113 zfeature.c:239:feature_get_refcount(): error 95
165 1554375113 dnode.c:1423:dnode_hold_impl(): error 2
17 1554375113 zfeature.c:239:feature_get_refcount(): error 95
353 1554375113 dnode.c:1423:dnode_hold_impl(): error 2
434 1554375114 zfeature.c:239:feature_get_refcount(): error 95
14 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
14 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
34 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
16 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
20 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
7 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
7 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
39 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
21 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
25 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
28 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
28 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
22 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
22 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
32 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
32 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
18 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
18 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
8 1554375114 dnode.c:1423:dnode_hold_impl(): error 2
28 1554375114 dbuf.c:2702:dbuf_findbp(): error 2
1 1554375115 dnode.c:1762:dnode_set_blksz(): error 95
15526 1554375117 zfeature.c:239:feature_get_refcount(): error 95
1 1554375117 zap.c:740:fzap_checksize(): error 22
287 1554375117 zfeature.c:239:feature_get_refcount(): error 95
I'm seeing the same VERIFY3(0 == dnode_set_blksz(dn, blocksize, 0, tx)) failed (0 == 95) failure (my duplicate ticket at #8574).
As a note: I did not experience this issue at all until I had a power loss on my system. Since that time, I've had a zfs recv cause this failure on each boot up (2 so far).
@jmesmon That's probably just because you didn't have the new module version inserted (it was only installed); rebooting would have inserted the installed module.
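A quick way to see which module is actually running versus what is installed on disk (a sketch):

# version of the zfs module loaded into the running kernel
dmesg | grep 'ZFS: Loaded module'
cat /sys/module/zfs/version

# version of the zfs module installed on disk for the current kernel
modinfo -F version zfs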
I deleted some snapshots, and now the panic from the assert at dnode.c line 715 when sending and receiving the pool is gone. For me it looks like some faulty snapshots trigger the panic, but scrub hasn't detected any errors.
I think this is a combination of two issues:
1.) zfs 0.8.0 sometimes creates invalid snapshots.
2.) Both zfs 0.8.0 and 0.7.13 throw a panic when receiving such an invalid snapshot that gets rejected by an assertion. This is, in my opinion, even worse than 1.), because it unrecoverably crashes the machine when it could have just rejected the snapshot and returned an error code to userspace... I am afraid that this is bad by design and will affect more than just the VERIFY3(0 == dnode_set_blksz(dn, blocksize, 0, tx)) assertion. Chances are that any failed assertion in recv will cause a kernel panic rather than a userspace error, which is not a good way to handle this.
@wolfgangwuerfl @Harvie I understand why it may seem that there's something wrong with the snapshots, but that's almost certainly not the case. It's likely that those snapshots happen to generate a send stream which requires an object to be reallocated in place. This is fine, but certain conditions are then required to be true; the VERIFY is detecting that something isn't as it should be and there is a flaw somewhere in the logic. This should be impossible under all circumstances, so it's handled with a VERIFY instead of returning an error, which would hide the issue.
I've reverted the patch from master which introduced the issue until we can determine the fundamental root cause. I'll be working on this analysis. If you're in a position to do so, it would be great if you could pull the latest version (d93d4b1acdf53a25ad21e20ddfca3b0d58a06cdf or newer) and verify the issue is resolved.
@behlendorf So should I apply the fix on the sending side or the receiving side? I am a bit confused about where the problem lies :) However, I am compiling it on the receiving side, which is the crashing machine. On the other hand, this was crashing with zfs 0.7.13, which is probably older than the reverted commit...
It does not build:
make[4]: *** No rule to make target '../../../include/spl/sys/dkioc_free_util.h', needed by 'all-am'. Stop.
It does not build:
make[4]: *** No rule to make target '../../../include/spl/sys/dkioc_free_util.h', needed by 'all-am'. Stop.
I was able to build the latest master without issue on Debian Stretch. Have you checked the Custom Package Instructions? See the Git Master Branch section at the bottom of the page to get started and ensure you've installed the build dependencies for your distribution. If you have issues, you can always ask on the #zfsonlinux IRC channel or on the mailing list.
Yes, perhaps I forgot to run ./autogen.sh... I just have the git repo checked out in /usr/src/zfs-git, and then I run ./autogen.sh and dkms autoinstall zfs/git... That's how I've built it previously...
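For reference, a clean rebuild of the git checkout usually looks something like this (a sketch; it assumes the /usr/src/zfs-git DKMS layout mentioned above and the commit hash from the earlier comment):

cd /usr/src/zfs-git
git fetch origin && git checkout d93d4b1acdf53a25ad21e20ddfca3b0d58a06cdf
git clean -xfd          # drop stale generated files from previous builds
./autogen.sh
dkms remove zfs/git --all 2>/dev/null
dkms add zfs/git && dkms install zfs/git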
BTW I've just rebooted into 0.7.13 and so far it seems to be syncing... Unfortunately that might leave me unable to reproduce the bug, since it will get synced and I don't know which of the automatic snapshots it was... But it's strange; I had it crashing with 0.7.13 in the beginning. I wonder if it will sync...
Update: the problematic dataset successfully synced to the 0.7.13 recv... Is there anyone else who can try to reproduce the problem on the latest git?
Looks OK now: spa_history.c:319:spa_history_log_sync(): txg 1455054 import pool version 5000; software version zfs-0.8.0-rc3-141-gd93d4b1a; uts gentoo 4.20.13-gentoo-BeoCore #3 SMP PREEMPT Sat Apr 6 06:57:34 CEST 2019 x86_64
SEND/RECEIVE works flawlessly!
Doing a scrub shows a MUCH cleaner log when doing cat /proc/spl/kstat/zfs/dbgmsg, like this:
1554528759 metaslab.c:2652:metaslab_condense(): condensing: txg 120481, msp[33] ffff888425a02000, vdev id 0, spa fasttank, smp size 46616, segments 2902, forcing condense=FALSE
1554528759 zap.c:740:fzap_checksize(): error 22
1554528759 zap.c:740:fzap_checksize(): error 22
1554528759 zap.c:740:fzap_checksize(): error 22
1554528759 zap.c:740:fzap_checksize(): error 22
1554528759 zap.c:740:fzap_checksize(): error 22
1554528759 zap.c:740:fzap_checksize(): error 22
1554528759 zap.c:740:fzap_checksize(): error 22
1554528759 zap.c:740:fzap_checksize(): error 22
1554528760 zap.c:740:fzap_checksize(): error 22
1554528763 zap_micro.c:985:zap_lookup_impl(): error 2
1554528763 zap_micro.c:985:zap_lookup_impl(): error 2
1554528763 zap_micro.c:985:zap_lookup_impl(): error 2
1554528763 dmu.c:453:dmu_spill_hold_existing(): error 2
1554528763 sa.c:370:sa_attr_op(): error 2
1554528763 dbuf.c:2702:dbuf_findbp(): error 2
1554528763 dbuf.c:2702:dbuf_findbp(): error 2
1554528763 dmu.c:453:dmu_spill_hold_existing(): error 2
1554528763 sa.c:370:sa_attr_op(): error 2
1554528763 zap_micro.c:985:zap_lookup_impl(): error 2
1554528763 zap_micro.c:985:zap_lookup_impl(): error 2
1554528764 zap.c:740:fzap_checksize(): error 22
1554528765 zap.c:740:fzap_checksize(): error 22
1554528765 zap_micro.c:985:zap_lookup_impl(): error 2
1554528766 zap_micro.c:985:zap_lookup_impl(): error 2
1554528767 zap_micro.c:985:zap_lookup_impl(): error 2
1554528768 zap_micro.c:985:zap_lookup_impl(): error 2
1554528769 zap.c:740:fzap_checksize(): error 22
1554528770 zap.c:740:fzap_checksize(): error 22
1554528771 zap_micro.c:985:zap_lookup_impl(): error 2
1554528771 dmu.c:453:dmu_spill_hold_existing(): error 2
1554528771 dmu.c:453:dmu_spill_hold_existing(): error 2
1554528771 dmu.c:453:dmu_spill_hold_existing(): error 2
1554528771 sa.c:370:sa_attr_op(): error 2
1554528771 dmu.c:453:dmu_spill_hold_existing(): error 2
1554528771 sa.c:370:sa_attr_op(): error 2
1554528771 dmu.c:453:dmu_spill_hold_existing(): error 2
1554528771 sa.c:370:sa_attr_op(): error 2
1554528771 dmu.c:453:dmu_spill_hold_existing(): error 2
1554528771 sa.c:370:sa_attr_op(): error 2
1554528771 zfs_dir.c:1134:zfs_get_xattrdir(): error 2
Yesterday, with the non-reverted version, I got a log flooded with error 85 messages.
@Harvie @wolfgangwuerfl we're still trying to get this issue resolved. Can I ask what kind of data is stored in this dataset? What kinds of changes are made between snapshots? The reason it works for you right now is because we reverted the offending patch while we figure out what this issue is.
Can I ask what kind of data is stored in this dataset? What kinds of changes are made between snapshots?
I use ZFS mostly for Proxmox LXC VPSes. I had been increasing the quota on that dataset before the problem emerged. I think there was some problem with increasing the quota using the Proxmox web UI: it was increased by a different (larger) amount than I entered, so I had to fix that manually using zfs set. I am not sure whether that was a Proxmox or a ZFS problem, but I guess Proxmox uses zfs set anyway... Also, the client was copying some files to that dataset because he was merging two LXC containers into one. I can't really tell what the problem was.
After applying this patch the machine-crashing error is fixed - however, now I'm getting filesystem corruption:
ls: cannot access 'hostname/dirname': Input/output error
total 0
d????????? ? ? ? ? ? dir1 <=== not sure exactly when this occurred
d????????? ? ? ? ? ? dir2 <=== this appeared after applying the patch
I've started a scrub on the pool to see what happens.
I have 3 other hosts all working OK with no corruption (they're all running the same script that does a zfs recv of the same filesystems) - it's just this one that I recently upgraded to Ubuntu 18.04.
@stuartabramshumphries What patch are you referring to? I doubt a scrub will fix this particular issue. We are still working on a fix here.
@tcaputi I think it probably stems from the fact that traditional filesystems report problems that fsck will repair.
@stuartabramshumphries if there is detectable/correctable corruption that a scrub would 'fix', it should also be fixed on demand when you access the data (at least in terms of corrupt records), if possible.
This specific crash was caused by 5dbf8b4ed which has been reverted by d93d4b1acdf53a25ad21e20ddfca3b0d58a06cdf. Issue #7378 which 5dbf8b4ed was intended to fix is specific to encrypted raw receives and is being worked in PR #8607.
I'm closing this issue since the root cause is understood and now resolved in master.
@behlendorf What about the corruption that @stuartabramshumphries reported? Looks scary to me...
@stuartabramshumphries could you be specific about exactly which versions you were running and what additional patches were applied.
This issue existed in the master branch from March 27th (5dbf8b4) to April 15th (d93d4b1). Pools using a version of ZFS from master in that range with send/recv may have been affected. The offending change was never included in any tagged ZoL release, or in any downstream releases that I'm aware of.
As part of fixing the underlying root cause we've included regression tests which would have detected this issue if they'd been in place. The new tests provide considerably more test coverage which should help prevent future bugs or regressions.
@behlendorf Initially the system was Ubuntu 16.04 running zfs 0.6.5.6-0ubuntu27 - so just the standard setup. We have had multiple machines running this for a few years, all OK.
We then upgraded one server to Ubuntu 18.04, which also upgraded ZFS to 0.7.5 - we then had the problems with zfs recv causing system crashes.
I then upgraded to the standard 0.8.0 - this still had the hanging issue.
Next I upgraded to your patch - branch d93d4b1ac - this is when I noticed the first directory's corruption; the second directory's corruption occurred later. However, the box crashing has stopped and the other directories are OK.
This dir has been sent/received elsewhere and is OK on the older Ubuntu 16.04 boxes. Has this fix been backported to 0.7.5?
@stuartabramshumphries I see thank you for the details. We'll make sure to get any relevant send/recv fixes backported to 0.7.14 for the distributions to pick up via their normal process.
@behlendorf Another interesting point: the combination of Ubuntu 18.04 and the patched 0.8.0 ZFS no longer crashes/hangs the machine - however, it is corrupting the filesystem. We have this running on numerous 16.04 servers with zfs 0.6.5.6 and no issues at all - numerous snapshots, we can rsync, read... However, when we transmit these ZFS snapshots to the 18.04 server it corrupts the filesystem - the same one pretty much all the time. I've tested this on brand new Ubuntu 18.04 boxes, so it is reproducible... something odd with this one filesystem that 18.04 doesn't like?
Hi! I am not sure how to debug this. Any ideas?
filename: /lib/modules/4.15.18-12-pve/zfs/zfs.ko
version: 0.7.13-pve1~bpo1