openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.48k stars 1.74k forks source link

spl_panic when receiving encrypted dataset #6821

Closed sjau closed 6 years ago

sjau commented 6 years ago

System information

Type Version/Name
Distribution Name Nixos
Distribution Version Nixos Unstable Small
Linux Kernel 4.9.58
Architecture x86_64
ZFS Version 0.7.0-1
SPL Version 0.7.0-1

Describe the problem you're observing

I'm trying to backup encrypted datasets from my notebook to my homeserver. Both run same Nixos version. However it doesn't work.

When I use -wR options for full dataset sending, then on the receiving end spl_panic appears and it never finishes.

If I omit the -R option, then full dataset sending works. However when I then try send an incremental set, dame things happens again - spl_panic

Describe how to reproduce the problem

On notebook I have: tank/encZFS/Nixos -> encZFS was create this way: zfs create -o encryption=aes-256-gcm -o keyformat=passphrase -o mountpoint=none -o atime=off ${zfsPool}/encZFS

On the server I have: serviTank/BU/subi -> None of those is encrypted

I then took a snapshot and tried to send like this:

zfs send -wR tank/encZFS/Nixos@encZFSSend_2017-11-04_12-31 | ssh root@10.0.0.3 'zfs receive serviTank/BU/subi/Nixos'

It seems all was transferred correctly:

zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
serviTank                  137G  3.38T    96K  /serviTank
serviTank/BU              95.8G  3.38T    96K  none
serviTank/BU/subi         95.8G  3.38T    96K  none
serviTank/BU/subi/Nixos   95.8G  3.38T  95.8G  legacy
serviTank/encZFS          40.8G  3.38T  1.39M  none
serviTank/encZFS/BU       2.78M  3.38T  1.39M  none
serviTank/encZFS/BU/subi  1.39M  3.38T  1.39M  none
serviTank/encZFS/Nixos    40.8G  3.38T  5.64G  legacy

However the zfs send/receive command never "finishes" and on the server side dmesg shows spl_panic

[ 1556.014734] VERIFY3(0 == dmu_object_dirty_raw(os, object, tx)) failed (0 == 17)
[ 1556.014757] PANIC at dmu.c:937:dmu_free_long_object_impl()
[ 1556.014770] Showing stack for process 18808
[ 1556.014772] CPU: 5 PID: 18808 Comm: receive_writer Tainted: P           O    4.9.58 #1-NixOS
[ 1556.014772] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1556.014773]  ffffaf4e11137b38 ffffffff942f7a12 ffffffffc02c4e03 00000000000003a9
[ 1556.014775]  ffffaf4e11137b48 ffffffffc00759f2 ffffaf4e11137cd0 ffffffffc0075ac5
[ 1556.014776]  0000000000000000 ffffaf4e00000030 ffffaf4e11137ce0 ffffaf4e11137c80
[ 1556.014777] Call Trace:
[ 1556.014781]  [<ffffffff942f7a12>] dump_stack+0x63/0x81
[ 1556.014785]  [<ffffffffc00759f2>] spl_dumpstack+0x42/0x50 [spl]
[ 1556.014787]  [<ffffffffc0075ac5>] spl_panic+0xc5/0x100 [spl]
[ 1556.014806]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1556.014816]  [<ffffffffc0190107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1556.014825]  [<ffffffffc0190443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1556.014826]  [<ffffffff945671e2>] ? mutex_lock+0x12/0x30
[ 1556.014839]  [<ffffffffc01bf84b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1556.014848]  [<ffffffffc019020b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1556.014857]  [<ffffffffc017aa7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1556.014865]  [<ffffffffc017ab24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1556.014873]  [<ffffffffc0187858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1556.014881]  [<ffffffffc0187cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1556.014883]  [<ffffffff941ce021>] ? __slab_free+0xa1/0x2e0
[ 1556.014884]  [<ffffffff940a5200>] ? set_next_entity+0x70/0x890
[ 1556.014886]  [<ffffffffc006ff53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1556.014887]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1556.014895]  [<ffffffffc0187910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1556.014896]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1556.014898]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1556.014899]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1556.014899]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1556.014901]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ 1721.304223] INFO: task txg_quiesce:468 blocked for more than 120 seconds.
[ 1721.304317]       Tainted: P           O    4.9.58 #1-NixOS
[ 1721.304376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1721.304456] txg_quiesce     D    0   468      2 0x00000000
[ 1721.304463]  ffff92d789a44400 0000000000000000 ffff92d7afbd7ec0 ffff92d783539a80
[ 1721.304469]  ffff92d78c355cc0 ffffaf4e10e1fd30 ffffffff94565082 ffffaf4e10e1fd00
[ 1721.304474]  0000000000000246 0000000180200010 ffffaf4e10e1fd50 ffff92d783539a80
[ 1721.304479] Call Trace:
[ 1721.304493]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1721.304500]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1721.304511]  [<ffffffffc0077cb8>] cv_wait_common+0x128/0x140 [spl]
[ 1721.304518]  [<ffffffff940ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1721.304525]  [<ffffffffc0077ce5>] __cv_wait+0x15/0x20 [spl]
[ 1721.304591]  [<ffffffffc01de633>] txg_quiesce_thread+0x2e3/0x3f0 [zfs]
[ 1721.304640]  [<ffffffffc01de350>] ? txg_wait_open+0x100/0x100 [zfs]
[ 1721.304647]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1721.304654]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1721.304658]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1721.304662]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1721.304664]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1721.304669]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ 1721.304686] INFO: task zfs:15048 blocked for more than 120 seconds.
[ 1721.304753]       Tainted: P           O    4.9.58 #1-NixOS
[ 1721.304810] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1721.304890] zfs             D    0 15048  15040 0x00000000
[ 1721.304894]  ffff92d7617f0400 0000000000000000 ffff92d7afb57ec0 ffff92d78bf7cf80
[ 1721.304900]  ffff92d78c354240 ffffaf4e032438c8 ffffffff94565082 ffffffffc016ec26
[ 1721.304904]  ffff92d68c3e9730 0000000000000001 ffffaf4e032438d0 ffff92d78bf7cf80
[ 1721.304909] Call Trace:
[ 1721.304916]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1721.304953]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1721.304959]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1721.304967]  [<ffffffffc0077cb8>] cv_wait_common+0x128/0x140 [spl]
[ 1721.304972]  [<ffffffff940ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1721.304979]  [<ffffffffc0077ce5>] __cv_wait+0x15/0x20 [spl]
[ 1721.305015]  [<ffffffffc0173f92>] bqueue_enqueue+0x62/0xe0 [zfs]
[ 1721.305059]  [<ffffffffc01898c1>] dmu_recv_stream+0x691/0x11c0 [zfs]
[ 1721.305066]  [<ffffffffc009062a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1721.305116]  [<ffffffffc02108fa>] ? zfs_set_prop_nvlist+0x2fa/0x510 [zfs]
[ 1721.305190]  [<ffffffffc0211057>] zfs_ioc_recv_impl+0x407/0x1170 [zfs]
[ 1721.305241]  [<ffffffffc02123f9>] zfs_ioc_recv_new+0x369/0x400 [zfs]
[ 1721.305254]  [<ffffffffc00702cc>] ? spl_kmem_alloc_impl+0x9c/0x180 [spl]
[ 1721.305263]  [<ffffffffc00724a9>] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1721.305270]  [<ffffffffc00958af>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1721.305276]  [<ffffffffc009062a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1721.305283]  [<ffffffffc00906ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1721.305330]  [<ffffffffc020f0eb>] zfsdev_ioctl+0x20b/0x660 [zfs]
[ 1721.305340]  [<ffffffff941ff604>] do_vfs_ioctl+0x94/0x5c0
[ 1721.305347]  [<ffffffff9405dece>] ? __do_page_fault+0x25e/0x4c0
[ 1721.305352]  [<ffffffff941ffba9>] SyS_ioctl+0x79/0x90
[ 1721.305359]  [<ffffffff94569ef7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1721.305366] INFO: task receive_writer:18808 blocked for more than 120 seconds.
[ 1721.305442]       Tainted: P           O    4.9.58 #1-NixOS
[ 1721.305500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1721.305637] receive_writer  D    0 18808      2 0x00000000
[ 1721.305644]  ffff92d789a44400 0000000000000000 ffff92d7afb57ec0 ffff92d78bf7ea00
[ 1721.305654]  ffff92d78bf78000 ffffaf4e11137b30 ffffffff94565082 0000000000000000
[ 1721.305662]  ffffffffc02af5d0 00ffffffc02c51d0 0000000000000001 ffff92d78bf7ea00
[ 1721.305671] Call Trace:
[ 1721.305681]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1721.305691]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1721.305703]  [<ffffffffc0075aeb>] spl_panic+0xeb/0x100 [spl]
[ 1721.305765]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1721.305821]  [<ffffffffc0190107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1721.305873]  [<ffffffffc0190443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1721.305879]  [<ffffffff945671e2>] ? mutex_lock+0x12/0x30
[ 1721.305943]  [<ffffffffc01bf84b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1721.305997]  [<ffffffffc019020b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1721.306051]  [<ffffffffc017aa7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1721.306102]  [<ffffffffc017ab24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1721.306147]  [<ffffffffc0187858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1721.306207]  [<ffffffffc0187cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1721.306214]  [<ffffffff941ce021>] ? __slab_free+0xa1/0x2e0
[ 1721.306221]  [<ffffffff940a5200>] ? set_next_entity+0x70/0x890
[ 1721.306231]  [<ffffffffc006ff53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1721.306244]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1721.306288]  [<ffffffffc0187910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1721.306296]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1721.306303]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1721.306307]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1721.306316]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1721.306323]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ 1844.182739] INFO: task txg_quiesce:468 blocked for more than 120 seconds.
[ 1844.182831]       Tainted: P           O    4.9.58 #1-NixOS
[ 1844.182889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1844.182967] txg_quiesce     D    0   468      2 0x00000000
[ 1844.182974]  ffff92d789a44400 0000000000000000 ffff92d7afbd7ec0 ffff92d783539a80
[ 1844.182980]  ffff92d78c355cc0 ffffaf4e10e1fd30 ffffffff94565082 ffffaf4e10e1fd00
[ 1844.182985]  0000000000000246 0000000180200010 ffffaf4e10e1fd50 ffff92d783539a80
[ 1844.182990] Call Trace:
[ 1844.183003]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1844.183009]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1844.183019]  [<ffffffffc0077cb8>] cv_wait_common+0x128/0x140 [spl]
[ 1844.183026]  [<ffffffff940ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1844.183033]  [<ffffffffc0077ce5>] __cv_wait+0x15/0x20 [spl]
[ 1844.183094]  [<ffffffffc01de633>] txg_quiesce_thread+0x2e3/0x3f0 [zfs]
[ 1844.183144]  [<ffffffffc01de350>] ? txg_wait_open+0x100/0x100 [zfs]
[ 1844.183151]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1844.183157]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1844.183162]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1844.183165]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1844.183168]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1844.183172]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ 1844.183190] INFO: task zfs:15048 blocked for more than 120 seconds.
[ 1844.183255]       Tainted: P           O    4.9.58 #1-NixOS
[ 1844.183312] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1844.183392] zfs             D    0 15048  15040 0x00000000
[ 1844.183397]  ffff92d7617f0400 0000000000000000 ffff92d7afb57ec0 ffff92d78bf7cf80
[ 1844.183402]  ffff92d78c354240 ffffaf4e032438c8 ffffffff94565082 ffffffffc016ec26
[ 1844.183407]  ffff92d68c3e9730 0000000000000001 ffffaf4e032438d0 ffff92d78bf7cf80
[ 1844.183411] Call Trace:
[ 1844.183418]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1844.183455]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1844.183461]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1844.183468]  [<ffffffffc0077cb8>] cv_wait_common+0x128/0x140 [spl]
[ 1844.183473]  [<ffffffff940ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1844.183480]  [<ffffffffc0077ce5>] __cv_wait+0x15/0x20 [spl]
[ 1844.183516]  [<ffffffffc0173f92>] bqueue_enqueue+0x62/0xe0 [zfs]
[ 1844.183560]  [<ffffffffc01898c1>] dmu_recv_stream+0x691/0x11c0 [zfs]
[ 1844.183567]  [<ffffffffc009062a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1844.183617]  [<ffffffffc02108fa>] ? zfs_set_prop_nvlist+0x2fa/0x510 [zfs]
[ 1844.183663]  [<ffffffffc0211057>] zfs_ioc_recv_impl+0x407/0x1170 [zfs]
[ 1844.183731]  [<ffffffffc02123f9>] zfs_ioc_recv_new+0x369/0x400 [zfs]
[ 1844.183744]  [<ffffffffc00702cc>] ? spl_kmem_alloc_impl+0x9c/0x180 [spl]
[ 1844.183751]  [<ffffffffc00724a9>] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1844.183758]  [<ffffffffc00958af>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1844.183764]  [<ffffffffc009062a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1844.183770]  [<ffffffffc00906ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1844.183819]  [<ffffffffc020f0eb>] zfsdev_ioctl+0x20b/0x660 [zfs]
[ 1844.183828]  [<ffffffff941ff604>] do_vfs_ioctl+0x94/0x5c0
[ 1844.183835]  [<ffffffff9405dece>] ? __do_page_fault+0x25e/0x4c0
[ 1844.183841]  [<ffffffff941ffba9>] SyS_ioctl+0x79/0x90
[ 1844.183846]  [<ffffffff94569ef7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1844.183852] INFO: task receive_writer:18808 blocked for more than 120 seconds.
[ 1844.183929]       Tainted: P           O    4.9.58 #1-NixOS
[ 1844.183987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1844.184067] receive_writer  D    0 18808      2 0x00000000
[ 1844.184071]  ffff92d789a44400 0000000000000000 ffff92d7afb57ec0 ffff92d78bf7ea00
[ 1844.184076]  ffff92d78bf78000 ffffaf4e11137b30 ffffffff94565082 0000000000000000
[ 1844.184082]  ffffffffc02af5d0 00ffffffc02c51d0 0000000000000001 ffff92d78bf7ea00
[ 1844.184088] Call Trace:
[ 1844.184095]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1844.184104]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1844.184112]  [<ffffffffc0075aeb>] spl_panic+0xeb/0x100 [spl]
[ 1844.184149]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1844.184196]  [<ffffffffc0190107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1844.184256]  [<ffffffffc0190443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1844.184262]  [<ffffffff945671e2>] ? mutex_lock+0x12/0x30
[ 1844.184334]  [<ffffffffc01bf84b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1844.184391]  [<ffffffffc019020b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1844.184446]  [<ffffffffc017aa7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1844.184498]  [<ffffffffc017ab24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1844.184557]  [<ffffffffc0187858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1844.184607]  [<ffffffffc0187cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1844.184616]  [<ffffffff941ce021>] ? __slab_free+0xa1/0x2e0
[ 1844.184622]  [<ffffffff940a5200>] ? set_next_entity+0x70/0x890
[ 1844.184630]  [<ffffffffc006ff53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1844.184642]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1844.184705]  [<ffffffffc0187910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1844.184714]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1844.184719]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1844.184724]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1844.184730]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1844.184736]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ 1967.061258] INFO: task txg_quiesce:468 blocked for more than 120 seconds.
[ 1967.061345]       Tainted: P           O    4.9.58 #1-NixOS
[ 1967.061403] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.061483] txg_quiesce     D    0   468      2 0x00000000
[ 1967.061490]  ffff92d789a44400 0000000000000000 ffff92d7afbd7ec0 ffff92d783539a80
[ 1967.061496]  ffff92d78c355cc0 ffffaf4e10e1fd30 ffffffff94565082 ffffaf4e10e1fd00
[ 1967.061501]  0000000000000246 0000000180200010 ffffaf4e10e1fd50 ffff92d783539a80
[ 1967.061507] Call Trace:
[ 1967.061521]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1967.061528]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1967.061539]  [<ffffffffc0077cb8>] cv_wait_common+0x128/0x140 [spl]
[ 1967.061546]  [<ffffffff940ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1967.061554]  [<ffffffffc0077ce5>] __cv_wait+0x15/0x20 [spl]
[ 1967.061638]  [<ffffffffc01de633>] txg_quiesce_thread+0x2e3/0x3f0 [zfs]
[ 1967.061690]  [<ffffffffc01de350>] ? txg_wait_open+0x100/0x100 [zfs]
[ 1967.061697]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1967.061704]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1967.061708]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1967.061712]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1967.061714]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1967.061719]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ 1967.061738] INFO: task zfs:15048 blocked for more than 120 seconds.
[ 1967.061804]       Tainted: P           O    4.9.58 #1-NixOS
[ 1967.061862] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.061942] zfs             D    0 15048  15040 0x00000000
[ 1967.061947]  ffff92d7617f0400 0000000000000000 ffff92d7afb57ec0 ffff92d78bf7cf80
[ 1967.061952]  ffff92d78c354240 ffffaf4e032438c8 ffffffff94565082 ffffffffc016ec26
[ 1967.061957]  ffff92d68c3e9730 0000000000000001 ffffaf4e032438d0 ffff92d78bf7cf80
[ 1967.061961] Call Trace:
[ 1967.061968]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1967.062006]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1967.062012]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1967.062020]  [<ffffffffc0077cb8>] cv_wait_common+0x128/0x140 [spl]
[ 1967.062024]  [<ffffffff940ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1967.062032]  [<ffffffffc0077ce5>] __cv_wait+0x15/0x20 [spl]
[ 1967.062068]  [<ffffffffc0173f92>] bqueue_enqueue+0x62/0xe0 [zfs]
[ 1967.062112]  [<ffffffffc01898c1>] dmu_recv_stream+0x691/0x11c0 [zfs]
[ 1967.062119]  [<ffffffffc009062a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1967.062170]  [<ffffffffc02108fa>] ? zfs_set_prop_nvlist+0x2fa/0x510 [zfs]
[ 1967.062243]  [<ffffffffc0211057>] zfs_ioc_recv_impl+0x407/0x1170 [zfs]
[ 1967.062293]  [<ffffffffc02123f9>] zfs_ioc_recv_new+0x369/0x400 [zfs]
[ 1967.062306]  [<ffffffffc00702cc>] ? spl_kmem_alloc_impl+0x9c/0x180 [spl]
[ 1967.062315]  [<ffffffffc00724a9>] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1967.062322]  [<ffffffffc00958af>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1967.062328]  [<ffffffffc009062a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1967.062335]  [<ffffffffc00906ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1967.062381]  [<ffffffffc020f0eb>] zfsdev_ioctl+0x20b/0x660 [zfs]
[ 1967.062391]  [<ffffffff941ff604>] do_vfs_ioctl+0x94/0x5c0
[ 1967.062398]  [<ffffffff9405dece>] ? __do_page_fault+0x25e/0x4c0
[ 1967.062403]  [<ffffffff941ffba9>] SyS_ioctl+0x79/0x90
[ 1967.062410]  [<ffffffff94569ef7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1967.062418] INFO: task receive_writer:18808 blocked for more than 120 seconds.
[ 1967.062494]       Tainted: P           O    4.9.58 #1-NixOS
[ 1967.062552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.062632] receive_writer  D    0 18808      2 0x00000000
[ 1967.062637]  ffff92d789a44400 0000000000000000 ffff92d7afb57ec0 ffff92d78bf7ea00
[ 1967.062645]  ffff92d78bf78000 ffffaf4e11137b30 ffffffff94565082 0000000000000000
[ 1967.062649]  ffffffffc02af5d0 00ffffffc02c51d0 0000000000000001 ffff92d78bf7ea00
[ 1967.062654] Call Trace:
[ 1967.062661]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1967.062667]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1967.062677]  [<ffffffffc0075aeb>] spl_panic+0xeb/0x100 [spl]
[ 1967.062714]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1967.062760]  [<ffffffffc0190107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1967.062804]  [<ffffffffc0190443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1967.062813]  [<ffffffff945671e2>] ? mutex_lock+0x12/0x30
[ 1967.062870]  [<ffffffffc01bf84b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1967.062913]  [<ffffffffc019020b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1967.062955]  [<ffffffffc017aa7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1967.062995]  [<ffffffffc017ab24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1967.063038]  [<ffffffffc0187858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1967.063077]  [<ffffffffc0187cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1967.063085]  [<ffffffff941ce021>] ? __slab_free+0xa1/0x2e0
[ 1967.063092]  [<ffffffff940a5200>] ? set_next_entity+0x70/0x890
[ 1967.063099]  [<ffffffffc006ff53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1967.063105]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1967.063145]  [<ffffffffc0187910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1967.063153]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1967.063168]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1967.063174]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1967.063180]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1967.063187]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ 2089.939773] INFO: task txg_quiesce:468 blocked for more than 120 seconds.
[ 2089.939861]       Tainted: P           O    4.9.58 #1-NixOS
[ 2089.939919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2089.940000] txg_quiesce     D    0   468      2 0x00000000
[ 2089.940007]  ffff92d789a44400 0000000000000000 ffff92d7afbd7ec0 ffff92d783539a80
[ 2089.940013]  ffff92d78c355cc0 ffffaf4e10e1fd30 ffffffff94565082 ffffaf4e10e1fd00
[ 2089.940018]  0000000000000246 0000000180200010 ffffaf4e10e1fd50 ffff92d783539a80
[ 2089.940023] Call Trace:
[ 2089.940037]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 2089.940043]  [<ffffffff94565586>] schedule+0x36/0x80
[ 2089.940054]  [<ffffffffc0077cb8>] cv_wait_common+0x128/0x140 [spl]
[ 2089.940061]  [<ffffffff940ad390>] ? wake_atomic_t_function+0x60/0x60
[ 2089.940068]  [<ffffffffc0077ce5>] __cv_wait+0x15/0x20 [spl]
[ 2089.940132]  [<ffffffffc01de633>] txg_quiesce_thread+0x2e3/0x3f0 [zfs]
[ 2089.940180]  [<ffffffffc01de350>] ? txg_wait_open+0x100/0x100 [zfs]
[ 2089.940188]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 2089.940194]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 2089.940199]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 2089.940202]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 2089.940205]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60

As said, if I don't use the -R option on first dataset sending to server it works fine, but when I then try to send an incremental snapshot the same thing happens.

I also tried to send the snapshots to an encrypted child dataset on the server with the same results.

tcaputi commented 6 years ago

@sjau Have you tried this with the latest code from master? I recently fixed a good number of these bugs a couple weeks ago. If you are on the latest code, could you provide the send file that is causing the panic? Thanks a lot.

sjau commented 6 years ago

Not sure what Nixos unstable uses... I'll have to find out.

What do you mean by providing the send file that is causing the panic?

If have don't have a copy yet on the remote server and use the -R flag it causes panic. If I just use -w flag it works for the inital dataset (snapshot) sending. But then when I try to send an incremental snapshot it will panic again.

Smallest dataset is ~ 95GB

sjau commented 6 years ago

Ok, already found it out... I use the unstable version because of encryption.

Currently Nixos uses for ZFS unstable commit 7670f721fc82e6cdcdd31f83760a79b6f2f2b998

Checked it on server and on notebook - both use same revision - https://github.com/NixOS/nixpkgs/blob/master/pkgs/os-specific/linux/zfs/default.nix#L164

tcaputi commented 6 years ago

Not sure what Nixos unstable uses... I'll have to find out.

Looking at the stack trace it seems that you may(?) have the patch I was talking about after all. I'll try to look into this. It would be really helpful if you could provide a minimal set of commands to reproduce the problem if possible. If not, I'll do my best to figure it out on my own

What do you mean by providing the send file that is causing the panic?

I meant the output of the zfs send command you are using. But if your dataset is 95GB this might not be feasible.

sjau commented 6 years ago

I can test it with a smaller dataset just to see how it goes.

tcaputi commented 6 years ago

Actually, it would be really helpful if you could cause it to crash one more time with the following debugging on the receiving side:

echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
# perform the send, triggering the panic

Then provide the output of cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'

sjau commented 6 years ago

The commands are simple:

If dataset does not exist, I can use:

zfs send -w "${srcDS}@${dsPrefix}_${now}" | ssh "${dstSrv}" "zfs receive ${dstDS}/${dsName}

That will create the dataset nicely on the destination server - alas I haven't tried to mount it and check it... However, if I use the -R option, it causes panic:

zfs send -wR "${srcDS}@${dsPrefix}_${now}" | ssh "${dstSrv}" "zfs receive ${dstDS}/${dsName}

Also for incremental:

If I didn't use the -R option for initial dataset creation on the server and then try to send an incremental dataset like below, it also causes panic:

zfs send -wi "#${incDS}" "${srcDS}@${dsPrefix}_${now}" | ssh "${dstSrv}" "zfs receive ${dstDS}/${dsName}"

sjau commented 6 years ago

Shall I crash it with initial dataset (-R) or incremental set?

tcaputi commented 6 years ago

The commands are simple:

We actually already have tests for these commands in our test suite (which is run against every pull request). This is probably a problem that is triggered by the data in your dataset.

Shall I crash it with initial dataset (-R) or incremental set?

The initial one please. Thanks a lot.

sjau commented 6 years ago

Doing so now... will take a few minutes to send 95GB or so....

sjau commented 6 years ago

FYI: The script I created to take snapshots and bookmarks and send them: https://paste.simplylinux.ch/view/raw/d7809481

sjau commented 6 years ago

Nothing found

[189601.664736] Showing stack for process 3880
[189601.664737] CPU: 7 PID: 3880 Comm: receive_writer Tainted: P           O    4.9.58 #1-NixOS
[189601.664738] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[189601.664739]  ffffbd1250b2fb38 ffffffffa88f7a12 ffffffffc02ffe03 00000000000003a9
[189601.664740]  ffffbd1250b2fb48 ffffffffc00b09f2 ffffbd1250b2fcd0 ffffffffc00b0ac5
[189601.664741]  0000000000000000 ffffbd1200000030 ffffbd1250b2fce0 ffffbd1250b2fc80
[189601.664742] Call Trace:
[189601.664747]  [<ffffffffa88f7a12>] dump_stack+0x63/0x81
[189601.664752]  [<ffffffffc00b09f2>] spl_dumpstack+0x42/0x50 [spl]
[189601.664753]  [<ffffffffc00b0ac5>] spl_panic+0xc5/0x100 [spl]
[189601.664774]  [<ffffffffc01a9c26>] ? dbuf_rele+0x36/0x40 [zfs]
[189601.664784]  [<ffffffffc01cb107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[189601.664792]  [<ffffffffc01cb443>] ? dnode_setdirty+0x83/0x100 [zfs]
[189601.664793]  [<ffffffffa8b671e2>] ? mutex_lock+0x12/0x30
[189601.664805]  [<ffffffffc01fa84b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[189601.664814]  [<ffffffffc01cb20b>] ? dnode_hold+0x1b/0x20 [zfs]
[189601.664823]  [<ffffffffc01b5a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[189601.664831]  [<ffffffffc01b5b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[189601.664840]  [<ffffffffc01c2858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[189601.664848]  [<ffffffffc01c2cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[189601.664849]  [<ffffffffa87ce021>] ? __slab_free+0xa1/0x2e0
[189601.664851]  [<ffffffffa86a5200>] ? set_next_entity+0x70/0x890
[189601.664852]  [<ffffffffc00aaf53>] ? spl_kmem_free+0x33/0x40 [spl]
[189601.664854]  [<ffffffffc00ad5d0>] ? __thread_exit+0x20/0x20 [spl]
[189601.664862]  [<ffffffffc01c2910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[189601.664863]  [<ffffffffc00ad5d0>] ? __thread_exit+0x20/0x20 [spl]
[189601.664864]  [<ffffffffc00ad642>] thread_generic_wrapper+0x72/0x80 [spl]
[189601.664865]  [<ffffffffa868e457>] kthread+0xd7/0xf0
[189601.664866]  [<ffffffffa868e380>] ? kthread_park+0x60/0x60
[189601.664867]  [<ffffffffa8b6a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
root@servi-nixos:~# 
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg  > /tmp/dbgmsg
root@servi-nixos:~# ls -al /tmp/dbgmsg 
-rw-r--r-- 1 root root 3462117 Nov  6 20:35 /tmp/dbgmsg

The whole dbgmsg output can be viewed here https://paste.simplylinux.ch/view/raw/65e6bc7c

tcaputi commented 6 years ago

@sjau I'm sorry. I forgot a step that you need to run on non-debug builds. Unfortunately you will need to restart the receiving machine and run this immediately after echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable

echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
sjau commented 6 years ago

Still no output:

root@servi-nixos:~# zfs destroy -r serviTank/encZFS/BU/subi/Nixos
root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
root@servi-nixos:~# dmesg | tail -n 20
[ 1314.303517]  [<ffffffffc029a107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1314.303526]  [<ffffffffc029a443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1314.303528]  [<ffffffffbd7671e2>] ? mutex_lock+0x12/0x30
[ 1314.303540]  [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1314.303549]  [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1314.303557]  [<ffffffffc0284a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1314.303565]  [<ffffffffc0284b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1314.303573]  [<ffffffffc0291858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1314.303581]  [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1314.303583]  [<ffffffffbd3ce021>] ? __slab_free+0xa1/0x2e0
[ 1314.303584]  [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1314.303586]  [<ffffffffc0176f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1314.303587]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303595]  [<ffffffffc0291910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1314.303596]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303597]  [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1314.303598]  [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1314.303599]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303599]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303600]  [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
root@servi-nixos:~# dmesg | tail -n 30
[ 1314.303473] CPU: 6 PID: 22544 Comm: receive_writer Tainted: P           O    4.9.58 #1-NixOS
[ 1314.303473] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1314.303474]  ffff9cc8839a3b38 ffffffffbd4f7a12 ffffffffc03cee03 00000000000003a9
[ 1314.303476]  ffff9cc8839a3b48 ffffffffc017c9f2 ffff9cc8839a3cd0 ffffffffc017cac5
[ 1314.303477]  0000000000000028 ffff9cc800000030 ffff9cc8839a3ce0 ffff9cc8839a3c80
[ 1314.303478] Call Trace:
[ 1314.303482]  [<ffffffffbd4f7a12>] dump_stack+0x63/0x81
[ 1314.303486]  [<ffffffffc017c9f2>] spl_dumpstack+0x42/0x50 [spl]
[ 1314.303487]  [<ffffffffc017cac5>] spl_panic+0xc5/0x100 [spl]
[ 1314.303507]  [<ffffffffc030d3c2>] ? __set_error+0x22/0x30 [zfs]
[ 1314.303517]  [<ffffffffc029a107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1314.303526]  [<ffffffffc029a443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1314.303528]  [<ffffffffbd7671e2>] ? mutex_lock+0x12/0x30
[ 1314.303540]  [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1314.303549]  [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1314.303557]  [<ffffffffc0284a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1314.303565]  [<ffffffffc0284b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1314.303573]  [<ffffffffc0291858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1314.303581]  [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1314.303583]  [<ffffffffbd3ce021>] ? __slab_free+0xa1/0x2e0
[ 1314.303584]  [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1314.303586]  [<ffffffffc0176f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1314.303587]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303595]  [<ffffffffc0291910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1314.303596]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303597]  [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1314.303598]  [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1314.303599]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303599]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303600]  [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
root@servi-nixos:~# 
tcaputi commented 6 years ago

Hmmmm. what about grepping just for "error". Sorry for all the trouble.

sjau commented 6 years ago

there's tons of them:

https://paste.simplylinux.ch/view/raw/30b64aec

No need to be sorry :) You've created the encryption and provide help... so no need to be sorry...

tcaputi commented 6 years ago

Is the error reoprted in dmesg still failed (0 == 17) (the top line of the stack trace is cut off above).

sjau commented 6 years ago

Here's more dmesg output:

[    0.000000] x86/fpu: Using 'eager' FPU context switches.
[    0.000000] BIOS-e820: [mem 0x00000000bd3c4000-0x00000000bd451fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
[    0.000000]   00000-9FFFF write-back
[    0.000000]   4 base 00BF000000 mask 7FFF000000 uncachable
[    0.000000] BRK [0x13450b000, 0x13450bfff] PGTABLE
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: HPET 0x00000000BD5B3510 000038 (v01 ALASKA A M I    01072009 AMI. 00000005)
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   node   0: [mem 0x00000000bdfff000-0x00000000bdffffff]
[    0.000000]   Normal zone: 7536128 pages, LIFO batch:31
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] PM: Registered nosave memory: [mem 0xaae99000-0xab36cfff]
[    0.000000] PM: Registered nosave memory: [mem 0xfec00000-0xfec00fff]
[    0.000000] e820: [mem 0xcf200000-0xefffffff] available for PCI devices
c56v2h-nixos-system-servi-nixos-18.03pre118381.4068703502/init loglevel=4
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.017115] CPU: Processor Core ID: 0
[    0.035611] ... generic registers:      4
[    0.480112] NET: Registered protocol family 16
[    0.493460] ACPI: Added _OSI(Processor Aggregator Device)
[    0.500638] ACPI: Power Resource [PG01] (on)
[    0.509007] pci_bus 0000:00: root bus resource [mem 0x000e4000-0x000e7fff window]
[    0.509339] pci 0000:00:14.0: System wakeup disabled by ACPI
[    0.509775] pci 0000:00:1b.0: [8086:8ca0] type 00 class 0x040300
[    0.510310] pci 0000:00:1f.0: [8086:8cc4] type 00 class 0x060100
[    0.510885] pci 0000:00:1c.2: PCI bridge to [bus 02-03] (subtractive decode)
[    0.511003] pci 0000:02:00.0:   bridge window [mem 0x000d0000-0x000d3fff window] (subtractive decode)
[    0.511801] ACPI: PCI Interrupt Link [LNKG] (IRQs *3 4 5 6 10 11 12 14 15)
[    0.514591] e820: reserve RAM buffer [mem 0x82fe00000-0x82fffffff]
[    0.520322] system 00:03: Plug and Play ACPI device, IDs INT3f0d PNP0c02 (active)
[    0.520900] system 00:07: Plug and Play ACPI device, IDs PNP0c02 (active)
[    0.526565] pci_bus 0000:02: resource 8 [mem 0x000d4000-0x000d7fff window]
[    0.527013] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[    0.645616] intel_idle: MWAIT substates: 0x42120
[    1.645668] tsc: Refined TSC clocksource calibration: 3399.013 MHz
[    2.841518] rtc_cmos 00:02: rtc core: registered rtc_cmos as rtc0
[    2.868667] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    2.880924] xhci_hcd 0000:00:14.0: hcc params 0x200077c1 hci version 0x100 quirks 0x00009810
[    2.893645] ata1: SATA max UDMA/133 abar m2048@0xefc39000 port 0xefc39100 irq 26
[    3.204474] ata3.00: configured for UDMA/133
[    3.269092] sd 2:0:0:0: [sdc] 4096-byte physical blocks
[    3.321525] sd 0:0:0:0: [sda] Attached SCSI disk
[   54.903311] NET: Registered protocol family 10
[   58.426044] loop: module loaded
[   59.727261] snd_hda_codec_realtek hdaudioC1D0:    speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[   59.739982] input: HDA Intel PCH Front Headphone as /devices/pci0000:00/0000:00:1b.0/sound/card1/input12
[   60.475133] Console: switching to colour dummy device 80x25
[   60.572747] audit: type=1130 audit(1509997934.560:7): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-remount-fs comm="systemd" exe="/nix/store/dbay1civ36b8hkr26wsxh586r331wy94-systemd-234/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   62.495233] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
96F6E3A20
[ 1314.303437] VERIFY3(0 == dmu_object_dirty_raw(os, object, tx)) failed (0 == 17)
[ 1314.303482]  [<ffffffffbd4f7a12>] dump_stack+0x63/0x81
[ 1314.303540]  [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1314.303584]  [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1314.303599]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1469.502873] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 1475.097656]  ffff95ee0c354240 ffff9cc890f0fd30 ffffffffbd765082 ffffffffc030cf52
[ 1475.097747]  [<ffffffffc017ecb8>] cv_wait_common+0x128/0x140 [spl]
[ 1475.097866]  [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1475.098077]  ffff95ee0c351a80 ffff9cc883c878c8 ffffffffbd765082 0000000000000000
[ 1475.098114]  [<ffffffffc017ece5>] __cv_wait+0x15/0x20 [spl]
[ 1475.098348]  [<ffffffffc01976ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1475.098735] Call Trace:
[ 1475.099203]  [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1597.966287] txg_quiesce     D    0   455      2 0x00000000
[ 1597.966346]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1597.966405]  0000000000000000 000000000003328e 0000000000000400 ffff95eddcab6a00
[ 1597.966431]  [<ffffffffc019762a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1597.966467]  [<ffffffffc01976ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1597.966511] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1597.966536]  [<ffffffffc017caeb>] spl_panic+0xeb/0x100 [spl]
[ 1597.966585]  [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1597.966620]  [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1597.966634]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1597.966681]  ffff95ee0c351a80 ffff9cc88326ba90 ffffffffbd765082 ffffffffc030cf52
[ 1597.966699]  [<ffffffffc017ece5>] __cv_wait+0x15/0x20 [spl]
[ 1597.966764]  [<ffffffffc02a4fe8>] dsl_dataset_snapshot+0x138/0x2a0 [zfs]
[ 1597.966782]  [<ffffffffc031705c>] zfs_ioc_snapshot+0x32c/0x390 [zfs]
[ 1720.835451] INFO: task txg_quiesce:455 blocked for more than 120 seconds.
[ 1720.835664]  ffff95ee0c354240 ffff9cc890f0fd30 ffffffffbd765082 ffffffffc030cf52
[ 1720.835752]  [<ffffffffbd765586>] schedule+0x36/0x80
[ 1720.835865]  [<ffffffffc02e8350>] ? txg_wait_open+0x100/0x100 [zfs]
[ 1720.835871]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1720.835877]  [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1720.835881]  [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1720.835884]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1720.835888]  [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
[ 1720.835907] INFO: task zfs:12024 blocked for more than 120 seconds.
[ 1720.835965]       Tainted: P           O    4.9.58 #1-NixOS
[ 1720.836016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1720.836086] zfs             D    0 12024  11990 0x00000000
[ 1720.836090]  ffff95eddb621800 0000000000000000 ffff95ee2fa97ec0 ffff95eddcab6a00
[ 1720.836095]  ffff95ee0c351a80 ffff9cc883c878c8 ffffffffbd765082 0000000000000000
[ 1720.836099]  0000000000000000 000000000003328e 0000000000000400 ffff95eddcab6a00
[ 1720.836104] Call Trace:
[ 1720.836110]  [<ffffffffbd765082>] ? __schedule+0x192/0x660
[ 1720.836114]  [<ffffffffbd765586>] schedule+0x36/0x80
[ 1720.836121]  [<ffffffffc017ecb8>] cv_wait_common+0x128/0x140 [spl]
[ 1720.836125]  [<ffffffffbd2ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1720.836131]  [<ffffffffc017ece5>] __cv_wait+0x15/0x20 [spl]
[ 1720.836165]  [<ffffffffc027df92>] bqueue_enqueue+0x62/0xe0 [zfs]
[ 1720.836205]  [<ffffffffc02938c1>] dmu_recv_stream+0x691/0x11c0 [zfs]
[ 1720.836211]  [<ffffffffc019762a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1720.836256]  [<ffffffffc031a8fa>] ? zfs_set_prop_nvlist+0x2fa/0x510 [zfs]
[ 1720.836296]  [<ffffffffc031b057>] zfs_ioc_recv_impl+0x407/0x1170 [zfs]
[ 1720.836337]  [<ffffffffc031c3f9>] zfs_ioc_recv_new+0x369/0x400 [zfs]
[ 1720.836343]  [<ffffffffc01772cc>] ? spl_kmem_alloc_impl+0x9c/0x180 [spl]
[ 1720.836373]  [<ffffffffc01794a9>] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1720.836382]  [<ffffffffc019c8af>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1720.836390]  [<ffffffffc019762a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1720.836399]  [<ffffffffc01976ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1720.836451]  [<ffffffffc03190eb>] zfsdev_ioctl+0x20b/0x660 [zfs]
[ 1720.836462]  [<ffffffffbd3ff604>] do_vfs_ioctl+0x94/0x5c0
[ 1720.836468]  [<ffffffffbd25dece>] ? __do_page_fault+0x25e/0x4c0
[ 1720.836473]  [<ffffffffbd3ffba9>] SyS_ioctl+0x79/0x90
[ 1720.836477]  [<ffffffffbd769ef7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1720.836482] INFO: task receive_writer:22544 blocked for more than 120 seconds.
[ 1720.836549]       Tainted: P           O    4.9.58 #1-NixOS
[ 1720.836600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1720.836670] receive_writer  D    0 22544      2 0x00000000
[ 1720.836674]  ffff95eddb621800 0000000000000000 ffff95ee2fb97ec0 ffff95ee0bf56a00
[ 1720.836679]  ffff95ee0bfc0000 ffff9cc8839a3b30 ffffffffbd765082 0000000000000000
[ 1720.836683]  ffffffffc03b95d0 00ffffffc03cf1d0 0000000000000001 ffff95ee0bf56a00
[ 1720.836687] Call Trace:
[ 1720.836693]  [<ffffffffbd765082>] ? __schedule+0x192/0x660
[ 1720.836698]  [<ffffffffbd765586>] schedule+0x36/0x80
[ 1720.836706]  [<ffffffffc017caeb>] spl_panic+0xeb/0x100 [spl]
[ 1720.836748]  [<ffffffffc030d3c2>] ? __set_error+0x22/0x30 [zfs]
[ 1720.836790]  [<ffffffffc029a107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1720.836827]  [<ffffffffc029a443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1720.836830]  [<ffffffffbd7671e2>] ? mutex_lock+0x12/0x30
[ 1720.836880]  [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1720.836918]  [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1720.836954]  [<ffffffffc0284a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1720.836988]  [<ffffffffc0284b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1720.837024]  [<ffffffffc0291858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1720.837058]  [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1720.837062]  [<ffffffffbd3ce021>] ? __slab_free+0xa1/0x2e0
[ 1720.837068]  [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1720.837074]  [<ffffffffc0176f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1720.837080]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1720.837113]  [<ffffffffc0291910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1720.837118]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1720.837123]  [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1720.837127]  [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1720.837131]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1720.837135]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1720.837139]  [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
tcaputi commented 6 years ago

Hmmm. You definitely should be seeing a message with error 17 in it.... My only explanation is that the log could be cut off. Maybe you can increase the debug log to 1G temporarily by doing the following. Really sorry to ask you to keep crashing your machine but I'm still not able to reproduce it locally yet:

echo 1073741824> /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
sjau commented 6 years ago

So what commands in what order to execute? The server is still for testing... real server still runs mdadm raid1, luks-encrypted debian with ext 4 and uses rsync for backup ;)

tcaputi commented 6 years ago
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
# cause the panic
cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'

This will use up 1G of system ram for debug messages, by the way so you might want to tune zfs_dbgmsg_maxsize back once we have everything all figured out (or rebooting will clear the value).

sjau commented 6 years ago

not issuing the echo 4294967263 > /sys/module/zfs/parameters/zfs_flags command?

tcaputi commented 6 years ago

Yes. I'm sorry. Thanks for pointing that out. I updated the commands above.

sjau commented 6 years ago

so restart over again :) btw missing space for the dbgmsg_maxsize before the > :)

sjau commented 6 years ago

Now we have output:

root@servi-nixos:~# zfs destroy -r serviTank/encZFS/BU/subi/Nixos
cannot open 'serviTank/encZFS/BU/subi/Nixos': dataset does not exist
root@servi-nixos:~# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
serviTank                 90.1G  3.42T    96K  /serviTank
serviTank/encZFS          55.5G  3.42T  1.39M  none
serviTank/encZFS/BU       2.78M  3.42T  1.39M  none
serviTank/encZFS/BU/subi  1.39M  3.42T  1.39M  none
serviTank/encZFS/Nixos    55.5G  3.42T  10.7G  legacy
root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
root@servi-nixos:~# dmesg | tail -n 40
[   77.492055] audit: type=1300 audit(1510001435.468:38): arch=c000003e syscall=54 success=yes exit=0 a0=4 a1=29 a2=40 a3=15d55c0 items=0 ppid=6321 pid=6550 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ip6tables" exe="/nix/store/hnwjdbj5w41sxw60cf4nfa93z8cr59nb-iptables-1.6.1/bin/xtables-multi" key=(null)
[   77.492057] audit: type=1327 audit(1510001435.468:38): proctitle=6970367461626C6573002D77002D7400726177002D4E006E69786F732D66772D727066696C746572
[   77.575067] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
[   80.088561] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[   80.917907] systemd-journald[6236]: Failed to set ACL on /var/log/journal/cc00e28c9c9241ad97b93019da77628d/user-1000.journal, ignoring: Operation not supported
[   90.931897] wireguard: WireGuard 0.0.20171017 loaded. See www.wireguard.com for information.
[   90.931898] wireguard: Copyright (C) 2015-2017 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[ 1244.050819] VERIFY3(0 == dmu_object_dirty_raw(os, object, tx)) failed (0 == 17)
[ 1244.050842] PANIC at dmu.c:937:dmu_free_long_object_impl()
[ 1244.050854] Showing stack for process 8103
[ 1244.050856] CPU: 6 PID: 8103 Comm: receive_writer Tainted: P           O    4.9.58 #1-NixOS
[ 1244.050856] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1244.050857]  ffffa13d43ac7b38 ffffffff842f7a12 ffffffffc057ce03 00000000000003a9
[ 1244.050859]  ffffa13d43ac7b48 ffffffffc032d9f2 ffffa13d43ac7cd0 ffffffffc032dac5
[ 1244.050860]  0000000000000028 ffffa13d00000030 ffffa13d43ac7ce0 ffffa13d43ac7c80
[ 1244.050861] Call Trace:
[ 1244.050865]  [<ffffffff842f7a12>] dump_stack+0x63/0x81
[ 1244.050869]  [<ffffffffc032d9f2>] spl_dumpstack+0x42/0x50 [spl]
[ 1244.050870]  [<ffffffffc032dac5>] spl_panic+0xc5/0x100 [spl]
[ 1244.050890]  [<ffffffffc04bb3c2>] ? __set_error+0x22/0x30 [zfs]
[ 1244.050900]  [<ffffffffc0448107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1244.050909]  [<ffffffffc0448443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1244.050910]  [<ffffffff845671e2>] ? mutex_lock+0x12/0x30
[ 1244.050922]  [<ffffffffc047784b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1244.050931]  [<ffffffffc044820b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1244.050939]  [<ffffffffc0432a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1244.050947]  [<ffffffffc0432b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1244.050955]  [<ffffffffc043f858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1244.050963]  [<ffffffffc043fcb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1244.050965]  [<ffffffff841ce069>] ? __slab_free+0xe9/0x2e0
[ 1244.050966]  [<ffffffff840a5200>] ? set_next_entity+0x70/0x890
[ 1244.050968]  [<ffffffffc0327f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1244.050969]  [<ffffffffc032a5d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1244.050977]  [<ffffffffc043f910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1244.050978]  [<ffffffffc032a5d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1244.050980]  [<ffffffffc032a642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1244.050981]  [<ffffffff8408e457>] kthread+0xd7/0xf0
[ 1244.050981]  [<ffffffff8408e380>] ? kthread_park+0x60/0x60
[ 1244.050982]  [<ffffffff8408e380>] ? kthread_park+0x60/0x60
[ 1244.050983]  [<ffffffff8456a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
1510001732   dnode.c:1421:dnode_hold_impl(): error 17
1510002335   dnode.c:1421:dnode_hold_impl(): error 17
1510002602   dnode.c:1421:dnode_hold_impl(): error 17
1510002602   dnode.c:1421:dnode_hold_impl(): error 17
tcaputi commented 6 years ago

Wonderful. I should be able to look at this tonight. Thanks for the help.

sjau commented 6 years ago

doesn't look to me like that's of any help... but well :) thanks for your work.

tcaputi commented 6 years ago

It's the line number. That should be enough for me to figure it out but I'll post here if it's not

sjau commented 6 years ago

Good luck.

tcaputi commented 6 years ago

Oh. I think I already fixed this actually: 35df0bb5565c81ba52de430108b37eced9de0606.

It looks like thats the only encryption fix (so far) that you don't have. Sorry for all the confusion. I'll close this for now. Feel free to reopen the issue if you try the new code and see that its still broken.

sjau commented 6 years ago

Hmmm, that was fixed 19 days ago... why haven't I that one yet :) well, thanks for checking :)

sjau commented 6 years ago

Sorry for taking this long. There was a break-in and entry in the office and we had to resetup everything...

Anyway, I tested now with latest git master and still same issues (to me):

root@servi-nixos:~# zfs destroy -r serviTank/encZFS/BU/subi/Nixos
root@servi-nixos:~# zfs list
NAME                                                 USED  AVAIL  REFER  MOUNTPOINT
serviTank                                            366G  3.15T    96K  /serviTank
serviTank/encZFS                                     271G  3.15T  1.39M  none
serviTank/encZFS/BU                                  195G  3.15T  1.39M  none
serviTank/encZFS/BU/jus-law                          195G  3.15T  1.39M  none
serviTank/encZFS/BU/jus-law/Nixos-2017-11-12_05-00  1.68G  3.15T  1.68G  none
serviTank/encZFS/BU/jus-law/Nixos-2017-11-12_13-00  1.68G  3.15T  1.68G  none
serviTank/encZFS/BU/jus-law/VMs-2017-11-12_05-00    95.4G  3.15T  95.4G  none
serviTank/encZFS/BU/jus-law/VMs-2017-11-12_13-00    96.1G  3.15T  96.1G  none
serviTank/encZFS/BU/subi                            1.39M  3.15T  1.39M  none
serviTank/encZFS/Nixos                              76.5G  3.15T  14.2G  legacy
root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags

[  144.227900] wireguard: WireGuard 0.0.20171101 loaded. See www.wireguard.com for information.
[  144.227901] wireguard: Copyright (C) 2015-2017 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[ 1455.640657] VERIFY3(0 == dmu_object_set_nlevels(rwa->os, drro->drr_object, drro->drr_nlevels, tx)) failed (0 == 22)
[ 1455.640683] PANIC at dmu_send.c:2540:receive_object()
[ 1455.640694] Showing stack for process 24166
[ 1455.640696] CPU: 0 PID: 24166 Comm: receive_writer Tainted: P           O    4.9.61 #1-NixOS
[ 1455.640696] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1455.640697]  ffffb1001724fb68 ffffffffad0f7ba2 ffffffffc03f9e64 00000000000009ec
[ 1455.640699]  ffffb1001724fb78 ffffffffc01a95c2 ffffb1001724fd00 ffffffffc01a9695
[ 1455.640700]  0000000000000000 0000000000000030 ffffb1001724fd10 ffffb1001724fcb0
[ 1455.640701] Call Trace:
[ 1455.640705]  [<ffffffffad0f7ba2>] dump_stack+0x63/0x81
[ 1455.640709]  [<ffffffffc01a95c2>] spl_dumpstack+0x42/0x50 [spl]
[ 1455.640710]  [<ffffffffc01a9695>] spl_panic+0xc5/0x100 [spl]
[ 1455.640712]  [<ffffffffc01a3f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1455.640734]  [<ffffffffc02c45b3>] ? dnode_rele_and_unlock+0x53/0x80 [zfs]
[ 1455.640744]  [<ffffffffc02c4619>] ? dnode_rele+0x39/0x40 [zfs]
[ 1455.640753]  [<ffffffffc02b8d8f>] receive_object+0x51f/0x720 [zfs]
[ 1455.640762]  [<ffffffffc02c4619>] ? dnode_rele+0x39/0x40 [zfs]
[ 1455.640770]  [<ffffffffc02bbf36>] receive_writer_thread+0x3c6/0xd50 [zfs]
[ 1455.640772]  [<ffffffffacfce0a9>] ? __slab_free+0xe9/0x2e0
[ 1455.640773]  [<ffffffffacea5240>] ? set_next_entity+0x70/0x890
[ 1455.640775]  [<ffffffffc01a3f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1455.640776]  [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1455.640785]  [<ffffffffc02bbb70>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1455.640786]  [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1455.640787]  [<ffffffffc01a6642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1455.640788]  [<fffffffface8e497>] kthread+0xd7/0xf0
[ 1455.640789]  [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1455.640790]  [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1455.640791]  [<ffffffffad36a155>] ret_from_fork+0x25/0x30
[ 1598.279468] INFO: task txg_quiesce:8945 blocked for more than 120 seconds.
[ 1598.279546]       Tainted: P           O    4.9.61 #1-NixOS
[ 1598.279599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1598.279670] txg_quiesce     D    0  8945      2 0x00000000
[ 1598.279677]  ffff9b75c992a800 0000000000000000 ffff9b75efb97e40 ffff9b75a7b4ea00
[ 1598.279683]  ffff9b75cc354f80 ffffb1001d7a3d30 ffffffffad365082 ffffb1001d7a3d00
[ 1598.279688]  0000000000000246 000000018020001f ffffb1001d7a3d50 ffff9b75a7b4ea00
[ 1598.279694] Call Trace:
[ 1598.279710]  [<ffffffffad365082>] ? __schedule+0x192/0x660
[ 1598.279716]  [<ffffffffad365586>] schedule+0x36/0x80
[ 1598.279729]  [<ffffffffc01ab888>] cv_wait_common+0x128/0x140 [spl]
[ 1598.279736]  [<ffffffffacead3d0>] ? wake_atomic_t_function+0x60/0x60
[ 1598.279744]  [<ffffffffc01ab8b5>] __cv_wait+0x15/0x20 [spl]
[ 1598.279830]  [<ffffffffc0312a13>] txg_quiesce_thread+0x2e3/0x3f0 [zfs]
[ 1598.279883]  [<ffffffffc0312730>] ? txg_wait_open+0x100/0x100 [zfs]
[ 1598.279891]  [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1598.279897]  [<ffffffffc01a6642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1598.279902]  [<fffffffface8e497>] kthread+0xd7/0xf0
[ 1598.279905]  [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1598.279908]  [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1598.279912]  [<ffffffffad36a155>] ret_from_fork+0x25/0x30
[ 1598.279931] INFO: task zfs:15764 blocked for more than 120 seconds.
[ 1598.279990]       Tainted: P           O    4.9.61 #1-NixOS
[ 1598.280041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1598.280112] zfs             D    0 15764  15753 0x00000000
[ 1598.280116]  ffff9b75afe30400 0000000000000000 ffff9b75efb57e40 ffff9b75c995a7c0
[ 1598.280121]  ffff9b75cc354240 ffffb10003adb8c8 ffffffffad365082 ffffffffc02a2e06
[ 1598.280126]  ffff9b72d4689090 0000000000000001 ffffb10003adb8d0 ffff9b75c995a7c0
[ 1598.280131] Call Trace:
[ 1598.280138]  [<ffffffffad365082>] ? __schedule+0x192/0x660
[ 1598.280180]  [<ffffffffc02a2e06>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1598.280186]  [<ffffffffad365586>] schedule+0x36/0x80
[ 1598.280194]  [<ffffffffc01ab888>] cv_wait_common+0x128/0x140 [spl]
[ 1598.280199]  [<ffffffffacead3d0>] ? wake_atomic_t_function+0x60/0x60
[ 1598.280205]  [<ffffffffc01ab8b5>] __cv_wait+0x15/0x20 [spl]
[ 1598.280246]  [<ffffffffc02a81d2>] bqueue_enqueue+0x62/0xe0 [zfs]
[ 1598.280294]  [<ffffffffc02bdb21>] dmu_recv_stream+0x691/0x11c0 [zfs]
[ 1598.280302]  [<ffffffffc01c462a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1598.280355]  [<ffffffffc0344d1a>] ? zfs_set_prop_nvlist+0x2fa/0x510 [zfs]
[ 1598.280404]  [<ffffffffc0345477>] zfs_ioc_recv_impl+0x407/0x1170 [zfs]
[ 1598.280472]  [<ffffffffc0346819>] zfs_ioc_recv_new+0x369/0x400 [zfs]
[ 1598.280481]  [<ffffffffc01a42cc>] ? spl_kmem_alloc_impl+0x9c/0x180 [spl]
[ 1598.280488]  [<ffffffffc01a64a9>] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1598.280495]  [<ffffffffc01c98af>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1598.280501]  [<ffffffffc01c462a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1598.280507]  [<ffffffffc01c46ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1598.280555]  [<ffffffffc034350b>] zfsdev_ioctl+0x20b/0x660 [zfs]
[ 1598.280568]  [<ffffffffacfff6e4>] do_vfs_ioctl+0x94/0x5c0
[ 1598.280575]  [<fffffffface5dece>] ? __do_page_fault+0x25e/0x4c0
[ 1598.280580]  [<ffffffffacfffc89>] SyS_ioctl+0x79/0x90
[ 1598.280585]  [<ffffffffad369ef7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1598.280595] INFO: task receive_writer:24166 blocked for more than 120 seconds.
[ 1598.280662]       Tainted: P           O    4.9.61 #1-NixOS
[ 1598.280713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1598.280783] receive_writer  D    0 24166      2 0x00000000
[ 1598.280787]  ffff9b75c9b13000 0000000000000000 ffff9b75efa17e40 ffff9b75a7b4cf80
[ 1598.280792]  ffff9b75cbe027c0 ffffb1001724fb60 ffffffffad365082 0000000000000000
[ 1598.280797]  ffffffffc03e4728 00ffffffc03fa708 0000000000000000 ffff9b75a7b4cf80
[ 1598.280802] Call Trace:
[ 1598.280809]  [<ffffffffad365082>] ? __schedule+0x192/0x660
[ 1598.280815]  [<ffffffffad365586>] schedule+0x36/0x80
[ 1598.280823]  [<ffffffffc01a96bb>] spl_panic+0xeb/0x100 [spl]
[ 1598.280831]  [<ffffffffc01a3f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1598.280883]  [<ffffffffc02c45b3>] ? dnode_rele_and_unlock+0x53/0x80 [zfs]
[ 1598.280931]  [<ffffffffc02c4619>] ? dnode_rele+0x39/0x40 [zfs]
[ 1598.280977]  [<ffffffffc02b8d8f>] receive_object+0x51f/0x720 [zfs]
[ 1598.281023]  [<ffffffffc02c4619>] ? dnode_rele+0x39/0x40 [zfs]
[ 1598.281068]  [<ffffffffc02bbf36>] receive_writer_thread+0x3c6/0xd50 [zfs]
[ 1598.281073]  [<ffffffffacfce0a9>] ? __slab_free+0xe9/0x2e0
[ 1598.281080]  [<ffffffffacea5240>] ? set_next_entity+0x70/0x890
[ 1598.281086]  [<ffffffffc01a3f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1598.281093]  [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1598.281136]  [<ffffffffc02bbb70>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1598.281142]  [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1598.281149]  [<ffffffffc01a6642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1598.281152]  [<fffffffface8e497>] kthread+0xd7/0xf0
[ 1598.281156]  [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1598.281158]  [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1598.281163]  [<ffffffffad36a155>] ret_from_fork+0x25/0x30
modinfo zfs
filename:       /run/current-system/kernel-modules/lib/modules/4.9.61/extra/zfs/zfs.ko.xz
version:        0.7.0-1
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     FF479BCC41F37E7DD1898D9
depends:        spl,znvpair,zcommon,icp,zunicode,zavl
vermagic:       4.9.61 SMP mod_unload 

root@servi-nixos:~# modinfo spl
filename:       /run/current-system/kernel-modules/lib/modules/4.9.61/extra/spl/spl.ko.xz
version:        0.7.0-1
license:        GPL
author:         OpenZFS on Linux
description:    Solaris Porting Layer
srcversion:     E9C0243BEDADDD2716B8065
depends:        
vermagic:       4.9.61 SMP mod_unload 
behlendorf commented 6 years ago

In that case let me reopen this.

tcaputi commented 6 years ago

@sjau It seems your error is manifesting differently now. Can you provide me the output of

cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 22'

Now would also probably be a good time to mention that I am aware of some issues with raw sends which i am currently working on. I don't know if your issue is related to this, but I hope to have a PR opened to fix these issues tomorrow (although it will probably take some time before it is merged). You can follow this at #6845.

sjau commented 6 years ago

There we go:

using zfs rev. 5277f208f290ea4e2204800a66c3ba20a03fe503 and spl rev. 35a44fcb8d6e346f51be82dfe57562c2ea0c6a9c

[  126.133956] wireguard: Copyright (C) 2015-2017 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[44974.528445] refused connection: IN=eth0 OUT= MAC=d0:50:99:5c:cb:65:d8:58:d7:00:1d:a6:08:00 SRC=106.120.86.130 DST=10.0.0.3 LEN=40 TOS=0x00 PREC=0x00 TTL=37 ID=1514 PROTO=TCP SPT=34145 DPT=6882 WINDOW=1400 RES=0x00 SYN URGP=0
[61625.231168] refused connection: IN=eth0 OUT= MAC=d0:50:99:5c:cb:65:d8:58:d7:00:1d:a6:08:00 SRC=106.120.86.130 DST=10.0.0.3 LEN=40 TOS=0x00 PREC=0x00 TTL=37 ID=39254 PROTO=TCP SPT=34145 DPT=6882 WINDOW=1400 RES=0x00 SYN URGP=0
[122460.464918] VERIFY3(0 == dmu_object_set_nlevels(rwa->os, drro->drr_object, drro->drr_nlevels, tx)) failed (0 == 22)
[122460.464944] PANIC at dmu_send.c:2540:receive_object()
[122460.464996] Showing stack for process 9158
[122460.464998] CPU: 0 PID: 9158 Comm: receive_writer Tainted: P           O    4.9.61 #1-NixOS
[122460.464999] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[122460.465000]  ffffad51d01afb68 ffffffff9f8f7ba2 ffffffffc0598e64 00000000000009ec
[122460.465002]  ffffad51d01afb78 ffffffffc03485c2 ffffad51d01afd00 ffffffffc0348695
[122460.465004]  0000000000000000 0000000000000030 ffffad51d01afd10 ffffad51d01afcb0
[122460.465006] Call Trace:
[122460.465011]  [<ffffffff9f8f7ba2>] dump_stack+0x63/0x81
[122460.465016]  [<ffffffffc03485c2>] spl_dumpstack+0x42/0x50 [spl]
[122460.465018]  [<ffffffffc0348695>] spl_panic+0xc5/0x100 [spl]
[122460.465020]  [<ffffffffc0342f53>] ? spl_kmem_free+0x33/0x40 [spl]
[122460.465054]  [<ffffffffc04635b3>] ? dnode_rele_and_unlock+0x53/0x80 [zfs]
[122460.465063]  [<ffffffffc0463619>] ? dnode_rele+0x39/0x40 [zfs]
[122460.465072]  [<ffffffffc0457d8f>] receive_object+0x51f/0x720 [zfs]
[122460.465081]  [<ffffffffc0463619>] ? dnode_rele+0x39/0x40 [zfs]
[122460.465089]  [<ffffffffc045af36>] receive_writer_thread+0x3c6/0xd50 [zfs]
[122460.465091]  [<ffffffff9f7ce061>] ? __slab_free+0xa1/0x2e0
[122460.465093]  [<ffffffff9f6a5240>] ? set_next_entity+0x70/0x890
[122460.465094]  [<ffffffffc0342f53>] ? spl_kmem_free+0x33/0x40 [spl]
[122460.465095]  [<ffffffffc03455d0>] ? __thread_exit+0x20/0x20 [spl]
[122460.465104]  [<ffffffffc045ab70>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[122460.465105]  [<ffffffffc03455d0>] ? __thread_exit+0x20/0x20 [spl]
[122460.465106]  [<ffffffffc0345642>] thread_generic_wrapper+0x72/0x80 [spl]
[122460.465107]  [<ffffffff9f68e497>] kthread+0xd7/0xf0
[122460.465108]  [<ffffffff9f68e3c0>] ? kthread_park+0x60/0x60
[122460.465109]  [<ffffffff9f68e3c0>] ? kthread_park+0x60/0x60
[122460.465110]  [<ffffffff9fb6a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 22'
1510638852   zap.c:765:fzap_checksize(): error 22
1510638858   zap.c:765:fzap_checksize(): error 22
1510638863   zap.c:765:fzap_checksize(): error 22
[...]
1510639205   zap.c:765:fzap_checksize(): error 22
1510639206   zap.c:765:fzap_checksize(): error 22
1510639207   zap.c:765:fzap_checksize(): error 22
1510639207   dnode.c:1222:dnode_hold_impl(): error 22
1510639207   dnode.c:1222:dnode_hold_impl(): error 22
1510639207   dnode.c:1222:dnode_hold_impl(): error 22
[...]
1510639207   dnode.c:1222:dnode_hold_impl(): error 22
1510639207   dnode.c:1222:dnode_hold_impl(): error 22
1510639207   dnode.c:1222:dnode_hold_impl(): error 22
1510639207   dnode.c:1222:dnode_hold_impl(): error 22
1510639208   zap.c:765:fzap_checksize(): error 22
1510639209   zap.c:765:fzap_checksize(): error 22
1510639210   zap.c:765:fzap_checksize(): error 22
[...]
1510639994   zap.c:765:fzap_checksize(): error 22
1510639994   zap.c:765:fzap_checksize(): error 22
1510640005   zap.c:765:fzap_checksize(): error 22
1510640009   dnode.c:1712:dnode_set_nlevels(): error 22
1510640009   dnode.c:1712:dnode_set_nlevels(): error 22
1510640011   zap.c:765:fzap_checksize(): error 22

Btw, is there a way to send encrypted datasets not in raw mode? I know raw mode enables me to send an encrypted DS to any point and be sure to have it encrypted as well.

However, if the receiver has the top ds also encrypted, then newly created dataset should also be encrypted right? Hence in that case, raw mode seems not to be necessary to ensure that the child dataset will also be encrypted. However I couldn't figure that out.

tcaputi commented 6 years ago

Btw, is there a way to send encrypted datasets not in raw mode?

Yes simply load the key and use zfs send without -w. Unfortunately though, for the moment -p and -R imply -w.

However, if the receiver has the top ds also encrypted, then newly created dataset should also be encrypted right?

Correct. Raw send is useful if you want to backup to an untrusted server in such a way that they cannot read your data, but you can if they give it back to you. It basically just takes the encrypted ddata that is on-disk and transfers it directly as-is to the new destination.

sjau commented 6 years ago

So no incremental sending possbile without raw mode.

Btw, also had a read at https://github.com/zfsonlinux/zfs/pull/6864 and https://github.com/zfsonlinux/zfsonlinux.github.com/pull/35/files#diff-5c212f01f7fcb3ccece492a2f81b17caR202 - thanks for all the hard work you put in.

tcaputi commented 6 years ago

So no incremental sending possbile without raw mode.

Incrementals are still possible. If you are having issues with them please let me know.

OK. Now I should just need one last bit of info to debug this. Sorry for making you do all the work, but I still haven't been able to reproduce this locally.

Please apply this diff:

diff --git a/module/zfs/dnode.c b/module/zfs/dnode.c
index c1fbf3c..dc08b9a 100644
--- a/module/zfs/dnode.c
+++ b/module/zfs/dnode.c
@@ -1709,6 +1709,9 @@ dnode_set_nlevels(dnode_t *dn, int nlevels, dmu_tx_t *tx)
                ret = 0;
                goto out;
        } else if (nlevels < dn->dn_nlevels) {
+               zfs_dbgmsg("HERE: obj = %llu nlevels = %d, dn_nlevels = %d",
+                   (unsigned long long)dn->dn_object,
+                   (int)nlevels, (int)dn->dn_nlevels);
                ret = SET_ERROR(EINVAL);
                goto out;
        }

This simply adds a debugging line that will show up in /proc/spl/kstat/zfs/dbgmsg. Once you have applied the diff, rebuilt the module and set the tunables, and installed it, cause the crash again. This time you should look for the new message with: cat /proc/spl/kstat/zfs/dbgmsg | grep 'HERE'. Paste the output of that here.

You will notice that the output will include obj = <some number>. Take this number and enter it into the following command, replacing the <dataset@snapshot> and <number> values:

zfs send <dataset@snapshot> | zstreamdump -v | grep -w 'OBJECT' | grep 'object = <number>'

Then provide the output of that command as well. Thanks again and hopefully that will be the last bit of info I need.

sjau commented 6 years ago

I'll have to see how I can apply this patch.... on NixOS it doesn't seem to be quite as trivial :(

tcaputi commented 6 years ago

If you can find the dnode_set_nlevels() function in dnode.c you should be able to just insert the lines indicated with + signs manually. (you will need to remove the + signs from the code to get it to compile.

sjau commented 6 years ago

The problem is nixos runs declaratively and I'm not sure if I movidy the source myself that it won't pull default from the repos.... I tried to advance the revisions used in a declarative mode but was not successfull. The way I succeeded was to mount the /nix/store in writeable mode, then find the zfs and spl default.nix file, alter the revision and sha sum directly there and have it rebuild without upgrade... that way it pulled the new revision sources...

sjau commented 6 years ago

Hmmm, couldn't apply the patch. Maybe I need another revision than the one I use.... I also assume it's for zfs and not spl?

applying patch /nix/store/q2pc8flj0l9a89izb0rgvr7jw1xcakp5-zfs_debug.patch
patching file module/zfs/dnode.c
Hunk #1 FAILED at 1709.
1 out of 1 hunk FAILED -- saving rejects to file module/zfs/dnode.c.rej
builder for ‘/nix/store/s440np4z3cm4fd7045wkhy7d4qqhir1a-zfs-kernel-2017-11-12-4.9.61.drv’ failed with exit code 1
tcaputi commented 6 years ago

The patch is for zfs. Can you find the file module/zfs/dnode.c? You should be able to open that file and find this code:

int
dnode_set_nlevels(dnode_t *dn, int nlevels, dmu_tx_t *tx)
{
    int ret = 0;

    rw_enter(&dn->dn_struct_rwlock, RW_WRITER);

    if (dn->dn_nlevels == nlevels) {
        ret = 0;
        goto out;
    } else if (nlevels < dn->dn_nlevels) {
        ret = SET_ERROR(EINVAL);
        goto out;
    }

    dnode_set_nlevels_impl(dn, nlevels, tx);

out:
    rw_exit(&dn->dn_struct_rwlock);
    return (ret);
}

The you change it to this:

int
dnode_set_nlevels(dnode_t *dn, int nlevels, dmu_tx_t *tx)
{
    int ret = 0;

    rw_enter(&dn->dn_struct_rwlock, RW_WRITER);

    if (dn->dn_nlevels == nlevels) {
        ret = 0;
        goto out;
    } else if (nlevels < dn->dn_nlevels) {
        ret = SET_ERROR(EINVAL);
                zfs_dbgmsg("HERE: obj = %llu nlevels = %d, dn_nlevels = %d",
                    (unsigned long long)dn->dn_object,
                    (int)nlevels, (int)dn->dn_nlevels);
        goto out;
    }

    dnode_set_nlevels_impl(dn, nlevels, tx);

out:
    rw_exit(&dn->dn_struct_rwlock);
    return (ret);
}

Although Im wondering if you are looking at the right code... Your dbgmsg error pointed at exactly the correct line number for me.

sjau commented 6 years ago

well, I used:

  zfsUnstable = common {
    # comment/uncomment if breaking kernel versions are known
    incompatibleKernelVersion = null;

    # this package should point to a version / git revision compatible with the latest kernel release
    version = "2017-11-12";

    rev = "5277f208f290ea4e2204800a66c3ba20a03fe503";
    sha256 = "0hhlhv4g678j1w45813xfrk8zza0af59cdkmib9bkxy0cn0jsnd6";
    isUnstable = true;

    extraPatches = [
      (fetchpatch {
        url = "https://www.sjau.ch/zfs_debug.patch";
        sha256 = "0cvq6g8yjgki32yrlzy3d95c695a82lmy1vx86flhil7ny1222b5";
      })
      (fetchpatch {
        url = "https://github.com/Mic92/zfs/compare/ded8f06a3cfee...nixos-zfs-2017-09-12.patch";
        sha256 = "033wf4jn0h0kp0h47ai98rywnkv5jwvf3xwym30phnaf8xxdx8aj";
      })
    ];

    spl = splUnstable;
  };
}
tcaputi commented 6 years ago

Oh I just remembered. github changes tabs to spaces on diffs which screws with git apply. Try this: https://pastebin.com/AHncmrBR

sjau commented 6 years ago

it's building now :)

sjau commented 6 years ago

So, successfully crashed it again:

root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'HERE'
1510685051   dnode.c:1715:dnode_set_nlevels(): HERE: obj = 393261 nlevels = 1, dn_nlevels = 2                                                                                                                                                                                  
1510685051   dnode.c:1715:dnode_set_nlevels(): HERE: obj = 393261 nlevels = 1, dn_nlevels = 2 

You said to run now:

zfs send <dataset@snapshot> | zstreamdump -v | grep -w 'OBJECT' | grep 'object = 39261'

So, you want me to reboot the crashed server, remove the data set from it that was just sent and cause the crash. And then from the sending server to re-initiate zfs/send with the above command? Shouldn't that be outputted finally to /dev/null or something?

Or do you want me on the currently crashed server to run that command?

tcaputi commented 6 years ago

So, you want me to reboot the crashed server, remove the data set from it that was just sent and cause the crash. And then from the sending server to re-initiate zfs/send with the above command? Shouldn't that be outputted finally to /dev/null or something? Or do you want me on the currently crashed server to run that command?

This command should be run on the send side machine. You should be able to run it even if the server is crashed (I think), but if not just reboot it. zstreamdump will just take the output of zfs send and make it human readable. The output doesn't need to be redirected anywhere. The zfs send command before the first pipe should just match the send command you used to cause the crash.

I also just wanted to confirm that this dataset is using the default recordsize?

sjau commented 6 years ago

ok, it's doing it's thing... will be another 20-30 minutes

sjau commented 6 years ago

There we go:

root@subi:~# zfs send -wR tank/encZFS/Nixos@zfs-auto-snap_hourly-2017-11-14-19h00 | zstreamdump -v | grep -w 'OBJECT' | grep 'object = 39261'
OBJECT object = 392610 type = 19 bonustype = 44 blksz = 512 bonuslen = 312 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392611 type = 19 bonustype = 44 blksz = 512 bonuslen = 300 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392612 type = 19 bonustype = 44 blksz = 512 bonuslen = 298 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392613 type = 19 bonustype = 44 blksz = 512 bonuslen = 298 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392614 type = 19 bonustype = 44 blksz = 512 bonuslen = 298 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392615 type = 19 bonustype = 44 blksz = 512 bonuslen = 295 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392616 type = 19 bonustype = 44 blksz = 512 bonuslen = 299 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392617 type = 19 bonustype = 44 blksz = 512 bonuslen = 295 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392618 type = 19 bonustype = 44 blksz = 512 bonuslen = 302 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392619 type = 19 bonustype = 44 blksz = 512 bonuslen = 303 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1