@sjau Have you tried this with the latest code from master? I recently fixed a good number of these bugs a couple weeks ago. If you are on the latest code, could you provide the send file that is causing the panic? Thanks a lot.
Not sure what Nixos unstable uses... I'll have to find out.
What do you mean by providing the send file that is causing the panic?
If I don't have a copy on the remote server yet and use the -R flag, it causes a panic. If I just use the -w flag, it works for the initial dataset (snapshot) send. But when I then try to send an incremental snapshot, it panics again.
Smallest dataset is ~ 95GB
Ok, already found it out... I use the unstable version because of encryption.
Currently NixOS uses ZFS unstable commit 7670f721fc82e6cdcdd31f83760a79b6f2f2b998
Checked it on server and on notebook - both use same revision - https://github.com/NixOS/nixpkgs/blob/master/pkgs/os-specific/linux/zfs/default.nix#L164
Not sure what Nixos unstable uses... I'll have to find out.
Looking at the stack trace it seems that you may(?) have the patch I was talking about after all. I'll try to look into this. It would be really helpful if you could provide a minimal set of commands to reproduce the problem. If not, I'll do my best to figure it out on my own.
What do you mean by providing the send file that is causing the panic?
I meant the output of the zfs send command you are using. But if your dataset is 95GB this might not be feasible.
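For example, the stream could be captured to a file instead of being piped over ssh (hypothetical snapshot name):
zfs send -w tank/encZFS/Nixos@some_snapshot > /tmp/panic-send.stream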
I can test it with a smaller dataset just to see how it goes.
Actually, it would be really helpful if you could cause it to crash one more time with the following debugging on the receiving side:
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
# perform the send, triggering the panic
Then provide the output of cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
The commands are simple:
If dataset does not exist, I can use:
zfs send -w "${srcDS}@${dsPrefix}_${now}" | ssh "${dstSrv}" "zfs receive ${dstDS}/${dsName}"
That will create the dataset nicely on the destination server - alas, I haven't tried to mount it and check it... However, if I use the -R option, it causes a panic:
zfs send -wR "${srcDS}@${dsPrefix}_${now}" | ssh "${dstSrv}" "zfs receive ${dstDS}/${dsName}"
Also for incremental:
If I didn't use the -R option for the initial dataset creation on the server and then try to send an incremental dataset like below, it also causes a panic:
zfs send -wi "#${incDS}" "${srcDS}@${dsPrefix}_${now}" | ssh "${dstSrv}" "zfs receive ${dstDS}/${dsName}"
Shall I crash it with initial dataset (-R) or incremental set?
The commands are simple:
We actually already have tests for these commands in our test suite (which is run against every pull request). This is probably a problem that is triggered by the data in your dataset.
Shall I crash it with initial dataset (-R) or incremental set?
The initial one please. Thanks a lot.
Doing so now... will take a few minutes to send 95GB or so....
FYI: The script I created to take snapshots and bookmarks and send them: https://paste.simplylinux.ch/view/raw/d7809481
Nothing found
[189601.664736] Showing stack for process 3880
[189601.664737] CPU: 7 PID: 3880 Comm: receive_writer Tainted: P O 4.9.58 #1-NixOS
[189601.664738] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[189601.664739] ffffbd1250b2fb38 ffffffffa88f7a12 ffffffffc02ffe03 00000000000003a9
[189601.664740] ffffbd1250b2fb48 ffffffffc00b09f2 ffffbd1250b2fcd0 ffffffffc00b0ac5
[189601.664741] 0000000000000000 ffffbd1200000030 ffffbd1250b2fce0 ffffbd1250b2fc80
[189601.664742] Call Trace:
[189601.664747] [<ffffffffa88f7a12>] dump_stack+0x63/0x81
[189601.664752] [<ffffffffc00b09f2>] spl_dumpstack+0x42/0x50 [spl]
[189601.664753] [<ffffffffc00b0ac5>] spl_panic+0xc5/0x100 [spl]
[189601.664774] [<ffffffffc01a9c26>] ? dbuf_rele+0x36/0x40 [zfs]
[189601.664784] [<ffffffffc01cb107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[189601.664792] [<ffffffffc01cb443>] ? dnode_setdirty+0x83/0x100 [zfs]
[189601.664793] [<ffffffffa8b671e2>] ? mutex_lock+0x12/0x30
[189601.664805] [<ffffffffc01fa84b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[189601.664814] [<ffffffffc01cb20b>] ? dnode_hold+0x1b/0x20 [zfs]
[189601.664823] [<ffffffffc01b5a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[189601.664831] [<ffffffffc01b5b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[189601.664840] [<ffffffffc01c2858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[189601.664848] [<ffffffffc01c2cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[189601.664849] [<ffffffffa87ce021>] ? __slab_free+0xa1/0x2e0
[189601.664851] [<ffffffffa86a5200>] ? set_next_entity+0x70/0x890
[189601.664852] [<ffffffffc00aaf53>] ? spl_kmem_free+0x33/0x40 [spl]
[189601.664854] [<ffffffffc00ad5d0>] ? __thread_exit+0x20/0x20 [spl]
[189601.664862] [<ffffffffc01c2910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[189601.664863] [<ffffffffc00ad5d0>] ? __thread_exit+0x20/0x20 [spl]
[189601.664864] [<ffffffffc00ad642>] thread_generic_wrapper+0x72/0x80 [spl]
[189601.664865] [<ffffffffa868e457>] kthread+0xd7/0xf0
[189601.664866] [<ffffffffa868e380>] ? kthread_park+0x60/0x60
[189601.664867] [<ffffffffa8b6a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
root@servi-nixos:~#
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg > /tmp/dbgmsg
root@servi-nixos:~# ls -al /tmp/dbgmsg
-rw-r--r-- 1 root root 3462117 Nov 6 20:35 /tmp/dbgmsg
The whole dbgmsg output can be viewed here https://paste.simplylinux.ch/view/raw/65e6bc7c
@sjau
I'm sorry. I forgot a step that you need to run on non-debug builds. Unfortunately you will need to restart the receiving machine and run this immediately after echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable:
echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
Still no output:
root@servi-nixos:~# zfs destroy -r serviTank/encZFS/BU/subi/Nixos
root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
root@servi-nixos:~# dmesg | tail -n 20
[ 1314.303517] [<ffffffffc029a107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1314.303526] [<ffffffffc029a443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1314.303528] [<ffffffffbd7671e2>] ? mutex_lock+0x12/0x30
[ 1314.303540] [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1314.303549] [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1314.303557] [<ffffffffc0284a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1314.303565] [<ffffffffc0284b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1314.303573] [<ffffffffc0291858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1314.303581] [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1314.303583] [<ffffffffbd3ce021>] ? __slab_free+0xa1/0x2e0
[ 1314.303584] [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1314.303586] [<ffffffffc0176f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1314.303587] [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303595] [<ffffffffc0291910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1314.303596] [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303597] [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1314.303598] [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1314.303599] [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303599] [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303600] [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
root@servi-nixos:~# dmesg | tail -n 30
[ 1314.303473] CPU: 6 PID: 22544 Comm: receive_writer Tainted: P O 4.9.58 #1-NixOS
[ 1314.303473] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1314.303474] ffff9cc8839a3b38 ffffffffbd4f7a12 ffffffffc03cee03 00000000000003a9
[ 1314.303476] ffff9cc8839a3b48 ffffffffc017c9f2 ffff9cc8839a3cd0 ffffffffc017cac5
[ 1314.303477] 0000000000000028 ffff9cc800000030 ffff9cc8839a3ce0 ffff9cc8839a3c80
[ 1314.303478] Call Trace:
[ 1314.303482] [<ffffffffbd4f7a12>] dump_stack+0x63/0x81
[ 1314.303486] [<ffffffffc017c9f2>] spl_dumpstack+0x42/0x50 [spl]
[ 1314.303487] [<ffffffffc017cac5>] spl_panic+0xc5/0x100 [spl]
[ 1314.303507] [<ffffffffc030d3c2>] ? __set_error+0x22/0x30 [zfs]
[ 1314.303517] [<ffffffffc029a107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1314.303526] [<ffffffffc029a443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1314.303528] [<ffffffffbd7671e2>] ? mutex_lock+0x12/0x30
[ 1314.303540] [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1314.303549] [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1314.303557] [<ffffffffc0284a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1314.303565] [<ffffffffc0284b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1314.303573] [<ffffffffc0291858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1314.303581] [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1314.303583] [<ffffffffbd3ce021>] ? __slab_free+0xa1/0x2e0
[ 1314.303584] [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1314.303586] [<ffffffffc0176f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1314.303587] [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303595] [<ffffffffc0291910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1314.303596] [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303597] [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1314.303598] [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1314.303599] [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303599] [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303600] [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
root@servi-nixos:~#
Hmmm. What about grepping just for "error"? Sorry for all the trouble.
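That is, the same command with a broader pattern:
cat /proc/spl/kstat/zfs/dbgmsg | grep 'error'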
there's tons of them:
https://paste.simplylinux.ch/view/raw/30b64aec
No need to be sorry :) You've created the encryption and you're providing help... so no need to be sorry...
Is the error reported in dmesg still failed (0 == 17)? (The top line of the stack trace is cut off above.)
Here's more dmesg output:
[ 0.000000] x86/fpu: Using 'eager' FPU context switches.
[ 0.000000] BIOS-e820: [mem 0x00000000bd3c4000-0x00000000bd451fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
[ 0.000000] 00000-9FFFF write-back
[ 0.000000] 4 base 00BF000000 mask 7FFF000000 uncachable
[ 0.000000] BRK [0x13450b000, 0x13450bfff] PGTABLE
[ 0.000000] ACPI: Early table checksum verification disabled
[ 0.000000] ACPI: HPET 0x00000000BD5B3510 000038 (v01 ALASKA A M I 01072009 AMI. 00000005)
[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.000000] node 0: [mem 0x00000000bdfff000-0x00000000bdffffff]
[ 0.000000] Normal zone: 7536128 pages, LIFO batch:31
[ 0.000000] ACPI: IRQ9 used by override.
[ 0.000000] PM: Registered nosave memory: [mem 0xaae99000-0xab36cfff]
[ 0.000000] PM: Registered nosave memory: [mem 0xfec00000-0xfec00fff]
[ 0.000000] e820: [mem 0xcf200000-0xefffffff] available for PCI devices
c56v2h-nixos-system-servi-nixos-18.03pre118381.4068703502/init loglevel=4
[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[ 0.017115] CPU: Processor Core ID: 0
[ 0.035611] ... generic registers: 4
[ 0.480112] NET: Registered protocol family 16
[ 0.493460] ACPI: Added _OSI(Processor Aggregator Device)
[ 0.500638] ACPI: Power Resource [PG01] (on)
[ 0.509007] pci_bus 0000:00: root bus resource [mem 0x000e4000-0x000e7fff window]
[ 0.509339] pci 0000:00:14.0: System wakeup disabled by ACPI
[ 0.509775] pci 0000:00:1b.0: [8086:8ca0] type 00 class 0x040300
[ 0.510310] pci 0000:00:1f.0: [8086:8cc4] type 00 class 0x060100
[ 0.510885] pci 0000:00:1c.2: PCI bridge to [bus 02-03] (subtractive decode)
[ 0.511003] pci 0000:02:00.0: bridge window [mem 0x000d0000-0x000d3fff window] (subtractive decode)
[ 0.511801] ACPI: PCI Interrupt Link [LNKG] (IRQs *3 4 5 6 10 11 12 14 15)
[ 0.514591] e820: reserve RAM buffer [mem 0x82fe00000-0x82fffffff]
[ 0.520322] system 00:03: Plug and Play ACPI device, IDs INT3f0d PNP0c02 (active)
[ 0.520900] system 00:07: Plug and Play ACPI device, IDs PNP0c02 (active)
[ 0.526565] pci_bus 0000:02: resource 8 [mem 0x000d4000-0x000d7fff window]
[ 0.527013] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[ 0.645616] intel_idle: MWAIT substates: 0x42120
[ 1.645668] tsc: Refined TSC clocksource calibration: 3399.013 MHz
[ 2.841518] rtc_cmos 00:02: rtc core: registered rtc_cmos as rtc0
[ 2.868667] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 2.880924] xhci_hcd 0000:00:14.0: hcc params 0x200077c1 hci version 0x100 quirks 0x00009810
[ 2.893645] ata1: SATA max UDMA/133 abar m2048@0xefc39000 port 0xefc39100 irq 26
[ 3.204474] ata3.00: configured for UDMA/133
[ 3.269092] sd 2:0:0:0: [sdc] 4096-byte physical blocks
[ 3.321525] sd 0:0:0:0: [sda] Attached SCSI disk
[ 54.903311] NET: Registered protocol family 10
[ 58.426044] loop: module loaded
[ 59.727261] snd_hda_codec_realtek hdaudioC1D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 59.739982] input: HDA Intel PCH Front Headphone as /devices/pci0000:00/0000:00:1b.0/sound/card1/input12
[ 60.475133] Console: switching to colour dummy device 80x25
[ 60.572747] audit: type=1130 audit(1509997934.560:7): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-remount-fs comm="systemd" exe="/nix/store/dbay1civ36b8hkr26wsxh586r331wy94-systemd-234/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 62.495233] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
[ 1314.303437] VERIFY3(0 == dmu_object_dirty_raw(os, object, tx)) failed (0 == 17)
[ 1314.303482] [<ffffffffbd4f7a12>] dump_stack+0x63/0x81
[ 1314.303540] [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1314.303584] [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1314.303599] [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1469.502873] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 1475.097656] ffff95ee0c354240 ffff9cc890f0fd30 ffffffffbd765082 ffffffffc030cf52
[ 1475.097747] [<ffffffffc017ecb8>] cv_wait_common+0x128/0x140 [spl]
[ 1475.097866] [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1475.098077] ffff95ee0c351a80 ffff9cc883c878c8 ffffffffbd765082 0000000000000000
[ 1475.098114] [<ffffffffc017ece5>] __cv_wait+0x15/0x20 [spl]
[ 1475.098348] [<ffffffffc01976ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1475.098735] Call Trace:
[ 1475.099203] [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1597.966287] txg_quiesce D 0 455 2 0x00000000
[ 1597.966346] [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1597.966405] 0000000000000000 000000000003328e 0000000000000400 ffff95eddcab6a00
[ 1597.966431] [<ffffffffc019762a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1597.966467] [<ffffffffc01976ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1597.966511] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1597.966536] [<ffffffffc017caeb>] spl_panic+0xeb/0x100 [spl]
[ 1597.966585] [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1597.966620] [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1597.966634] [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1597.966681] ffff95ee0c351a80 ffff9cc88326ba90 ffffffffbd765082 ffffffffc030cf52
[ 1597.966699] [<ffffffffc017ece5>] __cv_wait+0x15/0x20 [spl]
[ 1597.966764] [<ffffffffc02a4fe8>] dsl_dataset_snapshot+0x138/0x2a0 [zfs]
[ 1597.966782] [<ffffffffc031705c>] zfs_ioc_snapshot+0x32c/0x390 [zfs]
[ 1720.835451] INFO: task txg_quiesce:455 blocked for more than 120 seconds.
[ 1720.835664] ffff95ee0c354240 ffff9cc890f0fd30 ffffffffbd765082 ffffffffc030cf52
[ 1720.835752] [<ffffffffbd765586>] schedule+0x36/0x80
[ 1720.835865] [<ffffffffc02e8350>] ? txg_wait_open+0x100/0x100 [zfs]
[ 1720.835871] [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1720.835877] [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1720.835881] [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1720.835884] [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1720.835888] [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
[ 1720.835907] INFO: task zfs:12024 blocked for more than 120 seconds.
[ 1720.835965] Tainted: P O 4.9.58 #1-NixOS
[ 1720.836016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1720.836086] zfs D 0 12024 11990 0x00000000
[ 1720.836090] ffff95eddb621800 0000000000000000 ffff95ee2fa97ec0 ffff95eddcab6a00
[ 1720.836095] ffff95ee0c351a80 ffff9cc883c878c8 ffffffffbd765082 0000000000000000
[ 1720.836099] 0000000000000000 000000000003328e 0000000000000400 ffff95eddcab6a00
[ 1720.836104] Call Trace:
[ 1720.836110] [<ffffffffbd765082>] ? __schedule+0x192/0x660
[ 1720.836114] [<ffffffffbd765586>] schedule+0x36/0x80
[ 1720.836121] [<ffffffffc017ecb8>] cv_wait_common+0x128/0x140 [spl]
[ 1720.836125] [<ffffffffbd2ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1720.836131] [<ffffffffc017ece5>] __cv_wait+0x15/0x20 [spl]
[ 1720.836165] [<ffffffffc027df92>] bqueue_enqueue+0x62/0xe0 [zfs]
[ 1720.836205] [<ffffffffc02938c1>] dmu_recv_stream+0x691/0x11c0 [zfs]
[ 1720.836211] [<ffffffffc019762a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1720.836256] [<ffffffffc031a8fa>] ? zfs_set_prop_nvlist+0x2fa/0x510 [zfs]
[ 1720.836296] [<ffffffffc031b057>] zfs_ioc_recv_impl+0x407/0x1170 [zfs]
[ 1720.836337] [<ffffffffc031c3f9>] zfs_ioc_recv_new+0x369/0x400 [zfs]
[ 1720.836343] [<ffffffffc01772cc>] ? spl_kmem_alloc_impl+0x9c/0x180 [spl]
[ 1720.836373] [<ffffffffc01794a9>] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1720.836382] [<ffffffffc019c8af>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1720.836390] [<ffffffffc019762a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1720.836399] [<ffffffffc01976ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1720.836451] [<ffffffffc03190eb>] zfsdev_ioctl+0x20b/0x660 [zfs]
[ 1720.836462] [<ffffffffbd3ff604>] do_vfs_ioctl+0x94/0x5c0
[ 1720.836468] [<ffffffffbd25dece>] ? __do_page_fault+0x25e/0x4c0
[ 1720.836473] [<ffffffffbd3ffba9>] SyS_ioctl+0x79/0x90
[ 1720.836477] [<ffffffffbd769ef7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1720.836482] INFO: task receive_writer:22544 blocked for more than 120 seconds.
[ 1720.836549] Tainted: P O 4.9.58 #1-NixOS
[ 1720.836600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1720.836670] receive_writer D 0 22544 2 0x00000000
[ 1720.836674] ffff95eddb621800 0000000000000000 ffff95ee2fb97ec0 ffff95ee0bf56a00
[ 1720.836679] ffff95ee0bfc0000 ffff9cc8839a3b30 ffffffffbd765082 0000000000000000
[ 1720.836683] ffffffffc03b95d0 00ffffffc03cf1d0 0000000000000001 ffff95ee0bf56a00
[ 1720.836687] Call Trace:
[ 1720.836693] [<ffffffffbd765082>] ? __schedule+0x192/0x660
[ 1720.836698] [<ffffffffbd765586>] schedule+0x36/0x80
[ 1720.836706] [<ffffffffc017caeb>] spl_panic+0xeb/0x100 [spl]
[ 1720.836748] [<ffffffffc030d3c2>] ? __set_error+0x22/0x30 [zfs]
[ 1720.836790] [<ffffffffc029a107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1720.836827] [<ffffffffc029a443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1720.836830] [<ffffffffbd7671e2>] ? mutex_lock+0x12/0x30
[ 1720.836880] [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1720.836918] [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1720.836954] [<ffffffffc0284a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1720.836988] [<ffffffffc0284b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1720.837024] [<ffffffffc0291858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1720.837058] [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1720.837062] [<ffffffffbd3ce021>] ? __slab_free+0xa1/0x2e0
[ 1720.837068] [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1720.837074] [<ffffffffc0176f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1720.837080] [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1720.837113] [<ffffffffc0291910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1720.837118] [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1720.837123] [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1720.837127] [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1720.837131] [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1720.837135] [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1720.837139] [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
Hmmm. You definitely should be seeing a message with error 17 in it... My only explanation is that the log could be cut off. Maybe you can increase the debug log to 1G temporarily by doing the following. Really sorry to ask you to keep crashing your machine, but I'm still not able to reproduce it locally:
echo 1073741824> /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
So which commands should I execute, and in what order? The server is still for testing... the real server still runs an mdadm raid1, luks-encrypted Debian with ext4 and uses rsync for backups ;)
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
# cause the panic
cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
This will use up 1G of system RAM for debug messages, by the way, so you might want to tune zfs_dbgmsg_maxsize back once we have everything figured out (or rebooting will clear the value).
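For example, to restore it later (4194304, i.e. 4M, should be the module default, if I recall correctly):
echo 4194304 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize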
not issuing the echo 4294967263 > /sys/module/zfs/parameters/zfs_flags command?
Yes. I'm sorry. Thanks for pointing that out. I updated the commands above.
so restart over again :) btw missing space for the dbgmsg_maxsize before the > :)
Now we have output:
root@servi-nixos:~# zfs destroy -r serviTank/encZFS/BU/subi/Nixos
cannot open 'serviTank/encZFS/BU/subi/Nixos': dataset does not exist
root@servi-nixos:~# zfs list
NAME USED AVAIL REFER MOUNTPOINT
serviTank 90.1G 3.42T 96K /serviTank
serviTank/encZFS 55.5G 3.42T 1.39M none
serviTank/encZFS/BU 2.78M 3.42T 1.39M none
serviTank/encZFS/BU/subi 1.39M 3.42T 1.39M none
serviTank/encZFS/Nixos 55.5G 3.42T 10.7G legacy
root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
root@servi-nixos:~# dmesg | tail -n 40
[ 77.492055] audit: type=1300 audit(1510001435.468:38): arch=c000003e syscall=54 success=yes exit=0 a0=4 a1=29 a2=40 a3=15d55c0 items=0 ppid=6321 pid=6550 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ip6tables" exe="/nix/store/hnwjdbj5w41sxw60cf4nfa93z8cr59nb-iptables-1.6.1/bin/xtables-multi" key=(null)
[ 77.492057] audit: type=1327 audit(1510001435.468:38): proctitle=6970367461626C6573002D77002D7400726177002D4E006E69786F732D66772D727066696C746572
[ 77.575067] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
[ 80.088561] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[ 80.917907] systemd-journald[6236]: Failed to set ACL on /var/log/journal/cc00e28c9c9241ad97b93019da77628d/user-1000.journal, ignoring: Operation not supported
[ 90.931897] wireguard: WireGuard 0.0.20171017 loaded. See www.wireguard.com for information.
[ 90.931898] wireguard: Copyright (C) 2015-2017 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[ 1244.050819] VERIFY3(0 == dmu_object_dirty_raw(os, object, tx)) failed (0 == 17)
[ 1244.050842] PANIC at dmu.c:937:dmu_free_long_object_impl()
[ 1244.050854] Showing stack for process 8103
[ 1244.050856] CPU: 6 PID: 8103 Comm: receive_writer Tainted: P O 4.9.58 #1-NixOS
[ 1244.050856] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1244.050857] ffffa13d43ac7b38 ffffffff842f7a12 ffffffffc057ce03 00000000000003a9
[ 1244.050859] ffffa13d43ac7b48 ffffffffc032d9f2 ffffa13d43ac7cd0 ffffffffc032dac5
[ 1244.050860] 0000000000000028 ffffa13d00000030 ffffa13d43ac7ce0 ffffa13d43ac7c80
[ 1244.050861] Call Trace:
[ 1244.050865] [<ffffffff842f7a12>] dump_stack+0x63/0x81
[ 1244.050869] [<ffffffffc032d9f2>] spl_dumpstack+0x42/0x50 [spl]
[ 1244.050870] [<ffffffffc032dac5>] spl_panic+0xc5/0x100 [spl]
[ 1244.050890] [<ffffffffc04bb3c2>] ? __set_error+0x22/0x30 [zfs]
[ 1244.050900] [<ffffffffc0448107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1244.050909] [<ffffffffc0448443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1244.050910] [<ffffffff845671e2>] ? mutex_lock+0x12/0x30
[ 1244.050922] [<ffffffffc047784b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1244.050931] [<ffffffffc044820b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1244.050939] [<ffffffffc0432a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1244.050947] [<ffffffffc0432b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1244.050955] [<ffffffffc043f858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1244.050963] [<ffffffffc043fcb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1244.050965] [<ffffffff841ce069>] ? __slab_free+0xe9/0x2e0
[ 1244.050966] [<ffffffff840a5200>] ? set_next_entity+0x70/0x890
[ 1244.050968] [<ffffffffc0327f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1244.050969] [<ffffffffc032a5d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1244.050977] [<ffffffffc043f910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1244.050978] [<ffffffffc032a5d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1244.050980] [<ffffffffc032a642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1244.050981] [<ffffffff8408e457>] kthread+0xd7/0xf0
[ 1244.050981] [<ffffffff8408e380>] ? kthread_park+0x60/0x60
[ 1244.050982] [<ffffffff8408e380>] ? kthread_park+0x60/0x60
[ 1244.050983] [<ffffffff8456a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
1510001732 dnode.c:1421:dnode_hold_impl(): error 17
1510002335 dnode.c:1421:dnode_hold_impl(): error 17
1510002602 dnode.c:1421:dnode_hold_impl(): error 17
1510002602 dnode.c:1421:dnode_hold_impl(): error 17
Wonderful. I should be able to look at this tonight. Thanks for the help.
doesn't look to me like that's of any help... but well :) thanks for your work.
It's the line number. That should be enough for me to figure it out but I'll post here if it's not
Good luck.
Oh. I think I already fixed this actually: 35df0bb5565c81ba52de430108b37eced9de0606.
It looks like that's the only encryption fix (so far) that you don't have. Sorry for all the confusion. I'll close this for now. Feel free to reopen the issue if you try the new code and see that it's still broken.
Hmmm, that was fixed 19 days ago... why don't I have that one yet :) well, thanks for checking :)
Sorry for taking this long. There was a break-in at the office and we had to set everything up again...
Anyway, I have now tested with the latest git master and still see the same issues:
root@servi-nixos:~# zfs destroy -r serviTank/encZFS/BU/subi/Nixos
root@servi-nixos:~# zfs list
NAME USED AVAIL REFER MOUNTPOINT
serviTank 366G 3.15T 96K /serviTank
serviTank/encZFS 271G 3.15T 1.39M none
serviTank/encZFS/BU 195G 3.15T 1.39M none
serviTank/encZFS/BU/jus-law 195G 3.15T 1.39M none
serviTank/encZFS/BU/jus-law/Nixos-2017-11-12_05-00 1.68G 3.15T 1.68G none
serviTank/encZFS/BU/jus-law/Nixos-2017-11-12_13-00 1.68G 3.15T 1.68G none
serviTank/encZFS/BU/jus-law/VMs-2017-11-12_05-00 95.4G 3.15T 95.4G none
serviTank/encZFS/BU/jus-law/VMs-2017-11-12_13-00 96.1G 3.15T 96.1G none
serviTank/encZFS/BU/subi 1.39M 3.15T 1.39M none
serviTank/encZFS/Nixos 76.5G 3.15T 14.2G legacy
root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
[ 144.227900] wireguard: WireGuard 0.0.20171101 loaded. See www.wireguard.com for information.
[ 144.227901] wireguard: Copyright (C) 2015-2017 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[ 1455.640657] VERIFY3(0 == dmu_object_set_nlevels(rwa->os, drro->drr_object, drro->drr_nlevels, tx)) failed (0 == 22)
[ 1455.640683] PANIC at dmu_send.c:2540:receive_object()
[ 1455.640694] Showing stack for process 24166
[ 1455.640696] CPU: 0 PID: 24166 Comm: receive_writer Tainted: P O 4.9.61 #1-NixOS
[ 1455.640696] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1455.640697] ffffb1001724fb68 ffffffffad0f7ba2 ffffffffc03f9e64 00000000000009ec
[ 1455.640699] ffffb1001724fb78 ffffffffc01a95c2 ffffb1001724fd00 ffffffffc01a9695
[ 1455.640700] 0000000000000000 0000000000000030 ffffb1001724fd10 ffffb1001724fcb0
[ 1455.640701] Call Trace:
[ 1455.640705] [<ffffffffad0f7ba2>] dump_stack+0x63/0x81
[ 1455.640709] [<ffffffffc01a95c2>] spl_dumpstack+0x42/0x50 [spl]
[ 1455.640710] [<ffffffffc01a9695>] spl_panic+0xc5/0x100 [spl]
[ 1455.640712] [<ffffffffc01a3f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1455.640734] [<ffffffffc02c45b3>] ? dnode_rele_and_unlock+0x53/0x80 [zfs]
[ 1455.640744] [<ffffffffc02c4619>] ? dnode_rele+0x39/0x40 [zfs]
[ 1455.640753] [<ffffffffc02b8d8f>] receive_object+0x51f/0x720 [zfs]
[ 1455.640762] [<ffffffffc02c4619>] ? dnode_rele+0x39/0x40 [zfs]
[ 1455.640770] [<ffffffffc02bbf36>] receive_writer_thread+0x3c6/0xd50 [zfs]
[ 1455.640772] [<ffffffffacfce0a9>] ? __slab_free+0xe9/0x2e0
[ 1455.640773] [<ffffffffacea5240>] ? set_next_entity+0x70/0x890
[ 1455.640775] [<ffffffffc01a3f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1455.640776] [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1455.640785] [<ffffffffc02bbb70>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1455.640786] [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1455.640787] [<ffffffffc01a6642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1455.640788] [<fffffffface8e497>] kthread+0xd7/0xf0
[ 1455.640789] [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1455.640790] [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1455.640791] [<ffffffffad36a155>] ret_from_fork+0x25/0x30
[ 1598.279468] INFO: task txg_quiesce:8945 blocked for more than 120 seconds.
[ 1598.279546] Tainted: P O 4.9.61 #1-NixOS
[ 1598.279599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1598.279670] txg_quiesce D 0 8945 2 0x00000000
[ 1598.279677] ffff9b75c992a800 0000000000000000 ffff9b75efb97e40 ffff9b75a7b4ea00
[ 1598.279683] ffff9b75cc354f80 ffffb1001d7a3d30 ffffffffad365082 ffffb1001d7a3d00
[ 1598.279688] 0000000000000246 000000018020001f ffffb1001d7a3d50 ffff9b75a7b4ea00
[ 1598.279694] Call Trace:
[ 1598.279710] [<ffffffffad365082>] ? __schedule+0x192/0x660
[ 1598.279716] [<ffffffffad365586>] schedule+0x36/0x80
[ 1598.279729] [<ffffffffc01ab888>] cv_wait_common+0x128/0x140 [spl]
[ 1598.279736] [<ffffffffacead3d0>] ? wake_atomic_t_function+0x60/0x60
[ 1598.279744] [<ffffffffc01ab8b5>] __cv_wait+0x15/0x20 [spl]
[ 1598.279830] [<ffffffffc0312a13>] txg_quiesce_thread+0x2e3/0x3f0 [zfs]
[ 1598.279883] [<ffffffffc0312730>] ? txg_wait_open+0x100/0x100 [zfs]
[ 1598.279891] [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1598.279897] [<ffffffffc01a6642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1598.279902] [<fffffffface8e497>] kthread+0xd7/0xf0
[ 1598.279905] [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1598.279908] [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1598.279912] [<ffffffffad36a155>] ret_from_fork+0x25/0x30
[ 1598.279931] INFO: task zfs:15764 blocked for more than 120 seconds.
[ 1598.279990] Tainted: P O 4.9.61 #1-NixOS
[ 1598.280041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1598.280112] zfs D 0 15764 15753 0x00000000
[ 1598.280116] ffff9b75afe30400 0000000000000000 ffff9b75efb57e40 ffff9b75c995a7c0
[ 1598.280121] ffff9b75cc354240 ffffb10003adb8c8 ffffffffad365082 ffffffffc02a2e06
[ 1598.280126] ffff9b72d4689090 0000000000000001 ffffb10003adb8d0 ffff9b75c995a7c0
[ 1598.280131] Call Trace:
[ 1598.280138] [<ffffffffad365082>] ? __schedule+0x192/0x660
[ 1598.280180] [<ffffffffc02a2e06>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1598.280186] [<ffffffffad365586>] schedule+0x36/0x80
[ 1598.280194] [<ffffffffc01ab888>] cv_wait_common+0x128/0x140 [spl]
[ 1598.280199] [<ffffffffacead3d0>] ? wake_atomic_t_function+0x60/0x60
[ 1598.280205] [<ffffffffc01ab8b5>] __cv_wait+0x15/0x20 [spl]
[ 1598.280246] [<ffffffffc02a81d2>] bqueue_enqueue+0x62/0xe0 [zfs]
[ 1598.280294] [<ffffffffc02bdb21>] dmu_recv_stream+0x691/0x11c0 [zfs]
[ 1598.280302] [<ffffffffc01c462a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1598.280355] [<ffffffffc0344d1a>] ? zfs_set_prop_nvlist+0x2fa/0x510 [zfs]
[ 1598.280404] [<ffffffffc0345477>] zfs_ioc_recv_impl+0x407/0x1170 [zfs]
[ 1598.280472] [<ffffffffc0346819>] zfs_ioc_recv_new+0x369/0x400 [zfs]
[ 1598.280481] [<ffffffffc01a42cc>] ? spl_kmem_alloc_impl+0x9c/0x180 [spl]
[ 1598.280488] [<ffffffffc01a64a9>] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1598.280495] [<ffffffffc01c98af>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1598.280501] [<ffffffffc01c462a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1598.280507] [<ffffffffc01c46ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1598.280555] [<ffffffffc034350b>] zfsdev_ioctl+0x20b/0x660 [zfs]
[ 1598.280568] [<ffffffffacfff6e4>] do_vfs_ioctl+0x94/0x5c0
[ 1598.280575] [<fffffffface5dece>] ? __do_page_fault+0x25e/0x4c0
[ 1598.280580] [<ffffffffacfffc89>] SyS_ioctl+0x79/0x90
[ 1598.280585] [<ffffffffad369ef7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1598.280595] INFO: task receive_writer:24166 blocked for more than 120 seconds.
[ 1598.280662] Tainted: P O 4.9.61 #1-NixOS
[ 1598.280713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1598.280783] receive_writer D 0 24166 2 0x00000000
[ 1598.280787] ffff9b75c9b13000 0000000000000000 ffff9b75efa17e40 ffff9b75a7b4cf80
[ 1598.280792] ffff9b75cbe027c0 ffffb1001724fb60 ffffffffad365082 0000000000000000
[ 1598.280797] ffffffffc03e4728 00ffffffc03fa708 0000000000000000 ffff9b75a7b4cf80
[ 1598.280802] Call Trace:
[ 1598.280809] [<ffffffffad365082>] ? __schedule+0x192/0x660
[ 1598.280815] [<ffffffffad365586>] schedule+0x36/0x80
[ 1598.280823] [<ffffffffc01a96bb>] spl_panic+0xeb/0x100 [spl]
[ 1598.280831] [<ffffffffc01a3f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1598.280883] [<ffffffffc02c45b3>] ? dnode_rele_and_unlock+0x53/0x80 [zfs]
[ 1598.280931] [<ffffffffc02c4619>] ? dnode_rele+0x39/0x40 [zfs]
[ 1598.280977] [<ffffffffc02b8d8f>] receive_object+0x51f/0x720 [zfs]
[ 1598.281023] [<ffffffffc02c4619>] ? dnode_rele+0x39/0x40 [zfs]
[ 1598.281068] [<ffffffffc02bbf36>] receive_writer_thread+0x3c6/0xd50 [zfs]
[ 1598.281073] [<ffffffffacfce0a9>] ? __slab_free+0xe9/0x2e0
[ 1598.281080] [<ffffffffacea5240>] ? set_next_entity+0x70/0x890
[ 1598.281086] [<ffffffffc01a3f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1598.281093] [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1598.281136] [<ffffffffc02bbb70>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1598.281142] [<ffffffffc01a65d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1598.281149] [<ffffffffc01a6642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1598.281152] [<fffffffface8e497>] kthread+0xd7/0xf0
[ 1598.281156] [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1598.281158] [<fffffffface8e3c0>] ? kthread_park+0x60/0x60
[ 1598.281163] [<ffffffffad36a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# modinfo zfs
filename: /run/current-system/kernel-modules/lib/modules/4.9.61/extra/zfs/zfs.ko.xz
version: 0.7.0-1
license: CDDL
author: OpenZFS on Linux
description: ZFS
srcversion: FF479BCC41F37E7DD1898D9
depends: spl,znvpair,zcommon,icp,zunicode,zavl
vermagic: 4.9.61 SMP mod_unload
root@servi-nixos:~# modinfo spl
filename: /run/current-system/kernel-modules/lib/modules/4.9.61/extra/spl/spl.ko.xz
version: 0.7.0-1
license: GPL
author: OpenZFS on Linux
description: Solaris Porting Layer
srcversion: E9C0243BEDADDD2716B8065
depends:
vermagic: 4.9.61 SMP mod_unload
In that case let me reopen this.
@sjau It seems your error is manifesting differently now. Can you provide me the output of
cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 22'
Now would also probably be a good time to mention that I am aware of some issues with raw sends which I am currently working on. I don't know if your issue is related to these, but I hope to have a PR opened to fix them tomorrow (although it will probably take some time before it is merged). You can follow this at #6845.
There we go:
using zfs rev. 5277f208f290ea4e2204800a66c3ba20a03fe503 and spl rev. 35a44fcb8d6e346f51be82dfe57562c2ea0c6a9c
[ 126.133956] wireguard: Copyright (C) 2015-2017 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[44974.528445] refused connection: IN=eth0 OUT= MAC=d0:50:99:5c:cb:65:d8:58:d7:00:1d:a6:08:00 SRC=106.120.86.130 DST=10.0.0.3 LEN=40 TOS=0x00 PREC=0x00 TTL=37 ID=1514 PROTO=TCP SPT=34145 DPT=6882 WINDOW=1400 RES=0x00 SYN URGP=0
[61625.231168] refused connection: IN=eth0 OUT= MAC=d0:50:99:5c:cb:65:d8:58:d7:00:1d:a6:08:00 SRC=106.120.86.130 DST=10.0.0.3 LEN=40 TOS=0x00 PREC=0x00 TTL=37 ID=39254 PROTO=TCP SPT=34145 DPT=6882 WINDOW=1400 RES=0x00 SYN URGP=0
[122460.464918] VERIFY3(0 == dmu_object_set_nlevels(rwa->os, drro->drr_object, drro->drr_nlevels, tx)) failed (0 == 22)
[122460.464944] PANIC at dmu_send.c:2540:receive_object()
[122460.464996] Showing stack for process 9158
[122460.464998] CPU: 0 PID: 9158 Comm: receive_writer Tainted: P O 4.9.61 #1-NixOS
[122460.464999] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[122460.465000] ffffad51d01afb68 ffffffff9f8f7ba2 ffffffffc0598e64 00000000000009ec
[122460.465002] ffffad51d01afb78 ffffffffc03485c2 ffffad51d01afd00 ffffffffc0348695
[122460.465004] 0000000000000000 0000000000000030 ffffad51d01afd10 ffffad51d01afcb0
[122460.465006] Call Trace:
[122460.465011] [<ffffffff9f8f7ba2>] dump_stack+0x63/0x81
[122460.465016] [<ffffffffc03485c2>] spl_dumpstack+0x42/0x50 [spl]
[122460.465018] [<ffffffffc0348695>] spl_panic+0xc5/0x100 [spl]
[122460.465020] [<ffffffffc0342f53>] ? spl_kmem_free+0x33/0x40 [spl]
[122460.465054] [<ffffffffc04635b3>] ? dnode_rele_and_unlock+0x53/0x80 [zfs]
[122460.465063] [<ffffffffc0463619>] ? dnode_rele+0x39/0x40 [zfs]
[122460.465072] [<ffffffffc0457d8f>] receive_object+0x51f/0x720 [zfs]
[122460.465081] [<ffffffffc0463619>] ? dnode_rele+0x39/0x40 [zfs]
[122460.465089] [<ffffffffc045af36>] receive_writer_thread+0x3c6/0xd50 [zfs]
[122460.465091] [<ffffffff9f7ce061>] ? __slab_free+0xa1/0x2e0
[122460.465093] [<ffffffff9f6a5240>] ? set_next_entity+0x70/0x890
[122460.465094] [<ffffffffc0342f53>] ? spl_kmem_free+0x33/0x40 [spl]
[122460.465095] [<ffffffffc03455d0>] ? __thread_exit+0x20/0x20 [spl]
[122460.465104] [<ffffffffc045ab70>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[122460.465105] [<ffffffffc03455d0>] ? __thread_exit+0x20/0x20 [spl]
[122460.465106] [<ffffffffc0345642>] thread_generic_wrapper+0x72/0x80 [spl]
[122460.465107] [<ffffffff9f68e497>] kthread+0xd7/0xf0
[122460.465108] [<ffffffff9f68e3c0>] ? kthread_park+0x60/0x60
[122460.465109] [<ffffffff9f68e3c0>] ? kthread_park+0x60/0x60
[122460.465110] [<ffffffff9fb6a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 22'
1510638852 zap.c:765:fzap_checksize(): error 22
1510638858 zap.c:765:fzap_checksize(): error 22
1510638863 zap.c:765:fzap_checksize(): error 22
[...]
1510639205 zap.c:765:fzap_checksize(): error 22
1510639206 zap.c:765:fzap_checksize(): error 22
1510639207 zap.c:765:fzap_checksize(): error 22
1510639207 dnode.c:1222:dnode_hold_impl(): error 22
1510639207 dnode.c:1222:dnode_hold_impl(): error 22
1510639207 dnode.c:1222:dnode_hold_impl(): error 22
[...]
1510639207 dnode.c:1222:dnode_hold_impl(): error 22
1510639207 dnode.c:1222:dnode_hold_impl(): error 22
1510639207 dnode.c:1222:dnode_hold_impl(): error 22
1510639207 dnode.c:1222:dnode_hold_impl(): error 22
1510639208 zap.c:765:fzap_checksize(): error 22
1510639209 zap.c:765:fzap_checksize(): error 22
1510639210 zap.c:765:fzap_checksize(): error 22
[...]
1510639994 zap.c:765:fzap_checksize(): error 22
1510639994 zap.c:765:fzap_checksize(): error 22
1510640005 zap.c:765:fzap_checksize(): error 22
1510640009 dnode.c:1712:dnode_set_nlevels(): error 22
1510640009 dnode.c:1712:dnode_set_nlevels(): error 22
1510640011 zap.c:765:fzap_checksize(): error 22
Btw, is there a way to send encrypted datasets not in raw mode? I know raw mode enables me to send an encrypted DS anywhere and be sure it remains encrypted.
However, if the top dataset on the receiver is also encrypted, then a newly created dataset should also be encrypted, right? In that case raw mode would not seem necessary to ensure that the child dataset is encrypted. However, I couldn't figure that out.
Btw, is there a way to send encrypted datasets not in raw mode?
Yes, simply load the key and use zfs send without -w. Unfortunately though, for the moment -p and -R imply -w.
However, if the top dataset on the receiver is also encrypted, then a newly created dataset should also be encrypted, right?
Correct. Raw send is useful if you want to back up to an untrusted server in such a way that they cannot read your data, but you can if they give it back to you. It basically just takes the encrypted data that is on disk and transfers it as-is to the new destination.
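A minimal sketch of the two modes (hypothetical snapshot and host names; for the non-raw case the key must be loaded on the sender to read the data, and the receiving parent's key must be loaded to write it):
# raw send: ciphertext goes over the wire, the receiver never needs the key
zfs send -w tank/encZFS/Nixos@snap | ssh backup 'zfs receive serviTank/encZFS/BU/subi/Nixos'
# non-raw send: data is decrypted on the sender and re-encrypted on the receiver
# under the encrypted parent's key
zfs load-key tank/encZFS
zfs send tank/encZFS/Nixos@snap | ssh backup 'zfs receive serviTank/encZFS/BU/subi/Nixos'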
So no incremental sending is possible without raw mode.
Btw, I also had a read of https://github.com/zfsonlinux/zfs/pull/6864 and https://github.com/zfsonlinux/zfsonlinux.github.com/pull/35/files#diff-5c212f01f7fcb3ccece492a2f81b17caR202 - thanks for all the hard work you put in.
So no incremental sending is possible without raw mode.
Incrementals are still possible. If you are having issues with them please let me know.
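For example, a non-raw incremental might look like this (hypothetical snapshot names, keys loaded as above):
zfs send -i tank/encZFS/Nixos@snap1 tank/encZFS/Nixos@snap2 | ssh backup 'zfs receive serviTank/encZFS/BU/subi/Nixos'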
OK. Now I should just need one last bit of info to debug this. Sorry for making you do all the work, but I still haven't been able to reproduce this locally.
Please apply this diff:
diff --git a/module/zfs/dnode.c b/module/zfs/dnode.c
index c1fbf3c..dc08b9a 100644
--- a/module/zfs/dnode.c
+++ b/module/zfs/dnode.c
@@ -1709,6 +1709,9 @@ dnode_set_nlevels(dnode_t *dn, int nlevels, dmu_tx_t *tx)
ret = 0;
goto out;
} else if (nlevels < dn->dn_nlevels) {
+ zfs_dbgmsg("HERE: obj = %llu nlevels = %d, dn_nlevels = %d",
+ (unsigned long long)dn->dn_object,
+ (int)nlevels, (int)dn->dn_nlevels);
ret = SET_ERROR(EINVAL);
goto out;
}
This simply adds a debugging line that will show up in /proc/spl/kstat/zfs/dbgmsg. Once you have applied the diff, rebuilt and installed the module, and set the tunables, cause the crash again. This time you should look for the new message with: cat /proc/spl/kstat/zfs/dbgmsg | grep 'HERE'. Paste the output of that here.
You will notice that the output will include obj = <some number>. Take this number and enter it into the following command, replacing the <dataset@snapshot> and <number> values:
zfs send <dataset@snapshot> | zstreamdump -v | grep -w 'OBJECT' | grep 'object = <number>'
Then provide the output of that command as well. Thanks again and hopefully that will be the last bit of info I need.
I'll have to see how I can apply this patch.... on NixOS it doesn't seem to be quite as trivial :(
If you can find the dnode_set_nlevels() function in dnode.c, you should be able to just insert the lines indicated with + signs manually. (You will need to remove the + signs from the code to get it to compile.)
The problem is NixOS works declaratively, and I'm not sure, if I modify the source myself, that it won't pull the default from the repos... I tried to advance the revisions used in declarative mode but was not successful. The way I succeeded was to mount /nix/store in writeable mode, then find the zfs and spl default.nix files, alter the revision and sha sum directly there, and have it rebuild without upgrading... that way it pulled the new revision sources...
Hmmm, I couldn't apply the patch. Maybe I need a different revision than the one I'm using... I also assume it's for zfs and not spl?
applying patch /nix/store/q2pc8flj0l9a89izb0rgvr7jw1xcakp5-zfs_debug.patch
patching file module/zfs/dnode.c
Hunk #1 FAILED at 1709.
1 out of 1 hunk FAILED -- saving rejects to file module/zfs/dnode.c.rej
builder for ‘/nix/store/s440np4z3cm4fd7045wkhy7d4qqhir1a-zfs-kernel-2017-11-12-4.9.61.drv’ failed with exit code 1
The patch is for zfs. Can you find the file module/zfs/dnode.c? You should be able to open that file and find this code:
int
dnode_set_nlevels(dnode_t *dn, int nlevels, dmu_tx_t *tx)
{
int ret = 0;
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
if (dn->dn_nlevels == nlevels) {
ret = 0;
goto out;
} else if (nlevels < dn->dn_nlevels) {
ret = SET_ERROR(EINVAL);
goto out;
}
dnode_set_nlevels_impl(dn, nlevels, tx);
out:
rw_exit(&dn->dn_struct_rwlock);
return (ret);
}
Then you change it to this:
int
dnode_set_nlevels(dnode_t *dn, int nlevels, dmu_tx_t *tx)
{
int ret = 0;
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
if (dn->dn_nlevels == nlevels) {
ret = 0;
goto out;
} else if (nlevels < dn->dn_nlevels) {
ret = SET_ERROR(EINVAL);
zfs_dbgmsg("HERE: obj = %llu nlevels = %d, dn_nlevels = %d",
(unsigned long long)dn->dn_object,
(int)nlevels, (int)dn->dn_nlevels);
goto out;
}
dnode_set_nlevels_impl(dn, nlevels, tx);
out:
rw_exit(&dn->dn_struct_rwlock);
return (ret);
}
Although I'm wondering if you are looking at the right code... Your dbgmsg error pointed at exactly the correct line number for me.
well, I used:
zfsUnstable = common {
# comment/uncomment if breaking kernel versions are known
incompatibleKernelVersion = null;
# this package should point to a version / git revision compatible with the latest kernel release
version = "2017-11-12";
rev = "5277f208f290ea4e2204800a66c3ba20a03fe503";
sha256 = "0hhlhv4g678j1w45813xfrk8zza0af59cdkmib9bkxy0cn0jsnd6";
isUnstable = true;
extraPatches = [
(fetchpatch {
url = "https://www.sjau.ch/zfs_debug.patch";
sha256 = "0cvq6g8yjgki32yrlzy3d95c695a82lmy1vx86flhil7ny1222b5";
})
(fetchpatch {
url = "https://github.com/Mic92/zfs/compare/ded8f06a3cfee...nixos-zfs-2017-09-12.patch";
sha256 = "033wf4jn0h0kp0h47ai98rywnkv5jwvf3xwym30phnaf8xxdx8aj";
})
];
spl = splUnstable;
};
}
Oh, I just remembered: GitHub changes tabs to spaces in diffs, which screws with git apply. Try this: https://pastebin.com/AHncmrBR
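If that version still fails, relaxing whitespace matching when applying might also work (untested here):
git apply --ignore-whitespace zfs_debug.patch
# or with patch:
patch -p1 -l < zfs_debug.patch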
it's building now :)
So, successfully crashed it again:
root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'HERE'
1510685051 dnode.c:1715:dnode_set_nlevels(): HERE: obj = 393261 nlevels = 1, dn_nlevels = 2
1510685051 dnode.c:1715:dnode_set_nlevels(): HERE: obj = 393261 nlevels = 1, dn_nlevels = 2
You said to run now:
zfs send <dataset@snapshot> | zstreamdump -v | grep -w 'OBJECT' | grep 'object = 39261'
So, you want me to reboot the crashed server, remove the dataset that was just sent, and cause the crash again? And then re-initiate the zfs send from the sending server with the above command? Shouldn't that finally be output to /dev/null or something?
Or do you want me to run that command on the currently crashed server?
So, you want me to reboot the crashed server, remove the dataset that was just sent, and cause the crash again? And then re-initiate the zfs send from the sending server with the above command? Shouldn't that finally be output to /dev/null or something? Or do you want me to run that command on the currently crashed server?
This command should be run on the send-side machine. You should be able to run it even if the server is crashed (I think), but if not, just reboot it. zstreamdump will just take the output of zfs send and make it human-readable. The output doesn't need to be redirected anywhere. The zfs send command before the first pipe should just match the send command you used to cause the crash.
I also just wanted to confirm: is this dataset using the default recordsize?
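For reference, that can be checked on the sending side with:
zfs get recordsize tank/encZFS/Nixos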
ok, it's doing its thing... will be another 20-30 minutes
There we go:
root@subi:~# zfs send -wR tank/encZFS/Nixos@zfs-auto-snap_hourly-2017-11-14-19h00 | zstreamdump -v | grep -w 'OBJECT' | grep 'object = 39261'
OBJECT object = 392610 type = 19 bonustype = 44 blksz = 512 bonuslen = 312 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392611 type = 19 bonustype = 44 blksz = 512 bonuslen = 300 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392612 type = 19 bonustype = 44 blksz = 512 bonuslen = 298 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392613 type = 19 bonustype = 44 blksz = 512 bonuslen = 298 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392614 type = 19 bonustype = 44 blksz = 512 bonuslen = 298 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392615 type = 19 bonustype = 44 blksz = 512 bonuslen = 295 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392616 type = 19 bonustype = 44 blksz = 512 bonuslen = 299 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392617 type = 19 bonustype = 44 blksz = 512 bonuslen = 295 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392618 type = 19 bonustype = 44 blksz = 512 bonuslen = 302 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
OBJECT object = 392619 type = 19 bonustype = 44 blksz = 512 bonuslen = 303 dn_slots = 1 raw_bonuslen = 320 flags = 0 indblkshift = 17 nlevels = 1 nblkptr = 1
System information
Describe the problem you're observing
I'm trying to back up encrypted datasets from my notebook to my home server. Both run the same NixOS version. However, it doesn't work.
When I use the -wR options for a full dataset send, then on the receiving end spl_panic appears and it never finishes. If I omit the -R option, then the full dataset send works. However, when I then try to send an incremental set, the same thing happens again - spl_panic.
Describe how to reproduce the problem
On the notebook I have: tank/encZFS/Nixos -> encZFS was created this way:
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase -o mountpoint=none -o atime=off ${zfsPool}/encZFS
On the server I have: serviTank/BU/subi -> None of those is encrypted
I then took a snapshot and tried to send like this:
zfs send -wR tank/encZFS/Nixos@encZFSSend_2017-11-04_12-31 | ssh root@10.0.0.3 'zfs receive serviTank/BU/subi/Nixos'
It seems all was transferred correctly.
However, the zfs send/receive command never "finishes" and on the server side dmesg shows spl_panic.
As said, if I don't use the -R option on the first dataset send to the server, it works fine, but when I then try to send an incremental snapshot, the same thing happens. I also tried to send the snapshots to an encrypted child dataset on the server, with the same results.