Closed phreaker0 closed 6 years ago
I'll try to get a USB drive on Monday and look into this. The error here is that the code is attempting to encrypt new data before you've loaded your keys. The zpool import -F
code must be attempting to write some data somewhere as part of the recovery process. I'll see if I can figure out what's going on.
I think I know what's going on. This should be a simple 1 line fix. Unfortunately, the problem is very hard for me to reproduce and I have not been able to test if it works yet. I will add it to #6864
That sounds good, thx for your work. I still have the original pool and also a dd backup of it, so I can test any patch you provide.
If you have it and have some time to test it: https://pastebin.com/wBJV59a8
I will, but the disk is at work so it takes till tomorrow.
Unfortunately I get the same Panic again with the patch applied:
Dez 05 09:19:53 token kernel: SPL: Loaded module v0.7.0-21_ged19bcc
Dez 05 09:19:53 token kernel: icp: module license 'CDDL' taints kernel.
Dez 05 09:19:53 token kernel: Disabling lock debugging due to kernel taint
Dez 05 09:19:54 token kernel: ZFS: Loaded module v0.7.0-204_g7b34070, ZFS pool version 5000, ZFS filesystem version 5
Dez 05 09:21:01 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Dez 05 09:21:08 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Dez 05 09:22:20 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, zio->io_bookmark.zb_objset, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Dez 05 09:22:20 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, zio->io_bookmark.zb_objset, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Dez 05 09:22:20 token kernel: PANIC at zio.c:3705:zio_encrypt()
Dez 05 09:22:20 token kernel: Showing stack for process 3575
Dez 05 09:22:20 token kernel: CPU: 2 PID: 3575 Comm: z_wr_iss Tainted: P OE 4.14.0-pf5+ #1
Dez 05 09:22:20 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Dez 05 09:22:20 token kernel: PANIC at zio.c:3705:zio_encrypt()
Dez 05 09:22:20 token kernel: Call Trace:
Dez 05 09:22:20 token kernel: Showing stack for process 3574
Dez 05 09:22:20 token kernel: dump_stack+0x63/0x8b
Dez 05 09:22:20 token kernel: spl_dumpstack+0x42/0x50 [spl]
Dez 05 09:22:20 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, zio->io_bookmark.zb_objset, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Dez 05 09:22:20 token kernel: PANIC at zio.c:3705:zio_encrypt()
Dez 05 09:22:20 token kernel: Showing stack for process 3576
Dez 05 09:22:20 token kernel: spl_panic+0xc8/0x110 [spl]
Dez 05 09:22:20 token kernel: ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Dez 05 09:22:20 token kernel: zio_encrypt+0x5d0/0x6b0 [zfs]
Dez 05 09:22:20 token kernel: ? zio_buf_free+0x58/0x60 [zfs]
Dez 05 09:22:20 token kernel: ? zio_write_compress+0x496/0x6a0 [zfs]
Dez 05 09:22:20 token kernel: ? __schedule+0x3ca/0x890
Dez 05 09:22:20 token kernel: zio_execute+0x8a/0xf0 [zfs]
Dez 05 09:22:20 token kernel: taskq_thread+0x2ab/0x4d0 [spl]
Dez 05 09:22:20 token kernel: ? wake_up_q+0x70/0x70
Dez 05 09:22:20 token kernel: ? zio_reexecute+0x3d0/0x3d0 [zfs]
Dez 05 09:22:20 token kernel: kthread+0x108/0x140
Dez 05 09:22:20 token kernel: ? taskq_thread_should_stop+0x70/0x70 [spl]
Dez 05 09:22:20 token kernel: ? kthread_create_on_node+0x70/0x70
Dez 05 09:22:20 token kernel: ret_from_fork+0x25/0x30
...
Hmmmm. And you're sure that you are using the new module? That's unfortunate. I was really pretty sure that this would have fixed the problem..... I'll try to come up with some steps to reproduce it.
I just recompiled to make sure I was, but same result. I'm using the DKMS method and I verified that the source for dkms (/usr/src/zfs-0.7.0/module/zfs/zil.c) contains the patch.
Ok. And I'm assuming you re-inserted the module after installing. DKMS doesn't actually do this for you so you would still be running the old code. But if you rebooted it should load the (new) installed version.
I will see if I can reproduce the issue locally.
Yeah, I rebooted to make sure.
I tested with the latest master version (v0.7.0-277_g0d23f5e) and unfortunately I still have the same issue:
root@token:~$ zpool import
pool: ext
id: 15081949372965998973
state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
see: http://zfsonlinux.org/msg/ZFS-8000-72
config:
ext FAULTED corrupted data
sdb ONLINE
root@token:~$ zpool import ext
cannot import 'ext': I/O error
Recovery is possible, but will result in some data loss.
Returning the pool to its state as of Mo 27 Nov 2017 09:05:11 CET
should correct the problem. Approximately 61 seconds of data
must be discarded, irreversibly. Recovery can be attempted
by executing 'zpool import -F ext'. A scrub of the pool
is strongly recommended after recovery.
root@token:~$ zpool import -N -F ext
Feb 05 10:26:42 token kernel: sdb: sdb1 sdb9
Feb 05 10:26:42 token kernel: sd 6:0:0:0: [sdb] Attached SCSI disk
Feb 05 10:26:42 token kernel: SPL: Loaded module v0.7.0-24_g23602fd
Feb 05 10:26:42 token kernel: icp: module license 'CDDL' taints kernel.
Feb 05 10:26:42 token kernel: Disabling lock debugging due to kernel taint
Feb 05 10:26:44 token kernel: ZFS: Loaded module v0.7.0-277_g0d23f5e, ZFS pool version 5000, ZFS filesystem version 5
Feb 05 10:27:15 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:28:07 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:28:07 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:28:07 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:28:07 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:01 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:01 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:01 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:02 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:10 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, dsobj, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Feb 05 10:29:10 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, dsobj, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Feb 05 10:29:10 token kernel: PANIC at zio.c:3844:zio_encrypt()
Feb 05 10:29:10 token kernel: Showing stack for process 26845
Feb 05 10:29:10 token kernel: PANIC at zio.c:3844:zio_encrypt()
Feb 05 10:29:10 token kernel: CPU: 3 PID: 26845 Comm: z_wr_iss Tainted: P OE 4.15.0-pf1+ #1
Feb 05 10:29:10 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Feb 05 10:29:10 token kernel: Showing stack for process 26846
Feb 05 10:29:10 token kernel: Call Trace:
Feb 05 10:29:10 token kernel: dump_stack+0x63/0x8b
Feb 05 10:29:10 token kernel: spl_dumpstack+0x42/0x50 [spl]
Feb 05 10:29:10 token kernel: spl_panic+0xc8/0x110 [spl]
Feb 05 10:29:10 token kernel: ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Feb 05 10:29:10 token kernel: ? spl_kmem_cache_free+0x139/0x1c0 [spl]
Feb 05 10:29:10 token kernel: zio_encrypt+0x68a/0x770 [zfs]
Feb 05 10:29:10 token kernel: ? zio_write_compress+0x496/0x6a0 [zfs]
Feb 05 10:29:10 token kernel: ? __schedule+0x3cd/0x890
Feb 05 10:29:10 token kernel: zio_execute+0x8a/0xf0 [zfs]
Feb 05 10:29:10 token kernel: taskq_thread+0x2ab/0x4d0 [spl]
Feb 05 10:29:10 token kernel: ? wake_up_q+0x70/0x70
Feb 05 10:29:10 token kernel: ? zio_reexecute+0x430/0x430 [zfs]
Feb 05 10:29:10 token kernel: kthread+0x102/0x140
Feb 05 10:29:10 token kernel: ? taskq_thread_should_stop+0x70/0x70 [spl]
Feb 05 10:29:10 token kernel: ? kthread_associate_blkcg+0x90/0x90
Feb 05 10:29:10 token kernel: ? kthread_associate_blkcg+0x90/0x90
Feb 05 10:29:10 token kernel: ret_from_fork+0x35/0x40
Feb 05 10:29:10 token kernel: CPU: 1 PID: 26846 Comm: z_wr_iss Tainted: P OE 4.15.0-pf1+ #1
Feb 05 10:29:10 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Feb 05 10:29:10 token kernel: Call Trace:
Feb 05 10:29:10 token kernel: dump_stack+0x63/0x8b
Feb 05 10:29:10 token kernel: spl_dumpstack+0x42/0x50 [spl]
Feb 05 10:29:10 token kernel: spl_panic+0xc8/0x110 [spl]
Feb 05 10:29:10 token kernel: ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Feb 05 10:29:10 token kernel: ? spl_kmem_cache_free+0x139/0x1c0 [spl]
Feb 05 10:29:10 token kernel: zio_encrypt+0x68a/0x770 [zfs]
Feb 05 10:29:10 token kernel: ? zio_write_compress+0x496/0x6a0 [zfs]
Feb 05 10:29:10 token kernel: ? __schedule+0x3cd/0x890
Feb 05 10:29:10 token kernel: zio_execute+0x8a/0xf0 [zfs]
Feb 05 10:29:10 token kernel: taskq_thread+0x2ab/0x4d0 [spl]
Feb 05 10:29:10 token kernel: ? wake_up_q+0x70/0x70
Feb 05 10:29:10 token kernel: ? zio_reexecute+0x430/0x430 [zfs]
Feb 05 10:29:10 token kernel: kthread+0x102/0x140
Feb 05 10:29:10 token kernel: ? taskq_thread_should_stop+0x70/0x70 [spl]
Feb 05 10:29:10 token kernel: ? kthread_associate_blkcg+0x90/0x90
Feb 05 10:29:10 token kernel: ret_from_fork+0x35/0x40
Feb 05 10:29:10 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, dsobj, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Feb 05 10:29:10 token kernel: PANIC at zio.c:3844:zio_encrypt()
Feb 05 10:29:10 token kernel: Showing stack for process 26847
Feb 05 10:29:10 token kernel: CPU: 1 PID: 26847 Comm: z_wr_iss Tainted: P OE 4.15.0-pf1+ #1
Feb 05 10:29:10 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Feb 05 10:29:10 token kernel: Call Trace:
Feb 05 10:29:10 token kernel: dump_stack+0x63/0x8b
Feb 05 10:29:10 token kernel: spl_dumpstack+0x42/0x50 [spl]
Feb 05 10:29:10 token kernel: spl_panic+0xc8/0x110 [spl]
Feb 05 10:29:10 token kernel: ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Feb 05 10:29:10 token kernel: ? spl_kmem_cache_free+0x139/0x1c0 [spl]
Feb 05 10:29:10 token kernel: zio_encrypt+0x68a/0x770 [zfs]
Feb 05 10:29:10 token kernel: ? zio_write_compress+0x496/0x6a0 [zfs]
Feb 05 10:29:10 token kernel: ? __wake_up_common_lock+0x8e/0xc0
Feb 05 10:29:10 token kernel: zio_execute+0x8a/0xf0 [zfs]
Feb 05 10:29:10 token kernel: taskq_thread+0x2ab/0x4d0 [spl]
Feb 05 10:29:10 token kernel: ? wake_up_q+0x70/0x70
Feb 05 10:29:10 token kernel: ? zio_reexecute+0x430/0x430 [zfs]
Feb 05 10:29:10 token kernel: kthread+0x102/0x140
Feb 05 10:29:10 token kernel: ? taskq_thread_should_stop+0x70/0x70 [spl]
Feb 05 10:29:10 token kernel: ? kthread_associate_blkcg+0x90/0x90
Feb 05 10:29:10 token kernel: ret_from_fork+0x35/0x40
@phreaker0 Hmmmm. Can i ask how large this pool is and how sensitive the data on it is? At this point, I have checked all of the code paths that I can think of related to pool import and I can't find anything that might be causing this. Without being able to reproduce the issue myself, I'm a bit lost on what to do. If I could poke around with it I might be able to figure it out.
Feel free to email me at tcaputi@datto.com as well.
@tcaputi The pool is 1TB in size with about 80% used. It only contains encrypted datasets so I have no problem if you need access to it. But I think it is too big for sending the whole thing. Maybe I can share the disk as block device via iSCSI or something (I never used iSCSI but it sounds like it would be useful here) or do you think it would be too slow?
ISCSI will almost definitely not work over WAN. The only way I can think to do this is if you can plug the drive into a spare machine somewhere that you don't mind me having root access to (maybe passthrough to a VM). You could then expose the port over the WAN and privately email me ssh credentials to the VM so I can access it. I will need to be able to reboot this VM a few times to figure out what's going on. I understand if this is too much to ask.
no problem, i will setup a vm and email you the details when ready
Ok. I have found the cause of the issue and will be making a PR for it shortly. Basically, ZIL claiming was accidentally causing user accounting to occur even though the keys weren't loaded yet. Thanks for the help with this @phreaker0 .
System information
Describe the problem you're observing
I'm using an ZFS pool on an external USB disk for backup and also for data. Everything is encrypted by using a top level dataset with activated encryption (aes-256-gcm). Sometimes the usb disk disconnects (because the cable loosens, ...), in that case the pool i/o will be suspended. After reconnecting the drive I can resume the pool by some zpool command (i think I always uses the zpool clear one). Until the last time it disconnected, i reconnected the drive and uses the command again to resume pool i/o, but it just hanged. After a day or so I rebooted the system (all the zfs/zpool commands were stuck at this point). I tried to import the zfs pool but it couldn't because of corrupted metadata and suggested recovery import to rollback the last transactions. Attempting to do so freezes the import command and nothing is happening after some panics in the log. (I tried with the zfs_recover =1 option)
What worked is importing the pool in read only mount with recovery: zpool import -F -N -o readonly=on ext I don't care about the broken pool, as I could send all the snapshots of the data in readonly mode to another pool and most of the backup data too (one dataset snapshot send/recv failed to complete, I don't know why). But maybe there is an issue with recovery code in case of encryption?
I always use the latest kernel and zfs master version on this machine (root filesystem is also on an ZFS SSD pool)
Describe how to reproduce the problem
not easily possible I guess
Include any warning/errors/backtraces from the system logs