openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.69k stars 1.75k forks source link

recovery import (zpool import -F) issue with encrypted pool #6916

Closed phreaker0 closed 6 years ago

phreaker0 commented 6 years ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 16.04.3 LTS
Linux Kernel 4.14.0-pf4+
Architecture x86_64
ZFS Version v0.7.0-198_g56d8d8a
SPL Version v0.7.0-21_ged19bcc

Describe the problem you're observing

I'm using an ZFS pool on an external USB disk for backup and also for data. Everything is encrypted by using a top level dataset with activated encryption (aes-256-gcm). Sometimes the usb disk disconnects (because the cable loosens, ...), in that case the pool i/o will be suspended. After reconnecting the drive I can resume the pool by some zpool command (i think I always uses the zpool clear one). Until the last time it disconnected, i reconnected the drive and uses the command again to resume pool i/o, but it just hanged. After a day or so I rebooted the system (all the zfs/zpool commands were stuck at this point). I tried to import the zfs pool but it couldn't because of corrupted metadata and suggested recovery import to rollback the last transactions. Attempting to do so freezes the import command and nothing is happening after some panics in the log. (I tried with the zfs_recover =1 option)

What worked is importing the pool in read only mount with recovery: zpool import -F -N -o readonly=on ext I don't care about the broken pool, as I could send all the snapshots of the data in readonly mode to another pool and most of the backup data too (one dataset snapshot send/recv failed to complete, I don't know why). But maybe there is an issue with recovery code in case of encryption?

I always use the latest kernel and zfs master version on this machine (root filesystem is also on an ZFS SSD pool)

Describe how to reproduce the problem

not easily possible I guess

Include any warning/errors/backtraces from the system logs

root@token:~# zpool import
   pool: ext
     id: 15081949372965998973
  state: FAULTED
 status: The pool metadata is corrupted.
 action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://zfsonlinux.org/msg/ZFS-8000-72
 config:

        ext                                      FAULTED  corrupted data
          usb-ORICO_6518US3_0123456789ABCDE-0:0  ONLINE
root@token:~# zpool import ext
cannot import 'ext': I/O error
        Recovery is possible, but will result in some data loss.
        Returning the pool to its state as of Mon 27 Nov 2017 09:05:11 CET
        should correct the problem.  Approximately 61 seconds of data
        must be discarded, irreversibly.  Recovery can be attempted
        by executing 'zpool import -F ext'.  A scrub of the pool
        is strongly recommended after recovery.
root@token:~# zpool import -N -F ext

Nov 30 11:48:25 token kernel: SPL: Loaded module v0.7.0-21_ged19bcc
Nov 30 11:48:26 token kernel: ZFS: Loaded module v0.7.0-198_g56d8d8a, ZFS pool version 5000, ZFS filesystem version 5
Nov 30 11:49:06 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Nov 30 11:49:06 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Nov 30 11:49:06 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Nov 30 11:49:06 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Nov 30 11:49:48 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Nov 30 11:49:48 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Nov 30 11:49:48 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Nov 30 11:49:48 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Nov 30 11:49:53 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, zio->io_bookmark.zb_objset, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Nov 30 11:49:53 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, zio->io_bookmark.zb_objset, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Nov 30 11:49:53 token kernel: PANIC at zio.c:3705:zio_encrypt()
Nov 30 11:49:53 token kernel: Showing stack for process 3705
Nov 30 11:49:53 token kernel: CPU: 0 PID: 3705 Comm: z_wr_iss Tainted: P           OE   4.14.0-pf4+ #1
Nov 30 11:49:53 token kernel: PANIC at zio.c:3705:zio_encrypt()
Nov 30 11:49:53 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Nov 30 11:49:53 token kernel: Call Trace:
Nov 30 11:49:53 token kernel: Showing stack for process 3704
Nov 30 11:49:53 token kernel:  dump_stack+0x63/0x8b
Nov 30 11:49:53 token kernel:  spl_dumpstack+0x42/0x50 [spl]
Nov 30 11:49:53 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, zio->io_bookmark.zb_objset, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Nov 30 11:49:53 token kernel: PANIC at zio.c:3705:zio_encrypt()
Nov 30 11:49:53 token kernel: Showing stack for process 3703
Nov 30 11:49:53 token kernel:  spl_panic+0xc8/0x110 [spl]
Nov 30 11:49:53 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Nov 30 11:49:53 token kernel:  zio_encrypt+0x5d0/0x6b0 [zfs]
Nov 30 11:49:53 token kernel:  ? zio_buf_free+0x58/0x60 [zfs]
Nov 30 11:49:53 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Nov 30 11:49:53 token kernel:  ? __schedule+0x3ca/0x890
Nov 30 11:49:53 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Nov 30 11:49:53 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Nov 30 11:49:53 token kernel:  ? wake_up_q+0x70/0x70
Nov 30 11:49:53 token kernel:  ? zio_reexecute+0x3d0/0x3d0 [zfs]
Nov 30 11:49:53 token kernel:  kthread+0x108/0x140
Nov 30 11:49:53 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Nov 30 11:49:53 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:49:53 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:49:53 token kernel:  ret_from_fork+0x25/0x30
Nov 30 11:49:53 token kernel: CPU: 1 PID: 3704 Comm: z_wr_iss Tainted: P           OE   4.14.0-pf4+ #1
Nov 30 11:49:53 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Nov 30 11:49:53 token kernel: Call Trace:
Nov 30 11:49:53 token kernel:  dump_stack+0x63/0x8b
Nov 30 11:49:53 token kernel:  spl_dumpstack+0x42/0x50 [spl]
Nov 30 11:49:53 token kernel:  spl_panic+0xc8/0x110 [spl]
Nov 30 11:49:53 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Nov 30 11:49:53 token kernel:  zio_encrypt+0x5d0/0x6b0 [zfs]
Nov 30 11:49:53 token kernel:  ? zio_buf_free+0x58/0x60 [zfs]
Nov 30 11:49:53 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Nov 30 11:49:53 token kernel:  ? __schedule+0x3ca/0x890
Nov 30 11:49:53 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Nov 30 11:49:53 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Nov 30 11:49:53 token kernel:  ? wake_up_q+0x70/0x70
Nov 30 11:49:53 token kernel:  ? zio_reexecute+0x3d0/0x3d0 [zfs]
Nov 30 11:49:53 token kernel:  kthread+0x108/0x140
Nov 30 11:49:53 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Nov 30 11:49:53 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:49:53 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:49:53 token kernel:  ret_from_fork+0x25/0x30
Nov 30 11:49:53 token kernel: CPU: 2 PID: 3703 Comm: z_wr_iss Tainted: P           OE   4.14.0-pf4+ #1
Nov 30 11:49:53 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Nov 30 11:49:53 token kernel: Call Trace:
Nov 30 11:49:53 token kernel:  dump_stack+0x63/0x8b
Nov 30 11:49:53 token kernel:  spl_dumpstack+0x42/0x50 [spl]
Nov 30 11:49:53 token kernel:  spl_panic+0xc8/0x110 [spl]
Nov 30 11:49:53 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Nov 30 11:49:53 token kernel:  zio_encrypt+0x5d0/0x6b0 [zfs]
Nov 30 11:49:53 token kernel:  ? zio_buf_free+0x58/0x60 [zfs]
Nov 30 11:49:53 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Nov 30 11:49:53 token kernel:  ? __schedule+0x3ca/0x890
Nov 30 11:49:53 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Nov 30 11:49:53 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Nov 30 11:49:53 token kernel:  ? wake_up_q+0x70/0x70
Nov 30 11:49:53 token kernel:  ? zio_reexecute+0x3d0/0x3d0 [zfs]
Nov 30 11:49:53 token kernel:  kthread+0x108/0x140
Nov 30 11:49:53 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Nov 30 11:49:53 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:49:53 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:49:53 token kernel:  ret_from_fork+0x25/0x30
Nov 30 11:53:37 token kernel: INFO: task zpool:3532 blocked for more than 120 seconds.
Nov 30 11:53:37 token kernel:       Tainted: P           OE   4.14.0-pf4+ #1
Nov 30 11:53:37 token kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 11:53:37 token kernel: zpool           D    0  3532   2766 0x00000000
Nov 30 11:53:37 token kernel: Call Trace:
Nov 30 11:53:37 token kernel:  __schedule+0x3c2/0x890
Nov 30 11:53:37 token kernel:  schedule+0x36/0x80
Nov 30 11:53:37 token kernel:  cv_wait_common+0x101/0x140 [spl]
Nov 30 11:53:37 token kernel:  ? wait_woken+0x80/0x80
Nov 30 11:53:37 token kernel:  __cv_wait+0x15/0x20 [spl]
Nov 30 11:53:37 token kernel:  txg_wait_synced+0xdc/0x130 [zfs]
Nov 30 11:53:37 token kernel:  spa_load+0x1920/0x20d0 [zfs]
Nov 30 11:53:37 token kernel:  spa_load_best+0x182/0x280 [zfs]
Nov 30 11:53:37 token kernel:  spa_import+0x208/0x740 [zfs]
Nov 30 11:53:37 token kernel:  zfs_ioc_pool_import+0x130/0x140 [zfs]
Nov 30 11:53:37 token kernel:  zfsdev_ioctl+0x590/0x620 [zfs]
Nov 30 11:53:37 token kernel:  do_vfs_ioctl+0xa1/0x5e0
Nov 30 11:53:37 token kernel:  ? putname+0x53/0x60
Nov 30 11:53:37 token kernel:  SyS_ioctl+0x79/0x90
Nov 30 11:53:37 token kernel:  entry_SYSCALL_64_fastpath+0x1e/0xa9
Nov 30 11:53:37 token kernel: RIP: 0033:0x7fdf0bcdcf07
Nov 30 11:53:37 token kernel: RSP: 002b:00007fff01d9fbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 30 11:53:37 token kernel: RAX: ffffffffffffffda RBX: 00000000017e7420 RCX: 00007fdf0bcdcf07
Nov 30 11:53:37 token kernel: RDX: 00007fff01d9fc50 RSI: 0000000000005a02 RDI: 0000000000000003
Nov 30 11:53:37 token kernel: RBP: 0000000000000000 R08: 00000000017f6230 R09: 00000000017f4de0
Nov 30 11:53:37 token kernel: R10: 000000000000027a R11: 0000000000000246 R12: 0000000000000000
Nov 30 11:53:37 token kernel: R13: 000000000000002d R14: 00000000017e7910 R15: 00000000017e7420
Nov 30 11:53:37 token kernel: INFO: task z_wr_iss:3703 blocked for more than 120 seconds.
Nov 30 11:53:37 token kernel:       Tainted: P           OE   4.14.0-pf4+ #1
Nov 30 11:53:37 token kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 11:53:37 token kernel: z_wr_iss        D    0  3703      2 0x80000000
Nov 30 11:53:37 token kernel: Call Trace:
Nov 30 11:53:37 token kernel:  __schedule+0x3c2/0x890
Nov 30 11:53:37 token kernel:  schedule+0x36/0x80
Nov 30 11:53:37 token kernel:  spl_panic+0xfa/0x110 [spl]
Nov 30 11:53:37 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Nov 30 11:53:37 token kernel:  zio_encrypt+0x5d0/0x6b0 [zfs]
Nov 30 11:53:37 token kernel:  ? zio_buf_free+0x58/0x60 [zfs]
Nov 30 11:53:37 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Nov 30 11:53:37 token kernel:  ? __schedule+0x3ca/0x890
Nov 30 11:53:37 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Nov 30 11:53:37 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Nov 30 11:53:37 token kernel:  ? wake_up_q+0x70/0x70
Nov 30 11:53:37 token kernel:  ? zio_reexecute+0x3d0/0x3d0 [zfs]
Nov 30 11:53:37 token kernel:  kthread+0x108/0x140
Nov 30 11:53:37 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Nov 30 11:53:37 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:53:37 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:53:37 token kernel:  ret_from_fork+0x25/0x30
Nov 30 11:53:37 token kernel: INFO: task z_wr_iss:3704 blocked for more than 120 seconds.
Nov 30 11:53:37 token kernel:       Tainted: P           OE   4.14.0-pf4+ #1
Nov 30 11:53:37 token kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 11:53:37 token kernel: z_wr_iss        D    0  3704      2 0x80000000
Nov 30 11:53:37 token kernel: Call Trace:
Nov 30 11:53:37 token kernel:  __schedule+0x3c2/0x890
Nov 30 11:53:37 token kernel:  schedule+0x36/0x80
Nov 30 11:53:37 token kernel:  spl_panic+0xfa/0x110 [spl]
Nov 30 11:53:37 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Nov 30 11:53:37 token kernel:  zio_encrypt+0x5d0/0x6b0 [zfs]
Nov 30 11:53:37 token kernel:  ? zio_buf_free+0x58/0x60 [zfs]
Nov 30 11:53:37 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Nov 30 11:53:37 token kernel:  ? __schedule+0x3ca/0x890
Nov 30 11:53:37 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Nov 30 11:53:37 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Nov 30 11:53:37 token kernel:  ? wake_up_q+0x70/0x70
Nov 30 11:53:37 token kernel:  ? zio_reexecute+0x3d0/0x3d0 [zfs]
Nov 30 11:53:37 token kernel:  kthread+0x108/0x140
Nov 30 11:53:37 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Nov 30 11:53:37 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:53:37 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:53:37 token kernel:  ret_from_fork+0x25/0x30
Nov 30 11:53:37 token kernel: INFO: task z_wr_iss:3705 blocked for more than 120 seconds.
Nov 30 11:53:37 token kernel:       Tainted: P           OE   4.14.0-pf4+ #1
Nov 30 11:53:37 token kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 11:53:37 token kernel: z_wr_iss        D    0  3705      2 0x80000000
Nov 30 11:53:37 token kernel: Call Trace:
Nov 30 11:53:37 token kernel:  __schedule+0x3c2/0x890
Nov 30 11:53:37 token kernel:  schedule+0x36/0x80
Nov 30 11:53:37 token kernel:  spl_panic+0xfa/0x110 [spl]
Nov 30 11:53:37 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Nov 30 11:53:37 token kernel:  zio_encrypt+0x5d0/0x6b0 [zfs]
Nov 30 11:53:37 token kernel:  ? zio_buf_free+0x58/0x60 [zfs]
Nov 30 11:53:37 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Nov 30 11:53:37 token kernel:  ? __schedule+0x3ca/0x890
Nov 30 11:53:37 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Nov 30 11:53:37 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Nov 30 11:53:37 token kernel:  ? wake_up_q+0x70/0x70
Nov 30 11:53:37 token kernel:  ? zio_reexecute+0x3d0/0x3d0 [zfs]
Nov 30 11:53:37 token kernel:  kthread+0x108/0x140
Nov 30 11:53:37 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Nov 30 11:53:37 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:53:37 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:53:37 token kernel:  ret_from_fork+0x25/0x30
Nov 30 11:53:37 token kernel: INFO: task txg_sync:3850 blocked for more than 120 seconds.
Nov 30 11:53:37 token kernel:       Tainted: P           OE   4.14.0-pf4+ #1
Nov 30 11:53:37 token kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 11:53:37 token kernel: txg_sync        D    0  3850      2 0x80000000
Nov 30 11:53:37 token kernel: Call Trace:
Nov 30 11:53:37 token kernel:  __schedule+0x3c2/0x890
Nov 30 11:53:37 token kernel:  ? zio_nowait+0xb6/0x150 [zfs]
Nov 30 11:53:37 token kernel:  schedule+0x36/0x80
Nov 30 11:53:37 token kernel:  io_schedule+0x16/0x40
Nov 30 11:53:37 token kernel:  cv_wait_common+0xb2/0x140 [spl]
Nov 30 11:53:37 token kernel:  ? wait_woken+0x80/0x80
Nov 30 11:53:37 token kernel:  __cv_wait_io+0x18/0x20 [spl]
Nov 30 11:53:37 token kernel:  zio_wait+0xfd/0x1b0 [zfs]
Nov 30 11:53:37 token kernel:  dsl_pool_sync+0x1d1/0x420 [zfs]
Nov 30 11:53:37 token kernel:  spa_sync+0x41d/0xda0 [zfs]
Nov 30 11:53:37 token kernel:  txg_sync_thread+0x2d4/0x4a0 [zfs]
Nov 30 11:53:37 token kernel:  ? txg_delay+0x170/0x170 [zfs]
Nov 30 11:53:37 token kernel:  thread_generic_wrapper+0x72/0x80 [spl]
Nov 30 11:53:37 token kernel:  kthread+0x108/0x140
Nov 30 11:53:37 token kernel:  ? __thread_exit+0x20/0x20 [spl]
Nov 30 11:53:37 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:53:37 token kernel:  ? kthread_create_on_node+0x70/0x70
Nov 30 11:53:37 token kernel:  ret_from_fork+0x25/0x30
tcaputi commented 6 years ago

I'll try to get a USB drive on Monday and look into this. The error here is that the code is attempting to encrypt new data before you've loaded your keys. The zpool import -F code must be attempting to write some data somewhere as part of the recovery process. I'll see if I can figure out what's going on.

tcaputi commented 6 years ago

I think I know what's going on. This should be a simple 1 line fix. Unfortunately, the problem is very hard for me to reproduce and I have not been able to test if it works yet. I will add it to #6864

phreaker0 commented 6 years ago

That sounds good, thx for your work. I still have the original pool and also a dd backup of it, so I can test any patch you provide.

tcaputi commented 6 years ago

If you have it and have some time to test it: https://pastebin.com/wBJV59a8

phreaker0 commented 6 years ago

I will, but the disk is at work so it takes till tomorrow.

phreaker0 commented 6 years ago

Unfortunately I get the same Panic again with the patch applied:

Dez 05 09:19:53 token kernel: SPL: Loaded module v0.7.0-21_ged19bcc
Dez 05 09:19:53 token kernel: icp: module license 'CDDL' taints kernel.
Dez 05 09:19:53 token kernel: Disabling lock debugging due to kernel taint
Dez 05 09:19:54 token kernel: ZFS: Loaded module v0.7.0-204_g7b34070, ZFS pool version 5000, ZFS filesystem version 5
Dez 05 09:21:01 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Dez 05 09:21:08 token kernel: ZFS: Unable to set "noop" scheduler for /dev/disk/by-id/usb-ORICO_6518US3_0123456789ABCDE-0:0-part1 (sdb): 256
Dez 05 09:22:20 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, zio->io_bookmark.zb_objset, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Dez 05 09:22:20 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, zio->io_bookmark.zb_objset, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Dez 05 09:22:20 token kernel: PANIC at zio.c:3705:zio_encrypt()
Dez 05 09:22:20 token kernel: Showing stack for process 3575
Dez 05 09:22:20 token kernel: CPU: 2 PID: 3575 Comm: z_wr_iss Tainted: P           OE   4.14.0-pf5+ #1
Dez 05 09:22:20 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Dez 05 09:22:20 token kernel: PANIC at zio.c:3705:zio_encrypt()
Dez 05 09:22:20 token kernel: Call Trace:
Dez 05 09:22:20 token kernel: Showing stack for process 3574
Dez 05 09:22:20 token kernel:  dump_stack+0x63/0x8b
Dez 05 09:22:20 token kernel:  spl_dumpstack+0x42/0x50 [spl]
Dez 05 09:22:20 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, zio->io_bookmark.zb_objset, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Dez 05 09:22:20 token kernel: PANIC at zio.c:3705:zio_encrypt()
Dez 05 09:22:20 token kernel: Showing stack for process 3576
Dez 05 09:22:20 token kernel:  spl_panic+0xc8/0x110 [spl]
Dez 05 09:22:20 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Dez 05 09:22:20 token kernel:  zio_encrypt+0x5d0/0x6b0 [zfs]
Dez 05 09:22:20 token kernel:  ? zio_buf_free+0x58/0x60 [zfs]
Dez 05 09:22:20 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Dez 05 09:22:20 token kernel:  ? __schedule+0x3ca/0x890
Dez 05 09:22:20 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Dez 05 09:22:20 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Dez 05 09:22:20 token kernel:  ? wake_up_q+0x70/0x70
Dez 05 09:22:20 token kernel:  ? zio_reexecute+0x3d0/0x3d0 [zfs]
Dez 05 09:22:20 token kernel:  kthread+0x108/0x140
Dez 05 09:22:20 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Dez 05 09:22:20 token kernel:  ? kthread_create_on_node+0x70/0x70
Dez 05 09:22:20 token kernel:  ret_from_fork+0x25/0x30
...
tcaputi commented 6 years ago

Hmmmm. And you're sure that you are using the new module? That's unfortunate. I was really pretty sure that this would have fixed the problem..... I'll try to come up with some steps to reproduce it.

phreaker0 commented 6 years ago

I just recompiled to make sure I was, but same result. I'm using the DKMS method and I verified that the source for dkms (/usr/src/zfs-0.7.0/module/zfs/zil.c) contains the patch.

tcaputi commented 6 years ago

Ok. And I'm assuming you re-inserted the module after installing. DKMS doesn't actually do this for you so you would still be running the old code. But if you rebooted it should load the (new) installed version.

I will see if I can reproduce the issue locally.

phreaker0 commented 6 years ago

Yeah, I rebooted to make sure.

phreaker0 commented 6 years ago

I tested with the latest master version (v0.7.0-277_g0d23f5e) and unfortunately I still have the same issue:

root@token:~$ zpool import
   pool: ext
     id: 15081949372965998973
  state: FAULTED
 status: The pool metadata is corrupted.
 action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://zfsonlinux.org/msg/ZFS-8000-72
 config:

        ext         FAULTED  corrupted data
          sdb       ONLINE
root@token:~$ zpool import ext
cannot import 'ext': I/O error
        Recovery is possible, but will result in some data loss.
        Returning the pool to its state as of Mo 27 Nov 2017 09:05:11 CET
        should correct the problem.  Approximately 61 seconds of data
        must be discarded, irreversibly.  Recovery can be attempted
        by executing 'zpool import -F ext'.  A scrub of the pool
        is strongly recommended after recovery.
root@token:~$ zpool import -N -F ext
Feb 05 10:26:42 token kernel:  sdb: sdb1 sdb9
Feb 05 10:26:42 token kernel: sd 6:0:0:0: [sdb] Attached SCSI disk
Feb 05 10:26:42 token kernel: SPL: Loaded module v0.7.0-24_g23602fd
Feb 05 10:26:42 token kernel: icp: module license 'CDDL' taints kernel.
Feb 05 10:26:42 token kernel: Disabling lock debugging due to kernel taint
Feb 05 10:26:44 token kernel: ZFS: Loaded module v0.7.0-277_g0d23f5e, ZFS pool version 5000, ZFS filesystem version 5
Feb 05 10:27:15 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:28:07 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:28:07 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:28:07 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:28:07 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:01 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:01 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:01 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:02 token kernel: ZFS: Unable to set "noop" scheduler for /dev/sdb1 (sdb): 256
Feb 05 10:29:10 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, dsobj, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Feb 05 10:29:10 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, dsobj, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Feb 05 10:29:10 token kernel: PANIC at zio.c:3844:zio_encrypt()
Feb 05 10:29:10 token kernel: Showing stack for process 26845
Feb 05 10:29:10 token kernel: PANIC at zio.c:3844:zio_encrypt()
Feb 05 10:29:10 token kernel: CPU: 3 PID: 26845 Comm: z_wr_iss Tainted: P           OE    4.15.0-pf1+ #1
Feb 05 10:29:10 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Feb 05 10:29:10 token kernel: Showing stack for process 26846
Feb 05 10:29:10 token kernel: Call Trace:
Feb 05 10:29:10 token kernel:  dump_stack+0x63/0x8b
Feb 05 10:29:10 token kernel:  spl_dumpstack+0x42/0x50 [spl]
Feb 05 10:29:10 token kernel:  spl_panic+0xc8/0x110 [spl]
Feb 05 10:29:10 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Feb 05 10:29:10 token kernel:  ? spl_kmem_cache_free+0x139/0x1c0 [spl]
Feb 05 10:29:10 token kernel:  zio_encrypt+0x68a/0x770 [zfs]
Feb 05 10:29:10 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Feb 05 10:29:10 token kernel:  ? __schedule+0x3cd/0x890
Feb 05 10:29:10 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Feb 05 10:29:10 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Feb 05 10:29:10 token kernel:  ? wake_up_q+0x70/0x70
Feb 05 10:29:10 token kernel:  ? zio_reexecute+0x430/0x430 [zfs]
Feb 05 10:29:10 token kernel:  kthread+0x102/0x140
Feb 05 10:29:10 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Feb 05 10:29:10 token kernel:  ? kthread_associate_blkcg+0x90/0x90
Feb 05 10:29:10 token kernel:  ? kthread_associate_blkcg+0x90/0x90
Feb 05 10:29:10 token kernel:  ret_from_fork+0x35/0x40
Feb 05 10:29:10 token kernel: CPU: 1 PID: 26846 Comm: z_wr_iss Tainted: P           OE    4.15.0-pf1+ #1
Feb 05 10:29:10 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Feb 05 10:29:10 token kernel: Call Trace:
Feb 05 10:29:10 token kernel:  dump_stack+0x63/0x8b
Feb 05 10:29:10 token kernel:  spl_dumpstack+0x42/0x50 [spl]
Feb 05 10:29:10 token kernel:  spl_panic+0xc8/0x110 [spl]
Feb 05 10:29:10 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Feb 05 10:29:10 token kernel:  ? spl_kmem_cache_free+0x139/0x1c0 [spl]
Feb 05 10:29:10 token kernel:  zio_encrypt+0x68a/0x770 [zfs]
Feb 05 10:29:10 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Feb 05 10:29:10 token kernel:  ? __schedule+0x3cd/0x890
Feb 05 10:29:10 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Feb 05 10:29:10 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Feb 05 10:29:10 token kernel:  ? wake_up_q+0x70/0x70
Feb 05 10:29:10 token kernel:  ? zio_reexecute+0x430/0x430 [zfs]
Feb 05 10:29:10 token kernel:  kthread+0x102/0x140
Feb 05 10:29:10 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Feb 05 10:29:10 token kernel:  ? kthread_associate_blkcg+0x90/0x90
Feb 05 10:29:10 token kernel:  ret_from_fork+0x35/0x40
Feb 05 10:29:10 token kernel: VERIFY3(0 == spa_do_crypt_abd(B_TRUE, spa, dsobj, bp, zio->io_txg, psize, zio->io_abd, eabd, iv, mac, salt, &no_crypt)) failed (0 == 2)
Feb 05 10:29:10 token kernel: PANIC at zio.c:3844:zio_encrypt()
Feb 05 10:29:10 token kernel: Showing stack for process 26847
Feb 05 10:29:10 token kernel: CPU: 1 PID: 26847 Comm: z_wr_iss Tainted: P           OE    4.15.0-pf1+ #1
Feb 05 10:29:10 token kernel: Hardware name: Acer Aspire V5-572G/Dazzle_CX , BIOS V2.07 07/12/2013
Feb 05 10:29:10 token kernel: Call Trace:
Feb 05 10:29:10 token kernel:  dump_stack+0x63/0x8b
Feb 05 10:29:10 token kernel:  spl_dumpstack+0x42/0x50 [spl]
Feb 05 10:29:10 token kernel:  spl_panic+0xc8/0x110 [spl]
Feb 05 10:29:10 token kernel:  ? spa_do_crypt_abd+0x80/0x2f0 [zfs]
Feb 05 10:29:10 token kernel:  ? spl_kmem_cache_free+0x139/0x1c0 [spl]
Feb 05 10:29:10 token kernel:  zio_encrypt+0x68a/0x770 [zfs]
Feb 05 10:29:10 token kernel:  ? zio_write_compress+0x496/0x6a0 [zfs]
Feb 05 10:29:10 token kernel:  ? __wake_up_common_lock+0x8e/0xc0
Feb 05 10:29:10 token kernel:  zio_execute+0x8a/0xf0 [zfs]
Feb 05 10:29:10 token kernel:  taskq_thread+0x2ab/0x4d0 [spl]
Feb 05 10:29:10 token kernel:  ? wake_up_q+0x70/0x70
Feb 05 10:29:10 token kernel:  ? zio_reexecute+0x430/0x430 [zfs]
Feb 05 10:29:10 token kernel:  kthread+0x102/0x140
Feb 05 10:29:10 token kernel:  ? taskq_thread_should_stop+0x70/0x70 [spl]
Feb 05 10:29:10 token kernel:  ? kthread_associate_blkcg+0x90/0x90
Feb 05 10:29:10 token kernel:  ret_from_fork+0x35/0x40
tcaputi commented 6 years ago

@phreaker0 Hmmmm. Can i ask how large this pool is and how sensitive the data on it is? At this point, I have checked all of the code paths that I can think of related to pool import and I can't find anything that might be causing this. Without being able to reproduce the issue myself, I'm a bit lost on what to do. If I could poke around with it I might be able to figure it out.

Feel free to email me at tcaputi@datto.com as well.

phreaker0 commented 6 years ago

@tcaputi The pool is 1TB in size with about 80% used. It only contains encrypted datasets so I have no problem if you need access to it. But I think it is too big for sending the whole thing. Maybe I can share the disk as block device via iSCSI or something (I never used iSCSI but it sounds like it would be useful here) or do you think it would be too slow?

tcaputi commented 6 years ago

ISCSI will almost definitely not work over WAN. The only way I can think to do this is if you can plug the drive into a spare machine somewhere that you don't mind me having root access to (maybe passthrough to a VM). You could then expose the port over the WAN and privately email me ssh credentials to the VM so I can access it. I will need to be able to reboot this VM a few times to figure out what's going on. I understand if this is too much to ask.

phreaker0 commented 6 years ago

no problem, i will setup a vm and email you the details when ready

tcaputi commented 6 years ago

Ok. I have found the cause of the issue and will be making a PR for it shortly. Basically, ZIL claiming was accidentally causing user accounting to occur even though the keys weren't loaded yet. Thanks for the help with this @phreaker0 .