openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

offline / online during attach fails to produce a safe mirror #784

Open pgdh opened 3 years ago

pgdh commented 3 years ago

Try this ...

# dd if=/dev/urandom bs=1024k count=10240 of=$SOMEPATH/d1
# zpool create play $SOMEPATH/d1
# dd if=/dev/urandom bs=1024k count=8192 of=/Volumes/play/f1
# dd if=/dev/urandom bs=1024k count=10240 of=$SOMEPATH/d2
# zpool attach play $SOMEPATH/d1 $SOMEPATH/d2
(wait a few seconds)
# zpool offline play $SOMEPATH/d2
(wait a few seconds, confirm that resilver is still running with zpool status play)
# zpool online play $SOMEPATH/d2
(wait until resilver is finished, checking with zpool status play)
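
For anyone who wants to repeat this with different timings, the steps above can be sketched as one shell function. Treat it as a rough sketch rather than a tested tool: POOL, DIR (standing in for $SOMEPATH) and the five-second pauses (standing in for "wait a few seconds") are illustrative assumptions, and the /Volumes mount point is the macOS default seen above.

```shell
#!/bin/sh
# Sketch of the reproduction steps as a function. Nothing runs until the
# function is called. POOL, DIR and the sleep lengths are assumptions.
repro_attach_offline_online() {
    pool=${POOL:-play}
    dir=${DIR:?set DIR to a scratch directory with about 20 GiB free}

    # One 10 GiB file-backed vdev, a pool on it, and 8 GiB of data.
    dd if=/dev/urandom bs=1024k count=10240 of="$dir/d1"
    zpool create "$pool" "$dir/d1"
    dd if=/dev/urandom bs=1024k count=8192 of="/Volumes/$pool/f1"

    # Second vdev, attached to form a mirror; this starts the resilver.
    dd if=/dev/urandom bs=1024k count=10240 of="$dir/d2"
    zpool attach "$pool" "$dir/d1" "$dir/d2"
    sleep 5

    # Take the new side offline while the resilver is still running.
    zpool offline "$pool" "$dir/d2"
    sleep 5
    if ! zpool status "$pool" | grep -q 'resilver in progress'; then
        echo "resilver already finished; try more data or shorter waits" >&2
        return 1
    fi
    zpool online "$pool" "$dir/d2"

    # Poll until zpool status no longer reports a running resilver.
    while zpool status "$pool" | grep -q 'resilver in progress'; do
        sleep 5
    done
    zpool status "$pool"
}
```

Run it as root on a scratch machine only, for example with DIR=/some/scratch/dir set in the environment before calling repro_attach_offline_online.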

Here's one I made earlier ...

# zpool status play
  pool: play
 state: ONLINE
  scan: resilvered 1.90G in 0 days 00:00:20 with 0 errors on Fri Jan 22 17:40:44 2021
config:

    NAME                        STATE     READ WRITE CKSUM
    play                        ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        /Volumes/touch/tmp/d1   ONLINE       0     0     0
        /Volumes/touch/tmp/d2   ONLINE       0     0     0

errors: No known data errors
#

All fine and dandy, right?

But then ...

# zpool scrub play
(wait for scrub to finish, again checking with zpool status play)
# zpool status play
  pool: play
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 6.10G in 0 days 00:00:34 with 0 errors on Fri Jan 22 17:42:30 2021
config:

    NAME                        STATE     READ WRITE CKSUM
    play                        ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        /Volumes/touch/tmp/d1   ONLINE       0     0     0
        /Volumes/touch/tmp/d2   ONLINE       0     0 48.8K

errors: No known data errors
# zpool -V
zfs-1.9.4-0
zfs-kmod-1.9.4-0
# uname -a
Darwin Holistix-MBP.local 19.6.0 Darwin Kernel Version 19.6.0: Thu Oct 29 22:56:45 PDT 2020; root:xnu-6153.141.2.2~1/RELEASE_X86_64 x86_64
#

i.e. macOS Catalina 10.15.7
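
The tell-tale sign is the non-zero CKSUM count against d2 after the scrub. When repeating the test many times, that can be checked mechanically instead of by eye. A minimal sketch, assuming the five-column config layout shown above (NAME STATE READ WRITE CKSUM); the function name cksum_errors is made up for illustration:

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print "NAME CKSUM" for every
# config line whose CKSUM column is not 0. Like grep, the exit status is
# 0 if at least one such line was found and 1 otherwise.
cksum_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $5 == "CKSUM" { in_config = 1; next }
        # The table ends at the first blank line after the header.
        in_config && NF == 0 { in_config = 0 }
        # Anything other than a literal 0 counts as an error (e.g. 48.8K).
        in_config && NF >= 5 && $5 != "0" { print $1, $5; found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

Typical use after the scrub has finished would be: zpool status play | cksum_errors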

This is not reproducible on SmartOS ...

# uname -a
SunOS ingleby 5.11 joyent_20201217T173522Z i86pc i386 i86pc
# 

or Linux (Proxmox) ...

# zpool -V
zfs-0.8.5-pve1
zfs-kmod-0.8.5-pve1
# uname -a
Linux annie 5.4.73-1-pve #1 SMP PVE 5.4.73-1 (Mon, 16 Nov 2020 10:52:16 +0100) x86_64 GNU/Linux
#

The above is a contrived case, but I started investigating when it happened for real while I was slurping data between a couple of Samsung T7 Touch drives (as part of a process of turning off the T7's native encryption).

Both Linux and SmartOS stop the resilver as soon as one drive is taken offline, and restart it from the beginning when the drive is brought back online. This is what OpenZFSonOSX needs to do.

Sometime soon, I will dip my toes into the OpenZFS 2.0 port for macOS. It may well be that this bug disappears there, in which case it's another win for the unified code.