openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

RAIDZ1: unable to replace a drive with itself #2076

Closed mcrbids closed 1 year ago

mcrbids commented 10 years ago

Trying to simulate failure scenarios with a 3+1 RAIDZ1 array in order to prepare for eventualities.

# zpool create -o ashift=12 spfstank raidz1 sda sdb sdc sdd
# zfs create spfstank/part
# dd if=/dev/random of=/spfstank/part/output.txt bs=1024 count=10000

I then manually pull out /dev/sdc without shutting anything down. As expected, zpool status shows the drive in a bad state:

# zpool status 
-- SNIP -- 
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637  UNAVAIL     16   122     0  corrupted data
-- SNIP -- 

This status doesn't change when I re-insert the drive. Next, I want to simulate re-introducing a drive that is badly incoherent relative to the state of the ZFS pool. So, after making sure the drive is "offline", I introduce a raft of changes:

# zpool offline spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# dd if=/dev/zero of=/dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 bs=1024 count=100000

102 MB of changes, to be exact. Now, I want to re-introduce the drive to the pool and get ZFS to work it out. At this point, the status of the drive is:

# zpool status 
-- SNIP -- 
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637  OFFLINE     16   122     0
-- SNIP -- 

I try to replace the drive with itself:

# zpool replace spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 -f 
cannot replace ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 with ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637: ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 is busy

# zpool replace spfstank /dev/sdc /dev/sdc -f 
invalid vdev specification
the following errors must be manually repaired:
/dev/sdc1 is part of active pool 'spfstank'

# zpool replace spfstank /dev/sdc  -f 
invalid vdev specification
the following errors must be manually repaired:
/dev/sdc1 is part of active pool 'spfstank'

I was able to "fix" this with:

# zpool online spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# zpool clear spfstank
# /sbin/zpool scrub spfstank

During the scrub, the status of the drive changes:

zpool status 
-- SNIP -- 
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637  ONLINE       0     0     9  (repairing)
-- SNIP -- 

There doesn't seem to be a way to "replace" a known incoherent drive with itself.

dweeezil commented 10 years ago

You didn't corrupt the disk enough. The dd left the 3rd and 4th copies of the labels intact, so it's still being recognized as part of the pool. All you need to do in this case is zpool online it. The only parts of a vdev stored at fixed locations are the labels: two at the beginning and two near the end. As long as any one of them is intact the disk is still recognized, and because all metadata is kept in multiple copies, you'd need extremely severe damage to prevent a simple "online" from working.
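
(For anyone following along: zdb can show which of the four labels are still readable. The path below is just an example based on the device in this report; on a whole-disk vdev the labels normally live on partition 1. On the 0.6.x releases of that era, zdb -l prints LABEL 0 through LABEL 3 and reports any label it fails to unpack.)

# zdb -l /dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637-part1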

nedbass commented 10 years ago

Or as was mentioned on the mailing list, zpool labelclear -f /dev/sdc should let zpool replace work to simulate a drive swap.
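
(Sketching that out with the device names from this report; note that, as comes up later in this thread, labelclear may need to be pointed at the first partition, e.g. /dev/sdc1, rather than the whole disk:)

# zpool offline spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# zpool labelclear -f /dev/sdc
# zpool replace spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 /dev/sdc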

joshenders commented 10 years ago

@dweeezil is it safe to assume that after the disk is onlined and the scrub finishes, the disk in the FAULTED state will return to the ONLINE state? UPDATE: After the scrub completed, the disk is still in the FAULTED state.

This may be better mailing-list fodder, but I'm noticing similar behavior to @mcrbids and I believe this is on topic. I hope you don't mind.

Here is the zpool configuration:

config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     0
          raidz2-0  ONLINE       0     0     0
            A0      ONLINE       0     0     0
            B0      ONLINE       0     0     0
            C0      ONLINE       0     0     0
            D0      ONLINE       0     0     0
            E0      ONLINE       0     0     0
            F0      ONLINE       0     0     0
          raidz2-1  DEGRADED     0     0     0
            A1      OFFLINE      0     0     0
            B1      ONLINE       0     0     0
            C1      ONLINE       0     0     0
            D1      ONLINE       0     0     0
            E1      ONLINE       0     0     0
            F1      ONLINE       0     0     0
          raidz2-2  DEGRADED     0     0     0
            A2      ONLINE       0     0     0
            B2      ONLINE       0     0     0
            C2      OFFLINE      0     0     0
            D2      ONLINE       0     0     0
            E2      ONLINE       0     0     0
            F2      OFFLINE      0     0     0
          raidz2-3  ONLINE       0     0     0
            A3      ONLINE       0     0     0
            B3      ONLINE       0     0     0
            C3      ONLINE       0     0     0
            D3      ONLINE       0     0     0
            E3      ONLINE       0     0     0
            F3      ONLINE       0     0     0

I have attempted to "borrow" a disk from one of the vdevs still at N+2 (raidz2-1) and give it to the vdev that is down to N (raidz2-2), by offlining A1 and zeroing the first few hundred megs.

# zpool offline data A1
# dd if=/dev/zero of=/dev/disk/by-vdev/A1 bs=64M count=10

I then edited my /etc/zfs/vdev_id.conf so that udev will give A1 the label of C2, and commented out the existing line that defines C2.
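
(The edit might look something like this; the by-path values are made up purely for illustration:)

# /etc/zfs/vdev_id.conf
# alias C2  /dev/disk/by-path/pci-0000:03:00.0-sas-phy10-lun-0   # old C2 slot, commented out
alias C2  /dev/disk/by-path/pci-0000:03:00.0-sas-phy0-lun-0      # the slot that previously held A1

Running udevadm trigger afterwards regenerates the /dev/disk/by-vdev/C2 symlink.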

I then removed A1 and C2 and placed A1 in C2's drive tray. I reconnected the new C2. udev triggers and /dev/disk/by-vdev/C2 now exists.

# ls -l /dev/disk/by-vdev/C2*
lrwxrwxrwx 1 root root 9 Jul 29 16:15 /dev/disk/by-vdev/C2 -> ../../sdu

When I attempt to replace the offlined C2 with the new C2, however, I get a message that C2 is busy, and the disk is automatically partitioned (by ZFS, I assume).

# zpool replace data C2 /dev/disk/by-vdev/C2
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-vdev/C2 contains a corrupt primary EFI label.
# zpool replace -f data C2 /dev/disk/by-vdev/C2
cannot replace C2 with /dev/disk/by-vdev/C2: /dev/disk/by-vdev/C2 is busy
# ls -l /dev/disk/by-vdev/C2*
lrwxrwxrwx 1 root root  9 Jul 29 16:16 /dev/disk/by-vdev/C2 -> ../../sdu
lrwxrwxrwx 1 root root 10 Jul 29 16:16 /dev/disk/by-vdev/C2-part1 -> ../../sdu1
lrwxrwxrwx 1 root root 10 Jul 29 16:16 /dev/disk/by-vdev/C2-part9 -> ../../sdu9

Note that the "corrupt primary EFI label" message is always present, even with brand-new disks that have never touched the system. Not sure what that is about. I always have to use -f when replacing.

If I had to take a guess, this has something to do with the fact that I created the pool with the /dev/disk/by-vdev/ labels and not /dev/disk/by-id/. ZFS sees the path /dev/disk/by-vdev/C2 and assumes it is just badly damaged (and, as I've learned from this thread, a label still exists at a location beyond the first several hundred megs I overwrote). Am I close here? UPDATE: It doesn't appear to be related to which symlink was used when referencing the disk.

Would the correct course of action in replacing a disk this way be to just zpool online the "borrowed" disk, if I need to borrow disks from other vdevs in the future? UPDATE: No. zpool online will not resilver the faulted disk, and zpool replace will not allow disk reuse within the pool, which I believe to be a bug.

joshenders commented 10 years ago

I think there might actually be a bug here as of 0.6.3. Even if I zpool labelclear the disk, I still cannot use it as a replacement in this pool.

# zpool replace -f data C2 /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX
invalid vdev specification
the following errors must be manually repaired:
/dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX is part of active pool '

As seen in the post above, the system automatically partitions the drive without my intervention. There must be some signaling beyond the zfs label on the drive that informs zfs that this disk is/was a member of this pool.

After I zero'd the drive fully with dd, I was able to use it as a replacement disk.

...
          raidz2-2       DEGRADED     0     0     0
            A2           ONLINE       0     0     0
            B2           ONLINE       0     0     0
            replacing-2  OFFLINE      0     0     0
              old        OFFLINE      0     0     0
              C2         ONLINE       0     0     0  (resilvering)
            D2           ONLINE       0     0     0
            E2           ONLINE       0     0     0
            F2           OFFLINE      0     0     0
...
DeHackEd commented 10 years ago

You have to zpool labelclear the partition on the disk, not just the whole disk. Even if you give ZFS a whole disk it makes partitions on it and you have to clear those.
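
(In the case above that would presumably be the partition ZFS created on the new C2, e.g.:)

# zpool labelclear -f /dev/disk/by-vdev/C2-part1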

joshenders commented 10 years ago

Noted. That's a lot less time consuming than wiping the disk. Thanks!

gitbisector commented 10 years ago

zpool labelclear scsi-SATA_ST3000DM001-1CH_XXXXXXX-part1 complains about the disk being part of an active pool too. I tried that after a zpool offline /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX.

To work around this I moved the disk to another system and did the zpool labelclear there.

After that, zpool replace -f tank scsi-SATA_ST3000DM001-1CH_XXXXXXX /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX got me to resilvering.

gordan-bobic commented 8 years ago

It would be really handy to be able to do this without physically removing the disk. A prime example of the use case is when changing partitions around, e.g. dropping a partition to make more space for a zfs one.

Spongman commented 7 years ago

I'm running into this as well. I don't understand: how is this no longer considered a bug?

labelclear is clearly broken: it's impossible to clear a partition that was created as part of a whole-disk pool.

Also, labelclear -f'ing the drive doesn't do enough to prevent the error "does not contain an EFI label but it may contain information in the MBR".

Why is it even necessary for the user to reason about partitions that they didn't create?

rueberger commented 6 years ago

I believe I'm running into this problem.

sudo zpool status

 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 29h17m with 0 errors on Mon Jun 11 05:41:41 2018
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            sda-enc  ONLINE       0     0     0
            sdb-enc  ONLINE       0     0     0
            sdc-enc  ONLINE       0     0     0
        logs
          log        ONLINE       0     0     0
        cache
          cache      FAULTED      0     0     0  corrupted data
          cache      ONLINE       0     0     0

errors: No known data errors

The cache is a logical volume on a LUKS drive. I must have done something wrong with the setup, and it is not properly recognized on reboot.

sudo zpool replace -f tank cache /dev/disk/by-id/dm-name-ws1--vg-cache

cannot open '/dev/disk/by-id/dm-name-ws1--vg-cache': Device or resource busy
cannot replace cache with /dev/disk/by-id/dm-name-ws1--vg-cache: no such device in pool

sudo zpool labelclear /dev/disk/by-id/dm-name-ws1--vg-cache

labelclear operation failed.
        Vdev /dev/disk/by-id/dm-name-ws1--vg-cache is a member (L2CACHE), of pool "tank".
        To remove label information from this device, export or destroy
        the pool, or remove /dev/disk/by-id/dm-name-ws1--vg-cache from the configuration of this pool
        and retry the labelclear operation.

Any insights greatly appreciated.

EDIT: I should clarify that the cache seems to be in use, which explains why the device is busy. So maybe it's just a minor annoyance that the old cache device can't be removed?

EDIT: Sorry, I must have just been being dumb about the paths... I was able to remove the degraded device with sudo zpool remove tank /dev/ws1-vg/cache.

shevek commented 6 years ago

I have this issue too. I can't labelclear an offline disk to reinsert it in the pool.

shevek commented 6 years ago

Workaround: strace -e pread64 zdb -l $DEV >/dev/null

Gives a bunch of offsets:

pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 0) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 262144) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127614976) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127877120) = 262144

Clout these offsets with dd and charlie's your uncle.

Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.

TheLinuxGuy commented 5 years ago

Clout these offsets with dd and charlie's your uncle.

Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.

For the uninitiated, do you have a sample command, and do we need to divide the numbers by the disk block size? E.g., offset 12000127614976 from your example divided by a 512-byte block size = 23437749248.

shevek commented 5 years ago

You don't need optimality, just firepower. Use dd with byte units and no division is required. Anyway, I can't math.

devZer0 commented 4 years ago

Did anybody try wipefs? It also seems to be able to remove ZFS information from the disks without overwriting them as a whole...

scintilla13 commented 4 years ago

I've tried wipefs -a and it doesn't work.

dev-sngy commented 4 years ago

Did anybody try wipefs? It also seems to be able to remove ZFS information from the disks without overwriting them as a whole...

According to the man page:

   When option -a is used, all magic strings that are visible for
   libblkid are erased. In this case the wipefs scans the device again
   after each modification (erase) until no magic string is found.

   Note that by default wipefs does not erase nested partition tables on
   non-whole disk devices.  For this the option --force is required.

So I tried:

wipefs -all --force

But that didn't work for me...

dev-sngy commented 4 years ago

Workaround: strace -e pread64 zdb -l $DEV >/dev/null

Gives a bunch of offsets:

pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 0) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 262144) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127614976) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127877120) = 262144

Clout these offsets with dd and charlie's your uncle.

Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.

These sayings like "firearm" and "charlie's your uncle" are not at all intuitive for a foreigner like me :(

Can you provide an example dd command for the more unlearned of us out here? (i.e., to clarify which parameter from this strace output is used for what.)

Thanks in advance.

shevek commented 4 years ago

Translating the pread() results from MY drives roughly into dd commands gives:

dd if=/dev/zero of=$DEV bs=1 seek=0 count=262144
dd if=/dev/zero of=$DEV bs=1 seek=262144 count=262144
dd if=/dev/zero of=$DEV bs=1 seek=12000127614976 count=262144
dd if=/dev/zero of=$DEV bs=1 seek=12000127877120 count=262144

However, the pread() values will differ for YOUR drive(s), so I strongly recommend you learn to load and aim your own firearm. The trick with dd is to use bs=1 when you don't want performance and can't do mathematics (like me).
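
(A somewhat faster variant, assuming a GNU dd that supports oflag=seek_bytes so that seek= is taken in bytes while each 256 KiB label region is written as a single block; the offsets are still the ones from MY strace output and will differ for yours:)

for off in 0 262144 12000127614976 12000127877120; do
    dd if=/dev/zero of=$DEV bs=262144 count=1 seek=$off oflag=seek_bytes
done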

mddeff commented 4 years ago

@shevek - floor sufficiently Swiss-cheesed from weapons fire, and still no joy. (Edit: see end of comment.)

Background

dozer1 had 2 disks in a mirror, sds1 and sdr1. At some point sdl (previously a USB drive) was removed, and either through a reboot or some other means, udev moved sds to sdl. The disk is 14.6T; a full dd would take 3.16 days.

Initial status

[root@fs01 etc]# zpool status
  pool: dozer1
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: none requested
config:

    NAME                      STATE     READ WRITE CKSUM
    dozer1                    DEGRADED     0     0     0
      mirror-0                DEGRADED     0     0     0
        sdr                   ONLINE       0     0     0
        17256646544208471230  OFFLINE      0     0     0  was /dev/sds1

Trying to clear


[root@fs01 etc]# strace -e pread64 zdb -l /dev/sdl >/dev/null
pread64(5, "\0\1\0\0\0\0\0\0\1\0\0\0000\0\0\0\7\0\0\0\1\0\0\0\23\0\0\0doze"..., 13920, 0) = 13920
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 0) = 262144
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 262144) = 262144
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 16000900136960) = 262144
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 16000900399104) = 262144
+++ exited with 2 +++
[root@fs01 etc]# for f in 0 262144 16000900136960 16000900399104; do dd if=/dev/zero of=/dev/sdl bs=1 seek=$f count=262144; done
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.507745 s, 516 kB/s
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.508549 s, 515 kB/s
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.499234 s, 525 kB/s
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.496669 s, 528 kB/s

[root@fs01 etc]# partprobe /dev/sdl
### LSBLK shows sdl has no partitions, so far so good

[root@fs01 etc]# zpool replace -f dozer1 17256646544208471230 /dev/sdl
cannot replace 17256646544208471230 with /dev/sdl: /dev/sdl is busy, or device removal is in progress
### LSBLK shows:
...
sdl               8:176  0  14.6T  0 disk 
├─sdl1            8:177  0  14.6T  0 part 
└─sdl9            8:185  0     8M  0 part 
...
[root@fs01 etc]# zpool replace -f dozer1 17256646544208471230 /dev/sdl
invalid vdev specification
the following errors must be manually repaired:
/dev/sdl1 is part of active pool 'dozer1'
[root@fs01 etc]# zpool labelclear -f /dev/sdl1
/dev/sdl1 is a member (ACTIVE) of pool "dozer1"

When I try to offline/delete /dev/sdl1, ZFS says it's not in the pool (I'm assuming because it's checking the cache?). When I try to add it, it checks the metadata and says it's already part of the pool.

Success!

So doing a zpool detach dozer1 17256646544208471230 and then zpool attach dozer1 /dev/sdr /dev/sdl worked like a charm! Crumbs for those who need it.
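
(Spelled out, with the caveat that zpool detach only works on mirror and replacing vdevs, so this route isn't available for the raidz case this issue was opened about:)

# zpool detach dozer1 17256646544208471230   # drop the stale mirror member by its GUID
# zpool attach dozer1 /dev/sdr /dev/sdl      # attach the wiped disk to the surviving disk and resilver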

That being said, the fact that labelclear doesn't work as intended is still an issue.

newboydj169 commented 3 years ago

Hi Deff, when I issue this command, zpool tells me: cannot detach 6936166286967168998: no valid replicas. 6936166286967168998 is the faulted disk that I want to replace or remove. Any ideas? Thanks in advance.

Joe

So doing a zpool detach dozer1 17256646544208471230 and then zpool attach dozer1 /dev/sdr /dev/sdl worked like a charm! Crumbs for those who need it.

HLeithner commented 3 years ago

I had the same problem this weekend. I tried all the variants, but in the end I was unable to replace the disk with itself. I was able to import the pool with the "broken" disk but got checksum errors on this disk (of course, because the data was wrong). ZFS was not able to rebuild the data on this disk with a scrub; after a couple of minutes of resilvering it produced too many checksum errors and set the device as failed. Clearing this status doesn't help much, even doing it every minute. The resilvering completed, but there were still checksum errors.

In the end I destroyed the pool and recreated it from scratch. At first I just wanted to add the disk again; later I wanted to see whether ZFS is robust and admin-friendly enough to fix such a situation... sadly not.

joshenders commented 3 years ago

I had the same problem this weekend. I tried all the variants, but in the end I was unable to replace the disk with itself.

I was able to import the pool with the "broken" disk but got checksum errors on this disk (of course, because the data was wrong).

ZFS was not able to rebuild the data on this disk with a scrub; after a couple of minutes of resilvering it produced too many checksum errors and set the device as failed. Clearing this status doesn't help much, even doing it every minute. The resilvering completed, but there were still checksum errors.

In the end I destroyed the pool and recreated it from scratch. At first I just wanted to add the disk again; later I wanted to see whether ZFS is robust and admin-friendly enough to fix such a situation... sadly not.

Understandable as you're new to ZFS but this sounds like a bit of PEBKAC and not any fault of the underlying technology.

If your disk has unrecoverable read or write errors which are surfacing as checksum errors in ZFS, you shouldn't be attempting to replace it with itself, you should be replacing it with a known-good spare.

This robust and admin-friendly filesystem is trying to save your data from you, yourself. This bug is a very low-priority, unusual corner case. Let's not insult the thankless hard work of others in bug reports, please.

HLeithner commented 3 years ago

Understandable as you're new to ZFS but this sounds like a bit of PEBKAC and not any fault of the underlying technology.

Wrong assumption, and it sounds aggressive, but OK.

If your disk has unrecoverable read or write errors which are surfacing as checksum errors in ZFS, you shouldn't be attempting to replace it with itself, you should be replacing it with a known-good spare.

Actually, checking whether the drive is defective is the challenge here. I replaced the drive with a new one and got the same problem, so I moved the old disk to a new system and tested it there. Since the drive worked without problems on the other system, I moved it back to the original system in another drive bay, and that worked. So it seems a cable, the controller, or the backplane is broken. But I was unable to reuse the drive in the old pool because of the error above. I'm not sure this is really a corner case.

This robust and admin-friendly filesystem is trying to save your data from you, yourself. This bug is a very low-priority, unusual corner case. Let's not insult the thankless hard work of others in bug reports, please.

That wasn't my intention; sorry if it sounded like that. I really love ZFS and how it evolves. The intention of my comment was to maybe increase the priority, or at least to note that this may not be an edge case.

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.