openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

MD RAID headers can cause Linux to steal disks from ZFS #634

Open ryao opened 12 years ago

ryao commented 12 years ago

I received a report from a veteran Gentoo user that MD RAID had stolen two disks from a raidz2 vdev. It seems that ZFS leaves a hole for a bootloader that an MD RAID header can occupy, causing this issue. This occurred on partitions of type 0xBF (Solaris), not 0xFD (Linux RAID autodetect).

I have found myself recommending that people run mdadm --zero-superblock ... before creating a ZFS pool to avoid this situation. It would be a good idea to modify the code to zero this region at vdev creation to preclude any possibility that an old MD RAID header will cause the md driver to steal the disks.

I am not sure if I have time to do this right now, but I figured that I would file a report so that people are aware of the issue.

behlendorf commented 12 years ago

Indeed. Yes, we should probably zero out that entire region when ZFS is given the whole disk and not just a partition. If someone wants to put this patch together, I'm happy to review it.

rlaager commented 12 years ago

Why not zero the space when using a partition? Is it not reserved then?

ryao commented 12 years ago

It should be zeroed in either situation. I imagine that we should also zero upon adding a disk to a vdev or replacing a disk in a vdev.

behlendorf commented 12 years ago

Yes, I suppose we should always just zero that region of the vdev label. Have patch, will apply. See vdev_label_init(), which is where this should be done.
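
For anyone curious what that might look like, here is a minimal userspace sketch of zeroing the unused regions at the front of each label. The helper name, the use of pwrite() on a plain file descriptor, and the hard-coded sizes are illustrative only (the real change would live in vdev_label_init() and use the ZFS I/O paths); the 256 KiB label size and the 8 KiB blank space plus 8 KiB boot block header layout are taken from the on-disk format document linked in the next comment.

```c
/*
 * Hedged sketch only: zero the 8 KiB blank space and the 8 KiB boot
 * block header at the front of each of the four vdev labels.  Label
 * offsets follow the on-disk format document (L0/L1 at the front of
 * the device, L2/L3 in the last 512 KiB).  None of these names are
 * actual ZFS code; a real patch would do this from vdev_label_init()
 * using the kernel I/O paths.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define VDEV_LABEL_SIZE   (256ULL * 1024)  /* each label is 256 KiB */
#define VDEV_HOLE_SIZE    (16ULL * 1024)   /* 8 KiB blank + 8 KiB boot hdr */

static int
zero_label_holes(int fd, uint64_t psize)   /* psize: device size in bytes */
{
	char zeros[VDEV_HOLE_SIZE];
	uint64_t offsets[4] = {
		0,                              /* label L0 */
		VDEV_LABEL_SIZE,                /* label L1 */
		psize - 2 * VDEV_LABEL_SIZE,    /* label L2 */
		psize - VDEV_LABEL_SIZE,        /* label L3 */
	};

	memset(zeros, 0, sizeof (zeros));
	for (int i = 0; i < 4; i++) {
		if (pwrite(fd, zeros, sizeof (zeros), offsets[i]) !=
		    (ssize_t)sizeof (zeros))
			return (-1);
	}
	return (0);
}
```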

ryao commented 12 years ago

It seems that we will be safe if we zero the magic number locations and leave everything else alone. I will write a patch to do this in the next few days.

For anyone interested in the technical details, here are links to the documentation that I am using:

http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/ondiskformat0822.pdf
https://raid.wiki.kernel.org/articles/r/a/i/RAID_superblock_formats_fd05.html
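
For reference while reading those documents, here is a small sketch of where the various MD superblocks live, as I read the superblock-format wiki and the kernel md driver. The function name is made up for illustration, and the offsets should be double-checked against current sources before anyone relies on them.

```c
/*
 * Hedged sketch: byte offsets of the MD RAID superblock (magic
 * 0xa92b4efc) for each metadata version, per the linked wiki and my
 * reading of the kernel md driver.  Offsets are relative to the start
 * of the device or partition that was given to md.
 */
#include <stdint.h>

#define MD_SB_MAGIC	0xa92b4efcU

static uint64_t
md_sb_offset(int metadata_version, uint64_t devsize)
{
	switch (metadata_version) {
	case 90:	/* 0.90: 64 KiB block near the end, 64 KiB aligned */
		return ((devsize & ~(uint64_t)(64 * 1024 - 1)) - 64 * 1024);
	case 10:	/* 1.0: about 8 KiB from the end, 4 KiB aligned */
		return ((devsize - 8192) & ~(uint64_t)(4096 - 1));
	case 11:	/* 1.1: at the very start of the device */
		return (0);
	case 12:	/* 1.2: 4 KiB from the start of the device */
		return (4096);
	default:
		return (UINT64_MAX);
	}
}
```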

rlaager commented 12 years ago

I would prefer to see the whole region zeroed. It'll cover unknown cases and is probably the same amount of code. Why would that be a problem?

ryao commented 12 years ago

We would need to zero the entire device to guarantee that no unknown magic numbers are left on it, and even that ignores the possibility that magic numbers could be written to these locations by normal means after pool creation. Such occurrences are incredibly improbable, but possible.

behlendorf commented 12 years ago

Zeroing the unused space in the label is the best we can do. Writing out the entire device would obviously be intolerably slow. However, we may be able to provide a cross-check that we did a reasonable thing.

The blkid utility is continually updated to recognize the majority of filesystem/partition types by reading these magic numbers. Several years ago we pushed the patches for it to correctly identify ZFS filesystems; I expect they are in all current distributions by now. There are even autoconf checks in the project to detect whether your version of blkid supports ZFS. Anyway, a good safety check would be to have blkid re-identify the device after we write our label, to ensure it identifies itself correctly.
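
A rough sketch of what that cross-check could look like using libblkid (link with -lblkid); the function name is invented for illustration and error handling is minimal.

```c
/*
 * Hedged sketch of the blkid cross-check described above: after writing
 * the label, ask libblkid what it thinks the device is and complain if
 * the answer is not "zfs_member".
 */
#include <stdio.h>
#include <string.h>
#include <blkid/blkid.h>

static int
verify_blkid_sees_zfs(const char *devpath)
{
	blkid_probe pr = blkid_new_probe_from_filename(devpath);
	const char *type = NULL;
	int ok = -1;

	if (pr == NULL)
		return (-1);
	if (blkid_do_safeprobe(pr) == 0 &&
	    blkid_probe_lookup_value(pr, "TYPE", &type, NULL) == 0)
		ok = (strcmp(type, "zfs_member") == 0) ? 0 : -1;
	if (ok != 0)
		fprintf(stderr, "%s: blkid reports '%s', not zfs_member\n",
		    devpath, type ? type : "nothing");
	blkid_free_probe(pr);
	return (ok);
}
```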

rlaager commented 12 years ago

The correct behavior is to discard the whole disk or partition at zpool create time, using ioctl(BLKDISCARD) on devices or fallocate(FALLOC_FL_PUNCH_HOLE) on files. If those return EOPNOTSUPP, ignore it. If they return any other non-zero status, that's a real error, so bail. If they return zero, that's success. On success for files, set a flag to TRUE that says you've zeroed it. On success for devices, set that flag based on ioctl(BLKDISCARDZEROES).

Then, if that flag is false, actually write zeros to the unused/reserved-for-bootloader space in the label. (This covers the cases where discard does not zero the data, and where discard is not supported at runtime or compile time.)

See lib/ext2fs/unix_io.c in e2fsprogs for a working example of this sort of thing. But just writing zeros to the label would be a good step in the right direction (and that code isn't wasted if someone writes the discard support later).
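
As a rough sketch of that logic for the block-device case (the file/fallocate path is analogous), under the assumption that zeroing a small reserved region at the front of the label is an acceptable fallback; the helper name and the 16 KiB fallback region are placeholders, not actual ZFS code.

```c
/*
 * Hedged sketch of the discard-first approach described above, for a
 * block device.  Constants come from <linux/fs.h>; the fallback region
 * and helper name are placeholders for illustration.
 */
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>          /* BLKDISCARD, BLKDISCARDZEROES, BLKGETSIZE64 */

static int
discard_or_zero_label_holes(int fd)
{
	uint64_t size, range[2];
	unsigned int discard_zeroes = 0;
	int zeroed = 0;

	if (ioctl(fd, BLKGETSIZE64, &size) != 0)
		return (-1);

	range[0] = 0;
	range[1] = size;
	if (ioctl(fd, BLKDISCARD, &range) == 0) {
		/* Discard worked; did it also guarantee zeroes? */
		if (ioctl(fd, BLKDISCARDZEROES, &discard_zeroes) == 0)
			zeroed = (discard_zeroes != 0);
	} else if (errno != EOPNOTSUPP) {
		return (-1);    /* real error, bail as suggested above */
	}

	if (!zeroed) {
		/* Fall back to writing zeros over the reserved label space. */
		char zeros[16384];
		memset(zeros, 0, sizeof (zeros));
		if (pwrite(fd, zeros, sizeof (zeros), 0) !=
		    (ssize_t)sizeof (zeros))
			return (-1);
	}
	return (0);
}
```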

ryao commented 12 years ago

Would BLKDISCARD function on regular hard disks? I was under the impression that it only applied to SSDs. Using either BLKDISCARD or FALLOC_FL_PUNCH_HOLE when giving storage to ZFS is a good idea though.

With that said, I think that we should verify that we can apply this to disks, partitions and files before writing code to do this.

rlaager commented 12 years ago

mke2fs isn't broken on spinning disks. If the device doesn't support discard, BLKDISCARD will return EOPNOTSUPP, which I mentioned. In that case, the "I've zeroed the disk" flag would be false and you fall back to actually writing zeros to the bootloader space. Obviously, really writing zeros to the whole disk isn't feasible, but my point was that discarding it is, and that this is the correct behavior.

samarium commented 12 years ago

IIRC, md metadata 0.9x is at the end of the disk, so we should overwrite there too, unless ZFS already writes a label there that is large enough to eliminate the problem.

aikudinov commented 11 years ago

The TRIM patches seem to be doing that already: https://github.com/dechamps/zfs/commit/082e2b0e16a3a77bacd71793d98511416156bac5, https://github.com/dechamps/zfs/commit/c85f097e3c4ac86b97b4f0b89ff5612df8e4cbc9

P.S. I also recently created a test pool with two old ex-RAID drives, and on the next reboot dmraid stole one of them. ZFS could only see it again after I wiped the dmraid metadata and rebooted; then the pool imported just fine.

jvsalo commented 10 years ago

This just happened to one of our production servers today... not cool.

root@backup:~# cat /proc/mdstat 
Personalities : [raid1] 
md127 : inactive sdt1[1](S)
      131006464 blocks super 1.2

md2 : inactive sdu1[6](S) sdv1[7](S) sdr1[3](S) sds1[0](S)
      335282176 blocks super 1.2

md0 : active raid1 sdc1[0] sdd1[1]
      78084096 blocks super 1.2 [2/2] [UU]

md0 there is legit, md2 and md127 are WTF.

After mdadm -S md2 && mdadm -S md127, I was able to import the pool again.

This is 0.6.2 from Debian repositories.

Now that I started thinking about it, the devices that were UNAVAIL were once part of an mdraid RAID5 array.

booi commented 10 years ago

It sounds like those drives were once part of an MD array. There's not much ZFS can do about this? You really need to wipe the drives (at least the first few megabytes and the last few megabytes) or use mdadm to remove superblock information before adding them to ZFS.
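
For completeness, here is a hedged C sketch of that "wipe the first and last few megabytes" step (in practice the mdadm command mentioned in the next comment, or dd, is simpler). The 4 MiB span and the function name are arbitrary choices for illustration.

```c
/*
 * Hedged sketch of "wipe the first and last few megabytes" for a block
 * device.  This destroys any foreign metadata (MD, dmraid, ...) at
 * either end of the disk, so only run it on a disk you intend to reuse.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* BLKGETSIZE64 */

#define WIPE_SPAN	(4ULL * 1024 * 1024)	/* 4 MiB at each end */

static int
wipe_disk_ends(int fd)
{
	static char zeros[1024 * 1024];		/* static => zero-filled */
	uint64_t size, off;

	if (ioctl(fd, BLKGETSIZE64, &size) != 0 || size < 2 * WIPE_SPAN)
		return (-1);

	for (off = 0; off < WIPE_SPAN; off += sizeof (zeros))
		if (pwrite(fd, zeros, sizeof (zeros), off) < 0)
			return (-1);
	for (off = size - WIPE_SPAN; off < size; off += sizeof (zeros))
		if (pwrite(fd, zeros, sizeof (zeros), off) < 0)
			return (-1);
	return (fsync(fd));
}
```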

tomposmiko commented 10 years ago

'mdadm --zero-superblock' can be used for that.

jvsalo commented 10 years ago

Yeah, I don't really consider it a ZFS issue any more. I used wipefs, by the way.

booi commented 10 years ago

Will wipefs remove dmraid headers at the end? Why have I never heard of this magical tool...

mailinglists35 commented 9 years ago

Excuse me, but how is this a ZFS issue? It should be closed as not a bug. It is the responsibility of the admin to ensure a whole disk is cleared of any data if he wants to use it for something else! That implies either wiping it entirely (dd/hdparm) or knowing what to delete from the previous storage solution (mdadm --zero-superblock).

behlendorf commented 9 years ago

It's not explicitly a ZFS issue, but it is something we could handle more cleanly simply by zeroing out the full ZFS label. Obviously it's a low priority, but it is also a nice small project for someone who wants to start digging into the code.

Harvie commented 6 years ago

Is it safe to do mdadm --zero-superblock on a production ZFS pool? I am afraid that it might destroy ZFS metadata. Is there a safer way to zero these headers using dd on an existing ZFS partition/disk?

rincebrain commented 6 years ago

@Harvie Depending on the md format version, probably not entirely - ISTR md metadata format 1.x stashes stuff toward the end of the disk, which is also IIRC where ZFS stashes one of the labels.

I would guess it's probably safe if ZFS hasn't already overwritten the labels with its own data, which is how you'd be in this circumstance anyway, but I'd probably scrub the pool afterward to (hopefully) ensure nothing got unexpectedly clobbered, assuming you have the redundancy for that to not be fatal.

If you don't want to be risky at all, you could wait for the TRIM work to land (or, I guess, the thick provisioning work would also do), or make a dataset with compression=off and fill it with zeroes until MD stops seeing a superblock.

Harvie commented 6 years ago

@rincebrain Is there a way to back up the metadata before zeroing the superblock with mdadm? Filling a dataset with zeroes sounds time consuming and will cause downtime. :(

rincebrain commented 6 years ago

@drescherjm My suggestion was filling the unallocated parts of the pool with zeroes, not just mdadm --zero-superblock, since I'm not certain enough of it being harmless to suggest someone use that.

Harvie commented 6 years ago

@rincebrain That is too slow and resource-heavy an operation. There must be some way to zero, e.g., a single byte using dd, just enough to damage the mdadm header so that it is no longer detected.

rlaager commented 3 years ago

@behlendorf I've boldly reopened this for your reconsideration.

I think this is a ZFS issue to some extent. Here's why...

Example scenario: The administrator has an existing MD RAID array. He wants to convert to ZFS. He removes a disk from the MD array, runs wipefs on the disk, and runs zpool create. He copies the data over, then stops the MD array, runs wipefs on the second disk, and attaches it to the pool. This seems like a sane procedure, but the result is that both disks still have valid MD superblocks, and there is a very serious risk of bad things happening (i.e. MD treats them as an array). This just happened to someone I know.

This is (at least somewhat) a ZFS issue because in a wholedisk scenario, ZFS is creating the partitions. It's specifically the fact that ZFS puts a partition at the usual spot (where an MD partition would have been) that makes this dangerous. The admin can wipefs the disk but that doesn't actually get rid of the MD superblock because it is in the partition.

I think ZFS should explicitly zero out the space used by MD superblocks (of the various formats) relative to wherever it is creating its labels. In the wholedisk scenario, ideally it would run that zeroing relative to the disk first and then (after partitioning) the partition.
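
A minimal sketch of that idea, assuming the goal is just to clear the MD magic at the known superblock locations (the same offsets as in the earlier md_sb_offset() sketch). The function name is invented, the raw magic compare assumes a little-endian host, and zpool create would conceptually call it once on the whole disk and again on the partition it creates.

```c
/*
 * Hedged sketch: probe the known MD superblock locations on an open
 * device or partition and zero the magic word if it is found.
 * Returns the number of magics cleared, or -1 on error.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* BLKGETSIZE64 */

#define MD_SB_MAGIC	0xa92b4efcU	/* little-endian on disk */

static int
clear_md_magic(int fd)
{
	uint64_t size, off[4];
	uint32_t magic, zero = 0;
	int cleared = 0;

	if (ioctl(fd, BLKGETSIZE64, &size) != 0)
		return (-1);

	off[0] = 0;                                     /* v1.1 */
	off[1] = 4096;                                  /* v1.2 */
	off[2] = (size - 8192) & ~(uint64_t)4095;       /* v1.0 */
	off[3] = (size & ~(uint64_t)65535) - 65536;     /* v0.90 */

	for (int i = 0; i < 4; i++) {
		if (pread(fd, &magic, sizeof (magic), off[i]) !=
		    (ssize_t)sizeof (magic))
			return (-1);
		if (magic != MD_SB_MAGIC)
			continue;
		if (pwrite(fd, &zero, sizeof (zero), off[i]) !=
		    (ssize_t)sizeof (zero))
			return (-1);
		cleared++;
	}
	return (cleared);
}
```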

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

Harvie commented 2 years ago

I think ZFS should explicitly zero out the space used by MD superblocks

Sounds reasonable

behlendorf commented 2 years ago

Given that we've told ZFS it's allowed to use that space this sounds pretty reasonable to me. If someone wants to propose a patch and open a PR that would be welcome.

teslazap commented 2 years ago

This is exactly what happened to me: two recycled disks entirely allocated to ZFS got "stolen" by mdadm. It was sneaky because it happened after several weeks and reboots. mdadm wasn't installed when the two disks were first attached; it was installed only recently, and the whole thing happened after the first reboot with mdadm present. I stopped the mdadm array, removed the mdadm package, and the pool reappeared intact (phew!). I agree with several comments above: this shouldn't happen.