kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

ship udev rules for btrfs raid #264


davide125 commented 4 years ago

systemd currently ships a basic udev rule for btrfs. One outstanding issue is that if an array is degraded, the filesystem will not be mounted and boot can fail (e.g. if / is on a RAID1 that loses a device). We should probably ship a dedicated rule in btrfs-progs to handle this gracefully, akin to what mdadm does, since systemd upstream has indicated they're not interested in maintaining RAID policy logic there.
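For reference, the gist of systemd's current rule (64-btrfs.rules; abridged and paraphrased from memory, check the systemd tree for the authoritative version) is to hold a btrfs device back until the kernel reports the filesystem complete:

```
# 64-btrfs.rules (abridged): ask the kernel whether all member devices of
# this filesystem have been seen, and keep it "not ready" for systemd until then
SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

IMPORT{builtin}="btrfs ready $devnode"
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"
```

With a device missing, ID_BTRFS_READY stays 0, so systemd never considers the filesystem mountable and the boot hangs waiting for it.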

cmurf commented 4 years ago

This email raises a good point: at least for the desktop case, some integration might be necessary to give the user a proper notification. https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/4HGKNDKW3VXWO3JX2KB3SVB6WY7HPAVQ/

cmurf commented 4 years ago

Related enhancement: https://gitlab.gnome.org/GNOME/gnome-shell/-/issues/2938

kdave commented 4 years ago

Is there a udev rule for the mdadm fallback? Allowing automatic degraded raid1 mounts on btrfs could be achieved by adding 'degraded' to fstab. IIRC writes won't be possible, but I'd have to check; there were some recent changes relaxing the constraints.
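For illustration, a minimal sketch of such an fstab entry (the UUID is a placeholder; whether writes are then allowed depends on the kernel version, as noted above):

```
# allow this raid1 filesystem to mount even with a member device missing
UUID=<fs-uuid>  /  btrfs  defaults,degraded  0  0
```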

Regarding automation in the non-standard scenarios, we're taking a more cautious approach (failing the mount if a serious condition is found) while providing ways to opt in (like the degraded mount). From what I've read so far, I think there's no single solution that satisfies all types of users - e.g. "degraded mount is fine" vs. "how could you let the mount continue".

IIRC mdadm allows read-write for a degraded raid1, and I know people who use that as a cheap backup "solution", i.e. use one disk and once in a while sync with the other. This works but is discouraged and cannot be considered a raid1. And I'd rather not allow that for btrfs. Mirrored disks that are out of sync may get even more, and irreparably, out of sync if each gets mounted degraded and receives writes the other device lacks. This could easily happen with e.g. a flaky SATA controller that shows different devices after boot; an automatic degraded mount would happily take what it sees, and then what.

The points in the mail are IMO in line with my understanding, but otherwise I agree that user notification about degraded mounts or other device problems is a good thing.

I still consider booting from raid1 a valid use case, but with limitations, so the best we can do is document it with good/bad practices and write some tests that simulate the degraded conditions.

danobi commented 4 years ago

EDIT: page lagged and didn't see kdave's comment before posting

Does this have to be implemented as a udev rule? What if anaconda added degraded to the mount options? We could then ship a systemd service (or GNOME Shell integration) that periodically scans for degraded mounts and warns users (via notify-send or equivalent).
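Something like this minimal sketch could do the periodic check (check-degraded.sh is a hypothetical name; a real deployment would pair it with a systemd timer):

```sh
#!/bin/sh
# Scan currently mounted btrfs filesystems for the "degraded" mount option
# and raise a desktop notification for each one found.
awk '$3 == "btrfs" && $4 ~ /(^|,)degraded(,|$)/ { print $1, $2 }' /proc/self/mounts |
while read -r dev mnt; do
    notify-send -u critical "btrfs degraded" "$dev is mounted degraded at $mnt"
done
```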

davide125 commented 4 years ago

It doesn't have to be a udev rule; that's just how mdadm does it. But if we can do this in btrfs-progs, even better, as that would cover all distros. Adding a systemd service for notification is a great idea.

josefbacik commented 4 years ago

This is a larger issue than btrfs, because we're talking about automatically putting users into a less safe mode, without a way to indicate to them that something's gone wrong.

That's a lot of machinery that doesn't exist yet. There's no clean way for systemd to say "Hey, things are fucked, want us to try something else?". It also cannot mount read-only, because it doesn't support read-only booting (though it will soon).

I think providing a udev rule for distros that want this behavior is fine, because we know what the right magic for btrfs is. The distro can then decide how it's going to warn the user and choose to ship the udev rule if they wish.
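Roughly, the opt-in rule could be as small as overriding the readiness gate that systemd's 64-btrfs.rules sets; a hypothetical sketch (the file name is illustrative, and the filesystem would still need degraded in its mount options to actually mount):

```
# 65-btrfs-force-ready.rules (hypothetical): tell systemd to proceed even
# when the kernel reports the btrfs filesystem as incomplete
SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="btrfs", ENV{ID_BTRFS_READY}=="0", \
  ENV{SYSTEMD_READY}="1"
```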

cmurf commented 4 years ago

Bit of a tangent, but the mdadm case looks like it involves both udev and dracut. I see a ~2 minute loop in the dracut initqueue waiting for all devices to show up. I think this is it: https://github.com/dracutdevs/dracut/blob/master/modules.d/98dracut-systemd/dracut-initqueue.sh And after 2 minutes it ends up here: https://github.com/dracutdevs/dracut/blob/master/modules.d/90mdraid/mdraid_start.sh
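An equivalent fallback for btrfs might be a dracut hook that retries with degraded once the initqueue wait has given up; a rough sketch ($NEWROOT and warn are dracut conventions, $rootdev is an illustrative placeholder for the resolved root device):

```sh
#!/bin/sh
# hypothetical pre-mount hook: try a normal mount first, and only fall back
# to -o degraded after the device wait has timed out
if ! mount -t btrfs "$rootdev" "$NEWROOT" 2>/dev/null; then
    warn "btrfs: devices missing, retrying $rootdev with -o degraded"
    mount -t btrfs -o degraded "$rootdev" "$NEWROOT"
fi
```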

I don't know whether udev can do a wait on its own, or whether it can inject the degraded mount option at the end of the wait. I'm skeptical of having degraded persistently set in fstab - it means every normal boot logs:

[    6.306313] BTRFS info (device vda3): allowing degraded mounts

And the mount command shows:

/dev/vda3 on /home type btrfs (rw,relatime,seclabel,degraded,compress=zstd:1,space_cache,subvolid=256,subvol=/home)

Also, there's no way to track a prior degraded mount: if a device was merely missing and later gets re-added, Btrfs has no proper way to handle it. The user must run a scrub to resync the member devices, or the volume is at risk.
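For reference, the manual resync after the device comes back is an ordinary scrub:

```sh
# rebuild stale copies from the good mirror; -B waits for completion,
# -d prints per-device statistics
btrfs scrub start -Bd /mountpoint
```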

cmurf commented 3 years ago

Does udev support a timeout? Right now this udev rule results in an indefinite hang. It would still be suboptimal, but much, much better, if the user got dropped to a dracut prompt after a reasonable amount of time.