openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs-dracut boot failure with out of date zpool.cache - zfs_force not working #7050

Closed wphilips closed 3 years ago

wphilips commented 6 years ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Fedora |
| Distribution Version | 26 |
| Linux Kernel | any (e.g., 4.14.6-200.fc26.x86_64) |
| Architecture | x86_64 |
| ZFS Version | v0.7.5-1 |
| SPL Version | v0.7.5-1 |

Describe the problem you're observing

I have several systems with root and boot on zfs. The systems boot with grub, and the initramfs is generated by dracut with zfs-dracut-0.7.5-1.fc26.x86_64.

The problem occurs whenever significant changes are made to the zpools attached to the system, or even when adding empty disks. Very often, dracut enters the emergency shell because it cannot import the pools based on the zpool.cache file. Even adding zfs_force as a kernel option does not work. E.g., I tried:

linux16 /boot/@/vmlinuz-4.14.13-200.fc26.x86_64 root=zfs:ssd/fc26 boot=ssd ro rd_NO_PLYMOUTH audit=0 zfs_force=1

The reason seems to be that the zpool.cache file does not reflect the current (changed) configuration of the system. It is not clear why neither zfs.force nor zfs_force works.

Here are 2 use cases:

  1. to defragment the pool on which the zfs root is installed, I attach a new disk, create a new zpool on it, copy all the data, remove the old disk, reboot and change some grub parameters so that it boots the new pool. Before the reboot, zpool.cache refers to the old pool on the old disk. Running 'dracut -f ...' will therefore copy the "old" zpool.cache into the initramfs. After boot, the disks have changed and this zpool.cache is outdated.

  2. in a system with 3 rpools, I remove one of the disks which contains a non-essential rpool (after exporting it). I then add two new empty disks. The system boots into the dracut shell even though the root pool has not changed. The now missing, but non-essential pool prevents a normal boot.
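Since both cases hinge on which zpool.cache ended up inside the initramfs, it can help to compare the embedded copy with the one on disk. The following is only a sketch: the image path (Fedora's usual naming) and the cache path are assumptions, and `same_file` and `check_initramfs_cache` are helper names made up for illustration. It only tells you whether the two copies differ, not whether either one matches the disks actually attached.

```shell
#!/bin/sh
# Assumed paths: Fedora's default initramfs name and the usual cache location.
IMG="${IMG:-/boot/initramfs-$(uname -r).img}"
CACHE="${CACHE:-/etc/zfs/zpool.cache}"

# Hypothetical helper: exit status 0 when both files have identical contents.
same_file() {
    cmp -s "$1" "$2"
}

# Hypothetical helper: extract the cache embedded in the initramfs with
# lsinitrd (shipped with dracut) and compare it with the on-disk cache.
check_initramfs_cache() {
    tmp=$(mktemp) || return 2
    lsinitrd -f etc/zfs/zpool.cache "$IMG" > "$tmp" 2>/dev/null
    if same_file "$tmp" "$CACHE"; then
        echo "initramfs cache matches $CACHE"
    else
        echo "initramfs cache differs from $CACHE (or is absent)"
    fi
    rm -f "$tmp"
}
```

Calling `check_initramfs_cache` as root after changing pools shows whether 'dracut -f' still needs to be rerun to pick up the current cache file.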

It is possible to somewhat prevent these problems by removing zpool.cache, then running dracut, and then rebooting. In this case, dracut often still enters the emergency shell, claiming that the pool(s) are in use on another system, but by force importing them in the dracut shell and rebooting it is possible to boot the system. Then it is possible to recreate zpool.cache and rerun dracut to create a working system. Alternatively, one can continue to use the initramfs that contains no zpool.cache.
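The workaround above can be sketched roughly as follows. The pool name `ssd` is taken from the kernel command line in this report; the cache path `/etc/zfs/zpool.cache` is the usual default and an assumption here, and the function names and the `DRY_RUN` switch are made up for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Assumptions: pool name from this report's command line, default cache path.
POOL="${POOL:-ssd}"
CACHE="${CACHE:-/etc/zfs/zpool.cache}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1 (before rebooting): move the stale cache away, rebuild the initramfs.
prepare_initramfs() {
    run mv "$CACHE" "$CACHE.bak"
    run dracut -f
}

# Step 2 (in the dracut emergency shell, if the import is refused):
# force the import without mounting any datasets, then reboot.
force_import() {
    run zpool import -f -N "$POOL"
}

# Step 3 (after a successful boot): recreate the cache and the initramfs.
restore_cache() {
    run zpool set cachefile="$CACHE" "$POOL"
    run dracut -f
}

# Optional dispatch, e.g. "sh workaround.sh prepare"; no argument does nothing.
case "${1:-}" in
    prepare) prepare_initramfs ;;
    import)  force_import ;;
    restore) restore_cache ;;
esac
```

Run with `DRY_RUN=0` as root only after checking the printed commands against your own pool names and paths. Step 3 can be skipped if you prefer to keep booting without a cache file in the initramfs.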

It is probably also possible to create a zpool.cache file for the future new configuration, but it probably involves deleting the current one and it is easy to make a mistake.

In any case, make a simple mistake or forget to take these "preventive" measures and you end up with a system which will always enter the dracut emergency shell with no way to recover, unless you have, e.g., a usb boot disk with zfs at hand. Even then it is really hard to recover.

In the good old days it also used to be possible to fix problems in the dracut shell and then continue to boot. These days, systemd prevents this from working (probably related to the message "transaction is destructive").

While fixing the zfs_force option would help, adding a configuration option to dracut to never include zpool.cache in the initramfs and/or adding a kernel command line option to ignore zpool.cache might also help.

PS. Even better would be to fix dracut or systemd so that a boot can continue after fixing problems in dracut. For instance, in the emergency shell you would remove the zpool.cache file and then type some command to continue boot. However, that is probably a more general (non zfsonlinux) issue.

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

stale[bot] commented 4 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.