openzfs / openzfs-docs

OpenZFS Documentation
https://openzfs.github.io/openzfs-docs/

HWE kernel causing boot failure for fresh install on Ubuntu 22.04 #388

Closed Shellcat-Zero closed 1 year ago

Shellcat-Zero commented 1 year ago

I had some disks fail in a RAID-10 system and decided to take the opportunity to rebuild the pool with the latest LTS, one stripe at a time. The old system was 18.04. I was anticipating also doing some incremental hardware updates (aside from disk replacement) so I opted for the HWE kernel. This caused the system to hang at boot (prior to a grub menu) with messages like:

error: compression algorithm 71 not supported
error: unsupported embedded BP (type=1)
error: compression algorithm 32 not supported
error: compression algorithm 103 not supported
error: compression algorithm 75 not supported
error: compression algorithm 117 not supported
error: compression algorithm inherit not supported
error: unsupported embedded BP (type=236)
error: unknown device 1

It might have booted eventually, but I never let it run for more than 5 minutes. Nothing I tried with regard to modifying the bpool features worked, including a solution I mentioned on a very similar previous issue. I also tried removing zpool_checkpoint and livelist, since they generate warnings at pool creation, but that resulted in a boot that dropped to the grub prompt. After conceding that probably nothing else could be done with bpool features, I tried the generic kernel instead of HWE, which succeeded. Now those error messages appear and keep repeating for 1-2 minutes before eventually giving me a grub menu, whereas previously the messages were printed and then everything stopped, with no further sign of boot progress. Those messages were absent in the previous 18.04 startup. It would be nice to circumvent that 1-2 minute delay if anyone has suggestions.
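For anyone retracing this, here is a minimal sketch of how one might list which feature flags are turned on for the bpool, since GRUB only understands a subset of ZFS features. The helper name `list_on_features` is hypothetical; it only filters the standard four-column output of `zpool get` (NAME, PROPERTY, VALUE, SOURCE).

```shell
# Hypothetical helper: reads `zpool get` output (NAME PROPERTY VALUE SOURCE)
# on stdin and prints each feature flag whose value is enabled or active.
list_on_features() {
    awk '$2 ~ /^feature@/ && ($3 == "enabled" || $3 == "active") { print $2, $3 }'
}

# Usage (needs a system with the pool imported):
#   zpool get all bpool | list_on_features
```

Comparing that list against the features the installed GRUB supports may help narrow down which flag, if any, is involved.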

The system uses an x58 chipset with an Intel processor, and does not support UEFI. I guess the lesson here is definitely DO NOT use HWE with older hardware, but I just wanted to mention it here because I haven't seen any other warning/documentation for this kind of issue, and troubleshooting it took an enormous amount of time.

Unfortunately, dual-boot between both pools does not work. Only the 22.04 system is bootable, but I can still mount the old pool for data-copying purposes before I eventually assemble the RAID-10 back together. The 18.04 system had been upgraded from 16.04, which did not have a bpool and therefore 18.04 still had no bpool. Running update-grub while the old pool is mounted within 22.04 results in it being added to the boot menu, but attempting to boot to 18.04 results in this boot failure, which gives me a busybox prompt and then an initramfs prompt before finally yielding a kernel panic:

mount: mounting /dev on /root/dev failed: no such file or directory
mount: mounting /run on /root/run failed: no such file or directory
run-init: current directory on the same filesystem as the root: error 0
Target filesystem doesn't have the requested /sbin/init
No init found. Try passing init= bootarg.

I tried manually mounting the old rpool to no avail, and mounting the new bpool also did not work. I'm inclined to believe this is happening because of 18.04 lacking a bpool, but I have no idea. I would like to believe that this is the only issue, because future upgrades with this process would be ideal, splitting the RAID-10 in two and maintaining two systems temporarily until it can be fully reassembled back into the RAID-10 pool (while still leveraging external backups).

One other (superfluous) thing to note for others is that the options GRUB_DEFAULT=saved and GRUB_SAVEDEFAULT=true result in a pre-boot error message, GRUB error: sparse files not supported, which I believe is caused by the fact that the bpool filesystem is read-only for grub, but this does not prevent startup from happening.
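If the message is a nuisance, one possible workaround is to stop asking GRUB to save the last-booted entry. The fragment below is only a sketch, assuming a fixed default entry is acceptable on this system; it has not been verified against this particular setup.

```shell
# /etc/default/grub (sketch; assumes a fixed default entry is acceptable)
# Boot the first menu entry instead of the last-saved one, so GRUB is not
# asked to save the selected entry to the bpool at boot time.
GRUB_DEFAULT=0
# GRUB_SAVEDEFAULT=true   # left commented out
```

Run update-grub afterwards so the change is picked up.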

I've done the 22.04 install now on 3 different systems for troubleshooting, and I would also note that the rpool cannot be exported prior to reboot, for whatever reason. Any export attempt fails with pool is busy messages, which then means that on the first startup the pool has to be imported manually to continue booting.

Shellcat-Zero commented 1 year ago

After more testing, it seems the boot problem might have had something to do with the old root dataset not being mounted when update-grub was executed. After importing the old pool with a new name, zpool import -R /old -f <big_id_number> oldpool, then mounting the root dataset, zfs mount oldpool/ROOT/ubuntu, and running update-grub, the 1-2 minute erroring prior to the grub menu has gone away. I think update-grub was previously run without issuing that dataset mount command. The last fix was that I needed to edit the old fstab file in 18.04 to reflect the new pool name, oldpool, or else the boot process failed because it could not find and mount /var/tmp and /var/log.
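The sequence above can be sketched as a small script. It only prints the commands by default; set DRY_RUN=0 and run it as root to execute them. The function names, the DRY_RUN switch, and the assumption that the old fstab entries begin with the old pool name (for example rpool/var/log) are illustrative and not taken from the actual system.

```shell
#!/bin/sh
# Dry-run by default: commands are echoed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Import the old pool under a new name, mount its root dataset,
# then regenerate the GRUB menu so the old system is detected.
regrub_with_old_pool() {
    pool_id="$1"    # numeric pool ID as listed by `zpool import` (placeholder)
    new_name="$2"   # new name for the old pool, e.g. oldpool
    run zpool import -R /old -f "$pool_id" "$new_name"
    run zfs mount "$new_name/ROOT/ubuntu"
    run update-grub
}

# Rewrite fstab entries that start with the old pool name.
# Assumes lines of the form "<pool>/<dataset> <mountpoint> zfs ...".
# A backup copy is kept next to the file with a .bak suffix.
rename_pool_in_fstab() {
    old="$1"; new="$2"; file="$3"
    sed -i.bak "s|^${old}/|${new}/|" "$file"
}

# Example (dry run; replace the ID with the one `zpool import` reports):
#   regrub_with_old_pool <big_id_number> oldpool
#   rename_pool_in_fstab rpool oldpool /old/etc/fstab
```

The fstab path /old/etc/fstab follows from the -R /old altroot used at import time; the old pool name rpool is an assumption.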

I re-attempted that fix when, after issuing update-grub without the old pool imported, I observed completely normal booting with the old pool omitted from grub. I could have sworn that I'd tried booting the HWE kernel while omitting the old pool from grub, but at this point I've tried so many things that I can't keep track of what I did. I have no idea why the HWE kernel caused such a bad hang-up, but thankfully switching to generic allowed me to proceed and eventually discover what appears to have been the likely issue (needing that explicit zfs mount prior to update-grub).

Closing as solved.