rocky-linux / rocky-tools

Rocky Linux does not start after migrate2linux (sysroot on software RAID 1) #163

Open andreabravetti opened 2 years ago

andreabravetti commented 2 years ago

I recently migrated a CentOS 8 server with software RAID 1.

The migrate2rocky script worked flawlessly, but after the reboot the server failed to start: the switch-root service failed and the system dropped into the recovery console.

I managed to start the server with this:

mdadm --assemble --scan
mount /dev/md3 /sysroot
logout

After this the server starts normally and everything works.

I have no proof, but it seems to me that md devices are now named differently than on CentOS, and I can't understand why.

I'm going to try to reproduce this problem on a VM to collect more details.

pajamian commented 2 years ago

Let me know what you find out. This will likely at least be a candidate for a note in the "known issues" section of the README file, that is if it's not feasible to fix it directly in the script itself.

komitov commented 2 years ago

This should fix this in the script: https://github.com/rocky-linux/rocky-tools/pull/162

andreabravetti commented 2 years ago

I failed to reproduce the problem on a test VM.

The problem is still present on one server I have, where I need to assemble and mount the RAID at every boot, but on the new test host everything works properly and the system boots without errors.

On the old production server I have:

Feb 20 20:21:39 sun systemd[1]: Starting Switch Root...
Feb 20 20:21:39 sun systemctl[1256]: Failed to switch root: Specified switch root path '/sysroot' does not seem to be an OS tree. os-release file is missing.
Feb 20 20:21:39 sun systemd[1]: initrd-switch-root.service: Main process exited, code=exited, status=1/FAILURE
Feb 20 20:21:39 sun systemd[1]: initrd-switch-root.service: Failed with result 'exit-code'.
Feb 20 20:21:39 sun systemd[1]: Failed to start Switch Root.
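The "os-release file is missing" message suggests that nothing usable was mounted on /sysroot when the switch was attempted. A minimal sketch of a similar check that could be run from the recovery console follows. The helper name looks_like_os_tree is hypothetical, and the test only approximates what systemd does by looking for etc/os-release or usr/lib/os-release under the given directory.

```shell
#!/bin/sh
# Sketch only: looks_like_os_tree is a hypothetical helper, not a systemd or
# dracut command. It approximates the "is this an OS tree?" check by looking
# for an os-release file in the two usual locations.
looks_like_os_tree() {
    root=$1
    # test -e follows symlinks, so a dangling os-release symlink counts as missing.
    [ -e "$root/etc/os-release" ] || [ -e "$root/usr/lib/os-release" ]
}

# Check the directory given as first argument, defaulting to /sysroot.
target=${1:-/sysroot}
if looks_like_os_tree "$target"; then
    echo "$target: os-release found"
else
    echo "$target: no os-release, the root filesystem is probably not mounted"
fi
```

If the check fails in the recovery console, the root array has most likely not been assembled or mounted yet.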

On the console I see no running md devices at all.
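One way to confirm that no md device is running is to look at /proc/mdstat. The sketch below lists the arrays the kernel reports as active. The function name list_active_md is made up for illustration, and the parsing assumes the usual "mdN : active ..." line layout.

```shell
#!/bin/sh
# Sketch only: list_active_md is a hypothetical helper.
# It prints the name of every array marked "active" in an mdstat-style file
# (default: /proc/mdstat). Lines look like: "md3 : active raid1 sdb3[1] sda3[0]".
list_active_md() {
    awk '$1 ~ /^md/ && $2 == ":" && $3 == "active" { print $1 }' "${1:-/proc/mdstat}"
}

if [ -r /proc/mdstat ]; then
    arrays=$(list_active_md)
    if [ -n "$arrays" ]; then
        # Left unquoted on purpose so several names print on one line.
        echo "active md arrays:" $arrays
    else
        echo "no active md arrays"
    fi
fi
```

An empty result in the recovery console would match what is described here: the arrays exist on disk but were never assembled.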

Steps to (try to) reproduce:

Before installing CentOS 8 you need a VM with two disks, on which you create the RAID before the install.

Start the installer, switch to a shell with Ctrl+Alt+F2, and run fdisk and mdadm:

For simplicity, make an identical layout on the two disks:

Device     Boot   Start      End  Sectors Size Id Type
/dev/sda1  *       2048  2099199  2097152   1G 83 Linux
/dev/sda2       2099200  6293503  4194304   2G 82 Linux swap / Solaris
/dev/sda3       6293504 41943039 35649536  17G fd Linux raid autodetect

Then create the raid:

mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3

Reboot, otherwise the installer may see the old disk layout or the wrong /dev/md/3 size.

Restart the installer.

Choose a custom installation destination and use the existing /dev/md/3 for root (it shows up as "Unknown/Unknown/3"); choose reformat, ext4 and /. For boot and swap you can use the standalone partitions created earlier.

Install CentOS on top of it, with root on /dev/md/3 (note that on CentOS it will be called "/dev/md/3").

Now you have a working CentOS with root on raid:

[user@cotto2 ~]$ mount |grep md3
/dev/md3 on / type ext4 (rw,relatime,seclabel)

Let's migrate to Rocky:

curl https://raw.githubusercontent.com/rocky-linux/rocky-tools/main/migrate2rocky/migrate2rocky.sh -o migrate2rocky.sh
chmod u+x migrate2rocky.sh
sudo ./migrate2rocky.sh -r

After some time:

Done, please reboot your system.

This is almost the same thing I did some years ago on the production server, except that it has many more partitions.

andreabravetti commented 2 years ago

This should fix this in the script: #162

Great, but it is still not merged. I don't understand why I'm failing to reproduce it.

komitov commented 2 years ago

Could you please try it to check if it works for you? The only difference is that it runs "grub2-mkconfig -o /boot/grub2/grub.cfg" and this should fix the boot issue.

andreabravetti commented 2 years ago

I just noticed that on the old server, installed at the time of CentOS 8.0, I have this:

admin@sun:~$ grep rd /etc/default/grub
GRUB_CMDLINE_LINUX="biosdevname=0 crashkernel=auto nomodeset rd.auto=1 consoleblank=0"

Can rd.auto=1 be the cause of the problem?

I don't remember adding this option manually years ago.

On the new VM just installed with CentOS 8.5 and then migrated to Rocky I have rd.md.uuid=2421ed30:2315ef0b:861800b6:f99953d0 instead.
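To compare the two hosts more easily, the dracut-related options can be pulled out of the kernel command line. The following is a rough sketch, not a diagnosis. The helper names rd_options and has_md_uuid are invented for this example, and the only inputs assumed are the GRUB_CMDLINE_LINUX values quoted in this comment.

```shell
#!/bin/sh
# Sketch only: rd_options and has_md_uuid are hypothetical helpers.

# rd_options CMDLINE: print each "rd.*" option of a kernel command line,
# one per line. Returns non-zero if there is none.
rd_options() {
    printf '%s\n' "$1" | tr -s ' ' '\n' | grep '^rd\.'
}

# has_md_uuid CMDLINE: succeed if the command line names an md array by UUID.
has_md_uuid() {
    rd_options "$1" | grep -q '^rd\.md\.uuid='
}

# Inspect the local GRUB defaults, if present. The file is sourced in a
# subshell so that its variables do not leak into this script.
if [ -r /etc/default/grub ]; then
    cmdline=$(. /etc/default/grub && printf '%s' "${GRUB_CMDLINE_LINUX:-}")
    rd_options "$cmdline" || echo "no rd.* options set"
    has_md_uuid "$cmdline" || echo "no rd.md.uuid= entry on the command line"
fi
```

On the old server this would print only rd.auto=1, while on the new VM it would show the rd.md.uuid= entry, which makes the difference between the two setups explicit.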

andreabravetti commented 2 years ago

Could you please try it to check if it works for you? The only difference is that it runs "grub2-mkconfig -o /boot/grub2/grub.cfg" and this should fix the boot issue.

Yes, I'm going to try asap, but not now.

Maybe late in the weekend (it's a production server).

andreabravetti commented 2 years ago

Could you please try it to check if it works for you? The only difference is that it runs "grub2-mkconfig -o /boot/grub2/grub.cfg" and this should fix the boot issue.

I can confirm it fixed my boot issue.

Now the server boots normally.

Thank you!

sstonemen commented 2 years ago

We run more than 50 servers with AMD EPYC CPUs. All of them use software RAID and CentOS 8.5, and in every migration test the servers crashed or hung after the reboot. Our solution was a fresh install. Your hint to run grub2-mkconfig after the migrate2rocky.sh script solved our big problem. Thank you.

grub2-mkconfig -o /boot/grub2/grub.cfg
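For anyone scripting this across many hosts, here is a cautious sketch of a wrapper. The path /boot/grub2/grub.cfg is the one used in this thread. The UEFI path /boot/efi/EFI/rocky/grub.cfg is an assumption about Rocky Linux 8 on UEFI systems and should be verified on the target host. The function name grub_cfg_path is hypothetical, and the script only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: grub_cfg_path is a hypothetical helper.
# It prints the grub.cfg location to regenerate, optionally below ROOT.
# Assumption: UEFI installs keep it under /boot/efi/EFI/rocky,
# BIOS installs under /boot/grub2 (the path used in this thread).
grub_cfg_path() {
    root=${1:-}
    if [ -d "$root/sys/firmware/efi" ]; then
        printf '%s\n' "$root/boot/efi/EFI/rocky/grub.cfg"
    else
        printf '%s\n' "$root/boot/grub2/grub.cfg"
    fi
}

cfg=$(grub_cfg_path)
echo "Would run: grub2-mkconfig -o $cfg"
# To apply, run as root, keeping a backup of the old file first:
# cp -p "$cfg" "$cfg.bak" && grub2-mkconfig -o "$cfg"
```

Printing the command first makes it easier to review the chosen path on each server before anything is overwritten.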

komitov commented 2 years ago

This pull request fixes exactly this issue https://github.com/rocky-linux/rocky-tools/pull/162