oracle / centos2ol

Script and documentation to switch CentOS/Rocky Linux to Oracle Linux
https://linux.oracle.com/switch/centos/
Universal Permissive License v1.0

Ensure the correct EFI boot entries are created after switching from CentOS to Oracle Linux #70

Open metal4lyf opened 3 years ago

metal4lyf commented 3 years ago

We are trying to migrate CentOS 8 systems to OL8.

The conversion script reports success, but it renders our systems unbootable: After the BIOS splash, we get several >> Checking media presence ..... messages on the terminal and then the system enters Dell BIOS recovery mode, which performs a memory test and then reports "No bootable devices found! ..."

Boot params are UEFI/Legacy Boot: OFF/Secure Boot: OFF.

I've isolated this issue to OL8 grub. Using a recovery stick, if I re-enable the CentOS BaseOS repo and install the latest version of grub2*, the system will boot to login with expected entries ("Oracle Linux" etc.) in the grub menu.

We're hesitant to proceed with migrations using this workaround because it requires us to continue using a potentially unsupported version of a fundamental component, not to mention we'll have to exclude grub in our dnf config to avoid bricking on dnf upgrades.

We use a stock grub configuration as far as I know.

/etc/default/grub:

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto resume=/dev/mapper/VolGroup00-swap rd.lvm.lv=VolGroup00/root rd.lvm.lv=VolGroup00/swap rhgb quiet"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true

/boot/grub2/grubenv:

# GRUB Environment Block
kernelopts=root=/dev/mapper/VolGroup00-root ro crashkernel=auto resume=/dev/mapper/VolGroup00-swap rd.lvm.lv=VolGroup00/root rd.lvm.lv=VolGroup00/swap rhgb quiet
boot_success=1
boot_indeterminate=0
######################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################

Kernel: 4.18.0-240.15.1.el8_3.x86_64
Bricking grub2: 1:2.02-90.0.2.el8_3.1 from ol8_baseos_latest
Working grub2: <= 1:2.02-90.el8_3.1 from BaseOS

I've yet to see a useful message from grub despite removing rhgb quiet. Please let me know what other info would help here.

Djelibeybi commented 3 years ago

Does 1:2.02-90.0.1.el8 work? If so, that at least will narrow our focus to the fixes in the .0.2 release.

Djelibeybi commented 3 years ago

Also, can you tell us what type of device and controller you're using to boot?

Djelibeybi commented 3 years ago

Could you also try running grub2-install <boot device> prior to rebooting to see if that resolves the issue?

metal4lyf commented 3 years ago

Here's the boot device info. I'll try the grub install now.

$ sudo lshw -class disk
  *-disk
       description: ATA Disk
       product: ST2000DM001-1ER1
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: CC27
       serial: Z4Z703D6
       size: 1863GiB (2TB)
       capacity: 1863GiB (2TB)
       capabilities: 7200rpm gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=6 guid=9d1775b4-835a-4c76-9e43-5d544b7ec8fc logicalsectorsize=512 sectorsize=4096

aburmash commented 3 years ago

Boot params are UEFI/Legacy Boot: OFF/Secure Boot: OFF.

To be on the same page, is the system in UEFI mode or legacy mode? You can check for the presence of /sys/firmware/efi on the booted system.
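The /sys/firmware/efi check mentioned above can be scripted. This is only a minimal sketch: firmware_mode is a hypothetical helper name, and its optional argument exists purely so the check can be pointed at another directory for testing.

```shell
#!/bin/bash
# Report whether the running system was booted via UEFI or legacy BIOS.
# The kernel only exposes /sys/firmware/efi when it was started by UEFI firmware.
# firmware_mode is a hypothetical helper; the optional argument overrides the path.
firmware_mode() {
    local efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo "UEFI"
    else
        echo "legacy"
    fi
}

firmware_mode
```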

metal4lyf commented 3 years ago

UEFI

metal4lyf commented 3 years ago

I can't get grub2-install working. It complains about missing modinfo.sh. No directory under /boot contains this file so I'm not sure what to pass it. Trying 90.0.1 now.

aburmash commented 3 years ago

Yeah, forget about grub2-install. It is for legacy. Please, just before the reboot, run:

efibootmgr -v
find /boot |grep redhat
find /boot |grep centos
rpm -qa |grep shim

metal4lyf commented 3 years ago

90.0.1 doesn't boot either. Reinstalled 90.0.2. Here are the results:

efibootmgr -v:

BootCurrent: 0011
Timeout: 1 seconds
BootOrder: 0001,000C,000D,000E,000F,0010,0006,0011,0008,0009,000A,000B
Boot0000* Windows Boot Manager  HD(1,GPT,87d93515-2374-4b87-9701-5a4c527ee83b,0x800,0x145000)/File(\EFI\Microsoft\Boot\bootmgfw.efi)WINDOWS.........x...B.C.D.O.B.J.E.C.T.=.{.9.d.e.a.8.6.2.c.-.5.c.d.d.-.4.e.7.0.-.a.c.c.1.-.f.3.2.b.3.4.4.d.4.7.9.5.}...;................
Boot0001* CentOS Linux  HD(1,GPT,b7460ef9-456e-4086-95f9-7dc69e80ddaa,0x800,0x12c000)/File(\EFI\centos\shimx64.efi)
Boot0006* HDD   NVMe(0x1,01-00-00-00-00-00-00-00)/HD(1,GPT,54969a86-cdfd-4d17-a677-4063a30945af,0x800,0x12c000)
Boot0008* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter   PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(b4969130ba1c,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0009* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter   PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(b4969130ba1c,1)/IPv6([::]:<->[::]:,0,0)..BO
Boot000A* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter   PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(b4969130ba1e,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot000B* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter   PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(b4969130ba1e,1)/IPv6([::]:<->[::]:,0,0)..BO
Boot000C* Diskette Drive    BBS(Floppy,Diskette Drive,0x0)..BO
Boot000D* Internal HDD  BBS(HD,Internal HDD,0x0)..BO
Boot000E* USB Storage Device    BBS(USB,SanDisk,0x0)..BO
Boot000F* P7: HL-DT-ST DVD-ROM DH50N    BBS(CDROM,P7: HL-DT-ST DVD-ROM DH50N,0x0)..BO
Boot0010  Onboard NIC   BBS(Network,IBA CL Slot 00FE v0110,0x0)..BO
Boot0011* UEFI: SanDisk PciRoot(0x0)/Pci(0x14,0x0)/USB(7,0)/USB(1,0)/HD(1,GPT,87182ce7-da3d-414d-9ff3-3182544d7675,0x800,0x1dcf7df)..BO

find /boot | grep redhat:

/boot/efi/EFI/redhat
/boot/efi/EFI/redhat/fonts
/boot/efi/EFI/redhat/grubenv
/boot/efi/EFI/redhat/grubx64.efi

find /boot | grep centos:

/boot/efi/EFI/centos
/boot/efi/EFI/centos/shimx64-centos.efi
/boot/efi/EFI/centos/BOOTX64.CSV
/boot/efi/EFI/centos/mmx64.efi
/boot/efi/EFI/centos/grubenv
/boot/efi/EFI/centos/grub.cfg
/boot/efi/EFI/centos/shimx64.efi

rpm -qa | grep shim:

shim-x64-15-15.el8_2.x86_64

Djelibeybi commented 3 years ago

How are you running centos2ol.sh, i.e. what parameters are you using?

metal4lyf commented 3 years ago

With or without -k, doesn't seem to matter. When we don't pass -k, uek is installed but not enabled. Shim does upgrade when we downgrade to BaseOS grub. I've also verified with BaseOS grub that we can boot uek.

Djelibeybi commented 3 years ago

The shim-x64 package should be downgraded as part of the distro-sync that is run by default, i.e. after the switch you should have shim-x64-15-11 installed.

Djelibeybi commented 3 years ago

And if you don't pass -k, the UEK should be installed and enabled, again with the downgrade of shim. Something else is happening here. Can you run the switch and pipe the output to a log file so we can see the entire process? If possible, run the script with no parameters, i.e. bash centos2ol.sh | tee -a centos2ol.log
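A side note on the logging command above: a plain pipe into tee captures stdout only, and the pipeline's exit status is that of tee. The sketch below is one way to capture stderr as well and keep the script's own exit status. run_logged is a hypothetical helper name and is not part of centos2ol.

```shell
#!/bin/bash
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, appends both stdout and stderr to LOGFILE while still
# showing them on the terminal, and returns COMMAND's exit status.
# run_logged is a hypothetical helper, not something the repository ships.
run_logged() {
    local log="$1"
    shift
    "$@" 2>&1 | tee -a "$log"
    return "${PIPESTATUS[0]}"
}

# Example (run as root from the directory holding the script):
#   run_logged centos2ol.log bash centos2ol.sh
```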

aburmash commented 3 years ago

@metal4lyf So you have the CentOS shim and Oracle grub, which explains the problem. I'm pretty sure that if you 1) rpm -e shim-x64 (remove the CentOS shim), 2) yum install shim-x64 (from the Oracle repos), and then reboot, everything will automagically start working.

If NOT, you will still need to replace the CentOS shim with the Oracle shim and run efibootmgr -c -d /dev/sda -p 1 -L "Oracle Linux" -l "\EFI\redhat\shimx64.efi"

Where /dev/sda is the ESP disk and 1 is the partition number. You can run mount |grep boot and see which disk is mounted at /boot/efi to determine that. (Please note, I am writing about the ESP disk, not the boot disk.)
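Working out the -d and -p values by hand is easy to get wrong on NVMe devices, where the partition is named like nvme0n1p1 rather than sda1. A rough sketch of splitting a partition device into disk and partition number follows. split_partition is a hypothetical helper, and it only understands the sdXN, nvmeXnYpN and mmcblkXpN naming schemes; anything else (for example device-mapper paths) is rejected.

```shell
#!/bin/bash
# split_partition DEVICE
# Prints "<disk> <partition number>" for a partition device path.
# Hypothetical helper; only handles sdXN-style names and the
# nvme/mmcblk-style names that put a "p" before the partition number.
split_partition() {
    local dev="$1"
    local p_re='^(/dev/(nvme[0-9]+n[0-9]+|mmcblk[0-9]+))p([0-9]+)$'
    local sd_re='^(/dev/[a-z]+)([0-9]+)$'
    if [[ $dev =~ $p_re ]]; then
        echo "${BASH_REMATCH[1]} ${BASH_REMATCH[3]}"
    elif [[ $dev =~ $sd_re ]]; then
        echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

# Example: feed it whatever is mounted at /boot/efi, e.g.
#   read -r disk part < <(split_partition "$(findmnt -n -o SOURCE /boot/efi)")
#   efibootmgr -c -d "$disk" -p "$part" -L "Oracle Linux" -l '\EFI\redhat\shimx64.efi'
```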

EDIT: you may need to do rpm -e --nodeps shim-x64 (rpm -e does not accept --force), but careful(!): be 100% sure to install the new shim after removing the old one.

EDIT2: we still need to figure out why in your case shim was not replaced.

metal4lyf commented 3 years ago

Thanks, I'll wipe this system and stage it for another run tomorrow AM. I will update with the logs and then try your suggestions. (The reason for CentOS shim and Oracle grub is because I downgraded grub to CentOS in recovery mode after the boot failed, which switched to CentOS shim, and thereafter upgraded grub to Oracle, which did not modify shim.)

Djelibeybi commented 3 years ago

Thanks @metal4lyf -- we very much appreciate the effort here!

aburmash commented 3 years ago

@metal4lyf if the system is not booting with Oracle shim + Oracle grub2, efibootmgr will save you. We should apply a fix on our side for this anyway, so running efibootmgr is an immediate fix for you until it is addressed by the migration script.

metal4lyf commented 3 years ago

Here's the state after a fresh migration with no flags to the script. I may have lost the log but I'll find it or run again and add here.

#!/bin/bash -xv

grubby --info=ALL | grep ^kernel
+ grubby --info=ALL
+ grep '^kernel'
kernel="/boot/vmlinuz-5.4.17-2036.104.4.el8uek.x86_64"
kernel="/boot/vmlinuz-4.18.0-240.15.1.el8_3.x86_64"
kernel="/boot/vmlinuz-4.18.0-147.el8.x86_64"
kernel="/boot/vmlinuz-0-rescue-1e1b6984890346aab6d2b455f4f5af16"

grubby --default-kernel
+ grubby --default-kernel
/boot/vmlinuz-5.4.17-2036.104.4.el8uek.x86_64

efibootmgr -v
+ efibootmgr -v
BootCurrent: 0001
Timeout: 1 seconds
BootOrder: 0001,000C,000D,000E,000F,0010,0006,0011,0008,0009,000A,000B
Boot0000* Windows Boot Manager  HD(1,GPT,87d93515-2374-4b87-9701-5a4c527ee83b,0x800,0x145000)/File(\EFI\Microsoft\Boot\bootmgfw.efi)WINDOWS.........x...B.C.D.O.B.J.E.C.T.=.{.9.d.e.a.8.6.2.c.-.5.c.d.d.-.4.e.7.0.-.a.c.c.1.-.f.3.2.b.3.4.4.d.4.7.9.5.}...;................
Boot0001* CentOS Linux  HD(1,GPT,1e67f230-95c6-44d2-a9be-0f5cccc00561,0x800,0x12c000)/File(\EFI\centos\shimx64.efi)
Boot0006* HDD   NVMe(0x1,01-00-00-00-00-00-00-00)/HD(1,GPT,54969a86-cdfd-4d17-a677-4063a30945af,0x800,0x12c000)
Boot0008* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter   PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(b4969130ba1c,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0009* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter   PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x0)/MAC(b4969130ba1c,1)/IPv6([::]:<->[::]:,0,0)..BO
Boot000A* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter   PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(b4969130ba1e,1)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot000B* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter   PciRoot(0x2)/Pci(0x0,0x0)/Pci(0x0,0x1)/MAC(b4969130ba1e,1)/IPv6([::]:<->[::]:,0,0)..BO
Boot000C* Diskette Drive    BBS(Floppy,Diskette Drive,0x0)..BO
Boot000D* Internal HDD  BBS(HD,Internal HDD,0x0)..BO
Boot000E* USB Storage Device    BBS(USB,SanDisk,0x0)..BO
Boot000F* P7: HL-DT-ST DVD-ROM DH50N    BBS(CDROM,P7: HL-DT-ST DVD-ROM DH50N,0x0)..BO
Boot0010  Onboard NIC   BBS(Network,IBA CL Slot 00FE v0110,0x0)..BO
Boot0011* UEFI: SanDisk PciRoot(0x0)/Pci(0x14,0x0)/USB(7,0)/USB(1,0)/HD(1,GPT,87182ce7-da3d-414d-9ff3-3182544d7675,0x800,0x1dcf7df)..BO

find /boot | grep redhat
+ find /boot
+ grep redhat
/boot/efi/EFI/redhat
/boot/efi/EFI/redhat/fonts
/boot/efi/EFI/redhat/grubenv
/boot/efi/EFI/redhat/grubx64.efi
/boot/efi/EFI/redhat/BOOTX64.CSV
/boot/efi/EFI/redhat/mmx64.efi
/boot/efi/EFI/redhat/shimx64.efi
/boot/efi/EFI/redhat/grub.cfg

find /boot | grep centos
+ find /boot
+ grep centos
/boot/efi/EFI/centos
/boot/efi/EFI/centos/grubenv
/boot/efi/EFI/centos/grub.cfg

rpm -qa | grep shim
+ rpm -qa
+ grep shim
shim-x64-15-11.0.5.x86_64

aburmash commented 3 years ago

OK, so here is what is actually happening in your case: since you have migrated from CentOS to Oracle Linux, the CentOS EFI binaries are wiped, and the CentOS UEFI boot entry will be wiped on the next reboot. In that case, normally (on most systems) the /boot/efi/EFI/BOOT/BOOTX64.EFI binary is executed (that is the "default" boot path) and it runs fallback, which creates UEFI boot entries for Oracle Linux. It looks like that is not happening in your case.
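Before rebooting, it may be worth confirming that the files the fallback path relies on are actually on the ESP. The sketch below makes some assumptions that are not stated in this thread: that the fallback binary is EFI/BOOT/fbx64.efi, and that it reads EFI/redhat/BOOTX64.CSV to decide which entries to create. check_fallback_files is a hypothetical helper name. Note that the files being present does not guarantee the firmware will run them, which appears to be what happened here.

```shell
#!/bin/bash
# check_fallback_files [ESP_MOUNTPOINT]
# Lists the files the default-path/fallback boot is assumed to need and
# returns non-zero if any of them is missing.
# ASSUMPTION: the fbx64.efi name and the role of BOOTX64.CSV come from how
# the shim packages are commonly laid out, not from this thread.
check_fallback_files() {
    local esp="${1:-/boot/efi}"
    local missing=0 f
    for f in EFI/BOOT/BOOTX64.EFI EFI/BOOT/fbx64.efi \
             EFI/redhat/BOOTX64.CSV EFI/redhat/shimx64.efi; do
        if [ -e "$esp/$f" ]; then
            echo "present: $f"
        else
            echo "MISSING: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_fallback_files /boot/efi
```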

As an immediate measure you can run

efibootmgr -c -d /dev/sda -p 1 -L "Oracle Linux" -l "\EFI\redhat\shimx64.efi"

Where /dev/sda is the ESP disk and 1 is the partition number. You can run mount |grep boot and see which disk is mounted at /boot/efi to determine that.

This creates a boot entry for Oracle Linux before the reboot. Ping me if you are unsure what to do with efibootmgr, and I will provide more detailed instructions.

Anyway, this case ( fallback not happening ) should be covered by our migration scripts, and that efibootmgr call should happen automatically.

metal4lyf commented 3 years ago

Ran the migration again. Log here: ol8.log

Before reboot I ran efibootmgr as follows:

lsblk:

NAME                MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                   8:0    1 14.9G  0 disk 
└─sda1                8:1    1 14.9G  0 part /mnt/sd
sr0                  11:0    1 1024M  0 rom  
nvme0n1             259:0    0  1.9T  0 disk 
├─nvme0n1p1         259:1    0  600M  0 part /boot/efi
├─nvme0n1p2         259:2    0    1G  0 part /boot
└─nvme0n1p3         259:3    0  1.9T  0 part 
  ├─VolGroup00-root 253:0    0   50G  0 lvm  /
  ├─VolGroup00-swap 253:1    0  128G  0 lvm  [SWAP]
  └─VolGroup00-home 253:2    0  1.7T  0 lvm  /home

efibootmgr -c -d /dev/nvme0n1 -p 1 -L "Oracle Linux" -l "\EFI\redhat\shimx64.efi":

BootCurrent: 0001
Timeout: 1 seconds
BootOrder: 0003,0001,0002,000C,000D,000E,000F,0010,0006,0011,0008,0009,000A,000B
Boot0000* Windows Boot Manager
Boot0001* CentOS Linux
Boot0002* Oracle Linux
Boot0006* HDD
Boot0008* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter
Boot0009* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter
Boot000A* PXE IP4 Intel(R) Ethernet 10G 2P X550-t Adapter
Boot000B* PXE IP6 Intel(R) Ethernet 10G 2P X550-t Adapter
Boot000C* Diskette Drive
Boot000D* Internal HDD
Boot000E* USB Storage Device
Boot000F* P7: HL-DT-ST DVD-ROM DH50N
Boot0010  Onboard NIC
Boot0011* UEFI: SanDisk
Boot0003* Oracle Linux

I got a warning about Oracle Linux already being present as Boot0002, but this does appear to have fixed the boot!

metal4lyf commented 3 years ago

Oracle Linux does now show up twice in our UEFI boot menu. Is there a variant of the efibootmgr command that would consolidate/overwrite instead?

EDIT: I may have clobbered the boot menu on the USB drive I've been using to reinstall this system. I wonder if the presence of this disk is related to the boot manager issues too?

aburmash commented 3 years ago

Well, if the entry was already present you do not need to recreate it. Do both entries persist after reboot? If yes, we (and you) will need a simple check to only execute efibootmgr in case the Oracle Linux entry is NOT present, something like

if ! efibootmgr -v |grep -q "Oracle Linux"; then
    # execute efibootmgr -c -d blahblah
    :
fi
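The check suggested above could be fleshed out roughly as follows. This is a sketch, not the fix that went into the script: has_boot_entry and ensure_boot_entry are hypothetical helper names, the label match is anchored to the BootXXXX prefix so that a device path containing the same words does not count, and a label that merely starts with the same words followed by whitespace would still match.

```shell
#!/bin/bash
# has_boot_entry LABEL
# Reads an efibootmgr listing on stdin and succeeds if an entry with that
# label exists. The label is used inside a regular expression, so it should
# not contain regex metacharacters.
has_boot_entry() {
    local label="$1"
    grep -Eq "^Boot[0-9A-Fa-f]{4}[* ] ${label}([[:space:]]|\$)"
}

# ensure_boot_entry DISK PARTITION_NUMBER
# Creates the Oracle Linux entry only when one is not already present.
# Defined here but not called; it needs root and a UEFI system.
ensure_boot_entry() {
    local disk="$1" part="$2"
    if efibootmgr | has_boot_entry "Oracle Linux"; then
        echo "Oracle Linux entry already present; nothing to do"
    else
        efibootmgr -c -d "$disk" -p "$part" -L "Oracle Linux" -l '\EFI\redhat\shimx64.efi'
    fi
}

# Example: ensure_boot_entry /dev/nvme0n1 1
```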

The USB disk can't affect the number of entries since they are stored in NVRAM, not on any plugged-in media. However, for the same reason (NVRAM storage), UEFI boot entries will not be wiped if you reinstall the system, as long as the binaries those boot entries point to are actually present.
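Since entries live in NVRAM and only work while the binary they point to exists, one way to spot entries that are about to break is to compare the File(...) paths from efibootmgr -v against the mounted ESP. The sketch below is deliberately naive and stale_entries is a hypothetical helper name: it ignores which partition an entry refers to, so an entry that lives on a different ESP (such as the Windows Boot Manager entry in the listings above) would be reported as stale even though it is fine.

```shell
#!/bin/bash
# stale_entries ESP_MOUNTPOINT
# Reads `efibootmgr -v` output on stdin and prints the boot number of every
# entry whose File(...) path does not exist under ESP_MOUNTPOINT.
# ASSUMPTION: every File(...) entry refers to the ESP mounted at that path.
stale_entries() {
    local esp="$1" line num path
    local re='^Boot([0-9A-Fa-f]{4}).*File\(([^)]+)\)'
    while IFS= read -r line; do
        [[ $line =~ $re ]] || continue
        num="${BASH_REMATCH[1]}"
        # UEFI paths use backslashes; convert them to forward slashes.
        path="${BASH_REMATCH[2]//\\//}"
        [ -e "$esp$path" ] || echo "$num"
    done
}

# Example: efibootmgr -v | stale_entries /boot/efi
```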

Djelibeybi commented 3 years ago

Ran the migration again. Log here: ol8.log

According to this log, the switch installed our shim-x64 package as an upgrade. I also noticed that the script had to upgrade a bunch of packages to get yum-utils to install. Did you perhaps do a dnf update on the CentOS instance before switching last time? Because this run looks pretty flawless from a log perspective (and would explain the duplicate UEFI boot entries).

metal4lyf commented 3 years ago

I did not run dnf update last time. If the server has network access, our installer adds an internal application package post-install and performs a distro sync, so perhaps that explains the difference? Sometimes I unplug the network beforehand to save time. This all happens before running centos2ol.sh (with network).

I've run this many times now, with and without network on the initial install, and the result has always been the same. The logs from centos2ol.sh always look clean despite leaving the system unbootable.

Djelibeybi commented 3 years ago

@aburmash knows way more about UEFI than I do, so I'm hoping to see a pull request soon that adds a bit of efibootmgr magic to centos2ol.sh to mitigate this issue.

metal4lyf commented 3 years ago

Well, if entry was already present you do not need to recreate it. Do both entries persist after reboot ?

Looks like that was a fluke, or at any rate there is only one entry after reboot, so we're good there.

Thanks for all the help!

Djelibeybi commented 3 years ago

Just to be clear: are you now able to switch your Dell boxes to OL8 and still boot? I'm not sure if there's still an outstanding issue or not, and I wanted to check before I close this.

metal4lyf commented 3 years ago

Yes, it's working now with the efibootmgr fix. Here's what ultimately works after running centos2ol.sh:

# remove CentOS Linux (it is now unbootable); loop so that zero or several matches are handled
for num in $(efibootmgr | grep 'CentOS Linux' | sed -r 's/^Boot([0-9A-F]+).*/\1/'); do
    efibootmgr -b "$num" -B
done
# remove any Oracle Linux (if it was necessary to convert more than once, existing entries will be unbootable)
for num in $(efibootmgr | grep 'Oracle Linux' | sed -r 's/^Boot([0-9A-F]+).*/\1/'); do
    efibootmgr -b "$num" -B
done
# add new entry for Oracle Linux, using the disk and partition number of the ESP mounted at /boot/efi
disk=/dev/$(lsblk -o MOUNTPOINT,PKNAME,KNAME | grep /boot/efi | awk '{print $2}')
part=$(lsblk -o MOUNTPOINT,PKNAME,KNAME | grep /boot/efi | awk '{print $3}' | grep -o '[0-9]*$')
efibootmgr -c -d "$disk" -p "$part" -L "Oracle Linux" -l "\EFI\redhat\shimx64.efi"

Djelibeybi commented 3 years ago

Thanks. I've updated the issue title so that we can use it as a reference for any submitted pull requests.

aburmash commented 3 years ago

Referencing another issue with a similar cause: https://github.com/oracle/centos2ol/issues/73

bbbjames commented 7 months ago

Thank you team & @aburmash let's go!

Switching default boot kernel to the UEK.
Removing yum cache
Switch complete. Oracle recommends rebooting this system.

Reboot.

Dead machine (>.<)

Cannot find OS.

Bruh.

Done on CentOS 7. I have to attach some kind of boot device to access the machine before I can investigate.

Here is what I ended up doing:

First, found another boot device to get into terminal, then:

fdisk -l
mount /dev/sda1 /mnt/

cd /mnt/EFI (ls shows BOOT centos Dell redhat)

efibootmgr -c -d /dev/sda -p 1 -L "Oracle Linux" -l "\EFI\redhat\shimx64.efi"

BootCurrent: 0001
Boot0001 Oracle Linux
(also still have, among others)
Boot0000 CentOS

System now boots into Oracle Linux Server release 7.9 🙌

What should I be doing next? Thank you again 💛