oracle / oracle-linux

Scripts, examples, and tutorials to get started with Oracle Linux
Universal Permissive License v1.0

Latest GRUB update breaks booting #148

Open robertm98 opened 3 months ago

robertm98 commented 3 months ago

This is a different bug compared to what is described in https://github.com/oracle/oracle-linux/issues/147

When the latest updates are applied and the server is then rebooted, GRUB will not start and appears to be stuck in a busy loop displaying the following message: "error: ../../grub-core/commands/efi/tpm.c:150:unknown TPM error"

Secure Boot is disabled, and there were no previous problems.

Steps to reproduce:

1. Download and install OL 9.4 x86_64. The first boot is OK.
2. Apply updates.
3. Reboot. GRUB will then fail to load with the above error message.

As a cross check a fresh install was done and grub updates were excluded with exclude=grub* in the /etc/dnf/dnf.conf file.

The non-grub updates were installed and the server rebooted OK.
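
The cross-check above can be scripted. As a minimal sketch, assuming the exclusion is written as a plain `exclude=` (or `excludepkgs=`) line, the hypothetical helper `grub_excluded` (not part of dnf) reports whether a dnf configuration file is holding back grub packages:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of dnf: succeed if the given dnf config
# file has an exclude= (or excludepkgs=) line that mentions grub.
grub_excluded() {
    local conf="${1:-/etc/dnf/dnf.conf}"
    grep -Eq '^[[:space:]]*exclude(pkgs)?[[:space:]]*=.*grub' "$conf"
}

# Example use on a real system (left commented out on purpose):
# grub_excluded /etc/dnf/dnf.conf && echo "grub updates are held back"
```

For a one-off run, `dnf upgrade --exclude='grub*'` should have the same effect without editing the file.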

aburmash commented 3 months ago

Hello! Thanks for the report. In fact, the last update issued for the linked issue has zero code changes, though it MIGHT have regenerated a grub config for you; maybe that is triggering the issue. Are you seeing any other errors besides the unknown TPM error? Are you using a BTRFS filesystem and/or BTRFS snapshots?

aburmash commented 3 months ago

Never mind, reproduced it. We are going to pull this update and issue a proper one shortly.

robertm98 commented 3 months ago

Thank you. For info, the filesystem is XFS. A minor change is the name of the LVM volume group, from "ol" to "olb", so as not to clash with the volume group name of the previous installation on the original drive when I copy files across. I wondered if this could be relevant due to the questions about the filesystem, but from your last reply probably not. The installation is on a separate SATA drive and all other drives are disconnected.

aburmash commented 3 months ago

@robertm98 once again, thank you very much! I see that it is not related to filesystems, just a broken grub config.

m45733r commented 3 months ago

Same issue here. Is there any way to fix a broken grub / grub.cfg from within the UEFI interactive shell?

robertm98 commented 3 months ago

The only way I think this could be repaired is to do a recovery boot from the installation media, chroot to /mnt/sysroot (I think), then possibly use dnf to do a rollback or edit the config. @aburmash Would it be possible to get the details of the errors in the config and what needs to be done to make things good, please? What needs editing, and what then needs to be run to apply the config changes?

aburmash commented 3 months ago

@robertm98 @m45733r i will provide recovery instructions from UEFI shell shortly.

aburmash commented 3 months ago

@m45733r
1) If you have already installed the bad update but did not reboot, run:
grub2-mkconfig > /boot/grub2/grub.cfg
OR
grub2-mkconfig > /boot/efi/EFI/redhat/grub.cfg
2) If you can only work from the UEFI shell:

      FS0: Alias(s):HD0a1b:;BLK1:
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)/HD(1,GPT,3AF7074E-C0BB-400D-8FC7-E9EC738AA53F,0x800,0x32000)
     BLK0: Alias(s):
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)
     BLK2: Alias(s):
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)/HD(2,GPT,14BE7023-6C02-4573-8891-9F639B9D936A,0x32800,0x400000)
     BLK3: Alias(s):
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)/HD(3,GPT,E700F071-90A5-40BB-8132-52AF688193B7,0x432800,0x5900800)
fs0:
ls

If you see the EFI directory, you are where you need to be.

cd EFI/redhat
rm grub.cfg
grubx64.efi

You will be dropped to the GRUB command line. Run ls to display the list of available disks. There you need to find the disk that has the /boot directory, or identify the /boot partition. Run ls <disk>/ to see which one it is, for example: ls (hd0,gpt2)/
When you have found /boot you will see something like:

grub> ls (hd0,gpt2)/
./ ../ efi/ grub2/ loader/ vmlinuz-5.14.0-427.16.1.el9_4.x86_64
System.map-5.14.0-427.16.1.el9_4.x86_64 config-5.14.0-427.16.1.el9_4.x86_64
.vmlinuz-5.14.0-427.16.1.el9_4.x86_64.hmac
symvers-5.14.0-427.16.1.el9_4.x86_64.gz
initramfs-5.14.0-427.16.1.el9_4.x86_64.img
vmlinuz-5.15.0-206.153.7.el9uek.x86_64
System.map-5.15.0-206.153.7.el9uek.x86_64 config-5.15.0-206.153.7.el9uek.x86_64
.vmlinuz-5.15.0-206.153.7.el9uek.x86_64.hmac
symvers-5.15.0-206.153.7.el9uek.x86_64.gz
initramfs-5.15.0-206.153.7.el9uek.x86_64.img
initramfs-0-rescue-36703c3cdc50ff74e863e867384f6a8a.img
vmlinuz-0-rescue-36703c3cdc50ff74e863e867384f6a8a
initramfs-5.15.0-206.153.7.el9uek.x86_64kdump.img 

Now you need to check the boot info for your kernel: ls (hd0,gpt2)/loader/entries/

grub> ls (hd0,gpt2)/loader/entries/
./ ../ 8c622b7d13354f7fbe5eee50d3f340bd-5.14.0-427.16.1.el9_4.x86_64.conf
8c622b7d13354f7fbe5eee50d3f340bd-5.15.0-206.153.7.el9uek.x86_64.conf
36703c3cdc50ff74e863e867384f6a8a-0-rescue.conf

cat (hd0,gpt2)/loader/entries/8c622b7d13354f7fbe5eee50d3f340bd-5.15.0-206.153.7.el9uek.x86_64.conf

You will see something like:

title Oracle Linux Server (5.15.0-206.153.7.el9uek.x86_64 with Unbreakable Enterprise Kernel) 9.4
version 5.15.0-206.153.7.el9uek.x86_64
linux /vmlinuz-5.15.0-206.153.7.el9uek.x86_64
initrd /initramfs-5.15.0-206.153.7.el9uek.x86_64.img $tuned_initrd
options root=/dev/mapper/ocivolume-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M LANG=en_US.UTF-8 console=tty0 console=ttyS0,115200 rd.luks=0 rd.md=0 rd.dm=0 rd.lvm.vg=ocivolume rd.lvm.lv=ocivolume/root rd.net.timeout.dhcp=10 rd.net.timeout.carrier=5 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi rd.iscsi.param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 crash_kexec_post_notifiers
grub_users $grub_users
grub_arg --unrestricted
grub_class ol

Now still in grub cmdline run:

linux (hd0,gpt2)/vmlinuz-5.15.0-206.153.7.el9uek.x86_64 root=/dev/mapper/ocivolume-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M LANG=en_US.UTF-8 console=tty0 console=ttyS0,115200 rd.luks=0 rd.md=0 rd.dm=0 rd.lvm.vg=ocivolume rd.lvm.lv=ocivolume/root rd.net.timeout.dhcp=10 rd.net.timeout.carrier=5 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi rd.iscsi.param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 crash_kexec_post_notifiers
initrd (hd0,gpt2)/initramfs-5.15.0-206.153.7.el9uek.x86_64.img
boot

where
kernel = kernel from the config
options for kernel = options from the config
initrd = initrd from the config

IMPORTANT: when copying and pasting, VERIFY that the linux line is a single line. If there are newlines or carriage returns in the buffer, the line will NOT be applied correctly. So once you have the full linux line copied, paste it into a file to verify that it is a single line.

Do not forget that paths are relative to the partition that holds /boot (or to the /boot partition itself). If /boot is on the root partition, you will need to find the disk with the root partition, and your paths will look something like (lvm/volume-root)/boot/

When the system is booted, run:
grub2-mkconfig > /boot/grub2/grub.cfg
grub2-mkconfig > /boot/efi/EFI/redhat/grub.cfg
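
Since the instructions above stress that the linux line must be a single line, it may help to generate the three GRUB commands from the entry file instead of copying them by hand. The sketch below is only an illustration: `bls_to_grub_cmds` is a hypothetical helper, it assumes you have a readable copy of the entry `.conf` on a working machine (it cannot run inside the GRUB shell), and the device string such as `(hd0,gpt2)` is something you still have to find yourself with `ls`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the GRUB command-line entries for one
# loader entry file, each guaranteed to be on a single line.
#   $1 = path to a copy of the entry .conf file
#   $2 = GRUB device that holds /boot, e.g. "(hd0,gpt2)"
bls_to_grub_cmds() {
    local entry="$1" dev="$2"
    awk -v dev="$dev" '
        $1 == "linux"   { kernel = $2 }
        $1 == "initrd"  { initrd = $2 }   # drops a trailing $tuned_initrd
        $1 == "options" { sub(/^options[ \t]+/, ""); opts = $0 }
        END {
            printf "linux %s%s %s\n", dev, kernel, opts
            printf "initrd %s%s\n", dev, initrd
            print "boot"
        }
    ' "$entry"
}

# Example (commented out; the file name is a placeholder):
# bls_to_grub_cmds ./my-entry.conf "(hd0,gpt2)"
```

The output can then be typed or pasted into the GRUB command line one line at a time.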

aburmash commented 3 months ago

@robertm98 the problem is that on OL9 the config file for grub2 was switched to a parent config in /boot/efi/EFI/redhat/grub.cfg, which in turn loads the proper /boot/grub2/grub.cfg config.

For CERTAIN contents of /boot/efi/EFI/redhat/grub.cfg, the fix that was applied for leapp in-place upgrades, instead of correctly updating the configs (or not touching them), writes /boot/efi/EFI/redhat/grub.cfg into /boot/grub2/grub.cfg, and the system chainloads in a loop.
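
To illustrate the failure mode described above: if /boot/grub2/grub.cfg ends up holding the parent (stub) config, it only hands over to another config file and never defines any boot entries. The following is a rough heuristic, not an official check. `looks_like_stub` is a hypothetical name, and the sketch assumes that a stub contains a `configfile` line while a full config generated by grub2-mkconfig contains `blscfg` or `menuentry`.

```shell
#!/usr/bin/env bash
# Hypothetical heuristic: succeed if the file looks like a stub that only
# chains to another config (has "configfile", but no boot entries).
looks_like_stub() {
    local cfg="$1"
    grep -Eq '^[[:space:]]*configfile[[:space:]]' "$cfg" &&
        ! grep -Eq 'blscfg|menuentry' "$cfg"
}

# Example (commented out): a stub in this location would match the
# loop described above.
# looks_like_stub /boot/grub2/grub.cfg && echo "grub.cfg looks like a stub"
```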

m45733r commented 3 months ago

Thanks for the instructions. Some remarks from my experience:
Running grubx64.efi after grub.cfg was deleted did not automatically put me into the GRUB command line; it got stuck and I needed to power-cycle the machine.
ls (hd0,gpt1) only shows "Filesystem is fat" or "Filesystem is xfs", not the actual contents. However, ls (hd0,gpt2)/loader/entries would only succeed on the right disk and list its contents, and show "not found" on all others.

boot was successful, but after login + grub2-mkconfig + reboot it would return to grub cmdline again :/ Reading your latest comment I tried mkconfig to /boot/efi/EFI/redhat/grub.cfg and it seems to work now!

aburmash commented 3 months ago

> ls (hd0,gpt1)

Yeah, you need a slash at the end to display the contents: ls (hd0,gpt1)/

> boot was successful, but after login + grub2-mkconfig + reboot it would return to grub cmdline again :/

Oh, yes! That is because /boot/efi/EFI/redhat/grub.cfg was removed from the UEFI shell during recovery. I've updated my post to reflect that.

robertm98 commented 3 months ago

Thank you.

m45733r commented 3 months ago

I'm not sure if this is related to the original issue, but the only thing that is a bit weird now is that grubby shows:

[root@ol9-machine ~]# grubby --default-kernel
/boot/vmlinuz-5.15.0-207.156.6.el9uek.x86_64
[root@ol9-machine ~]# grubby --default-index
3
[root@ol9-machine ~]# grubby --info DEFAULT
index=3
kernel="/boot/vmlinuz-5.15.0-207.156.6.el9uek.x86_64"
args="ro rd.lvm.lv=ol/root rhgb quiet crashkernel=1G-64G:448M,64G-:512M $tuned_params"
root="/dev/mapper/ol-root"
initrd="/boot/initramfs-5.15.0-207.156.6.el9uek.x86_64.img $tuned_initrd"
title="Oracle Linux Server (5.15.0-207.156.6.el9uek.x86_64 with Unbreakable Enterprise Kernel) 9.4"
id="bda9a182a36740ada28baaa218d5c09d-5.15.0-207.156.6.el9uek.x86_64"

And yet, when I reboot, it automatically selects index 0, with a kernel that is no longer present in /boot. So the system is usable but wouldn't survive an automated reboot. See the attached screenshot.

[root@ol9-machine ~]# uname -r
5.15.0-207.156.6.el9uek.x86_64
[root@ol9-machine ~]# dnf list installed | grep kernel
kernel.x86_64                         5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-core.x86_64                    5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-modules.x86_64                 5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-modules-core.x86_64            5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-tools.x86_64                   5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-tools-libs.x86_64              5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-uek.x86_64                     5.15.0-207.156.6.el9uek             @ol9_UEKR7
kernel-uek-core.x86_64                5.15.0-207.156.6.el9uek             @ol9_UEKR7
kernel-uek-modules.x86_64             5.15.0-207.156.6.el9uek             @ol9_UEKR7

Any help appreciated.

[screenshot attachment]

aburmash commented 3 months ago

Can you please show the output of:

for x in $(find /boot |grep grubenv); do echo $x; cat $x; done

cat /boot/efi/EFI/redhat/grub.cfg |grep grubenv
cat /boot/grub2/grub.cfg |grep grubenv

m45733r commented 3 months ago

Sure, here you go:

/boot/grub2/grubenv
# GRUB Environment Block
# WARNING: Do not edit this file by tools other than grub-editenv!!!
saved_entry=bda9a182a36740ada28baaa218d5c09d-5.15.0-207.156.6.el9uek.x86_64
boot_success=1
boot_indeterminate=0
##################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################

/boot/efi/EFI/redhat/grub.cfg

if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
# The kernelopts variable should be defined in the grubenv file. But to ensure that menu
# without a grubenv file, define a fallback kernelopts variable if this has not been set.
# The kernelopts variable in the grubenv file can be modified using the grubby tool or by
# the kernelopts variable in the grubenv file and the fallback kernelopts variable.

/boot/grub2/grub.cfg

if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
# The kernelopts variable should be defined in the grubenv file. But to ensure that menu
# without a grubenv file, define a fallback kernelopts variable if this has not been set.
# The kernelopts variable in the grubenv file can be modified using the grubby tool or by
# the kernelopts variable in the grubenv file and the fallback kernelopts variable.

aburmash commented 3 months ago

OK, everything above looks correct. Now ls /boot/loader/entries/

It seems you have some redundant entries there.
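
One thing that can be verified from the grubenv output above is whether `saved_entry` points at an entry file that actually exists. A minimal sketch, assuming `saved_entry` holds an entry id (as it does in the output above) rather than a menu index or title; `saved_entry_has_file` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if saved_entry in the given grubenv file
# matches an existing <id>.conf file in the loader entries directory.
#   $1 = path to grubenv, $2 = path to the loader entries directory
saved_entry_has_file() {
    local grubenv="$1" entries_dir="$2" saved
    saved=$(sed -n 's/^saved_entry=//p' "$grubenv" | head -n 1)
    [ -n "$saved" ] && [ -f "$entries_dir/$saved.conf" ]
}

# Example (commented out):
# saved_entry_has_file /boot/grub2/grubenv /boot/loader/entries \
#     || echo "saved_entry does not match any entry file"
```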

m45733r commented 3 months ago

[root@ol9-machine grub2]# ls -al /boot/loader/entries/
total 28
drwx------. 2 root root 4096 Jun 25 13:34 .
drwxr-xr-x. 3 root root   21 Oct 17  2022 ..
-rw-r--r--. 1 root root  440 May 22 13:59 495620e0609f491080cb4e769e86283d-0-rescue.conf
-rw-r--r--. 1 root root  381 May 22 13:59 495620e0609f491080cb4e769e86283d-5.14.0-284.30.1.el9_2.x86_64.conf
-rw-r--r--. 1 root root  428 May 22 13:59 495620e0609f491080cb4e769e86283d-5.15.0-200.131.27.el9uek.x86_64.conf
-rw-r--r--. 1 root root  405 May 22 13:59 bda9a182a36740ada28baaa218d5c09d-0-rescue.conf
-rw-r--r--. 1 root root  381 Jun 25 10:18 bda9a182a36740ada28baaa218d5c09d-5.14.0-427.22.1.el9_4.x86_64.conf
-rw-r--r--. 1 root root  424 Jun 25 10:19 bda9a182a36740ada28baaa218d5c09d-5.15.0-207.156.6.el9uek.x86_64.conf

Oh, here's the problem. Sorry for bothering you, but thanks for pointing me in the right direction. It looks like some script or person regenerated the machine-id a few weeks ago...
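
The listing above shows entry files carrying two different prefixes, only one of which matches the current machine-id. A small sketch of how such leftovers could be spotted, assuming entry files are named `<machine-id>-<version>.conf` as in the listing; `stale_bls_entries` is a hypothetical helper, and it only prints candidates without removing anything:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print entry files whose name does not start with
# the given machine-id.
#   $1 = loader entries directory, $2 = current machine-id
stale_bls_entries() {
    local entries_dir="$1" machine_id="$2" f
    for f in "$entries_dir"/*.conf; do
        [ -e "$f" ] || continue            # directory has no .conf files
        case "$(basename "$f")" in
            "$machine_id"-*) ;;            # belongs to the current machine-id
            *) printf '%s\n' "$f" ;;       # left over from another machine-id
        esac
    done
}

# Example (commented out):
# stale_bls_entries /boot/loader/entries "$(cat /etc/machine-id)"
```

Review the printed files before deleting or moving any of them.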

aburmash commented 3 months ago

For everyone tracking this issue: a grub2 update that does NOT contain the scriptlet bug and, at the same time, resolves the issue for people who installed the broken package but did not reboot has been published to the public repositories:

version is 2.06-80.0.3.el9_4
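
To confirm that a machine already has at least this version, the installed version-release string can be compared against it. This is a minimal sketch: `version_at_least` is a hypothetical helper, it relies on GNU `sort -V` (which only approximates RPM's own version comparison and ignores epochs), and the package name `grub2-common` in the commented example is an assumption about which grub2 package to query.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version string $1 sorts at or after $2
# according to GNU sort -V (an approximation of RPM version ordering).
version_at_least() {
    local installed="$1" wanted="$2"
    [ "$(printf '%s\n%s\n' "$wanted" "$installed" | sort -V | head -n 1)" = "$wanted" ]
}

# Example (commented out; package name is an assumption):
# installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' grub2-common)
# version_at_least "$installed" "2.06-80.0.3.el9_4" && echo "fixed grub2 installed"
```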