home-assistant / operating-system

:beginner: Home Assistant Operating System
Apache License 2.0

GRUB failing to load kernel on Intel Atom boards (Intel NM10 chipset) #3305

Closed HAPSagan closed 2 months ago

HAPSagan commented 6 months ago

Describe the issue you are experiencing

I see GNU GRUB with 4 options - Slot A, Slot B, Slot A rescue shell, Slot B rescue shell. Selecting any of them results in a message that it's unable to boot. I don't get any CLI options. I used Linux Reader to download the backups from the disk and then tried a fresh installation with OS 12.2. The result is the same as after the update - unable to boot from any slot. After that I did a fresh install with OS 12.1 and everything started fine again.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

12.1

Did you upgrade the Operating System.

Yes

Steps to reproduce the issue

  1. Do a fresh install with OS 12.2 or update to OS 12.2 and it's unable to boot.
  2. ...

Anything in the Supervisor logs that might be useful for us?

Can't get a log.

Anything in the Host logs that might be useful for us?

Can't get a log.

System information

No response

Additional information

No response

ahmetem commented 6 months ago

Same problem. HA OS doesn't boot after updating from 12.1 to 12.2.

Mpgod80 commented 6 months ago

Same issue here. After going from 12.1 to 12.2 I can't boot.

Onepamopa commented 6 months ago

Same here... When the VM reboots after the update - it selects "Slot B" which has "unknown filesystem". I have to manually select "Slot A" to boot - it boots 12.1 w/o an update.

agners commented 6 months ago

@HAPSagan @ahmetem @Mpgod80 you all are running on native x86-64 hardware? What hardware are you using?

@Onepamopa can you open a new issue along with information of your virtualization environment? It also seems yours behaved different as the old boot slot still worked (unlike the case OP reported).

ahmetem commented 6 months ago

Its specs are below. It was working smoothly until the last update. I tried a fresh installation with OS 12.2, but the result did not change. Slot A, Slot B, Slot A rescue shell, Slot B rescue shell: none of them work.

MacBook Air (11-inch, Mid 2011), A1370: 64 GB flash storage, 1.6 GHz dual-core Intel Core i5, 2 GB of 1333 MHz DDR3 onboard memory, Intel HD Graphics 3000.

agners commented 6 months ago

On a new installation, what happens exactly when you choose Slot A with HAOS 12.2? Anything written on the screen? Black screen? Reset?

Mpgod80 commented 6 months ago

@agners The hardware both HAPSagan and I are running on is:

Intel Atom CPU D2500 @ 1.86 GHz, 4.0 GB RAM, 60 GB SSD. Running the HAOS image directly on the SSD, flashed with balenaEtcher.

ahmetem commented 6 months ago

When I try to make a new installation, the same list appears. Slot A (OK=0 TRY=0) Slot B (OK=0 TRY=0) and after the selection is made, a line flashes on the screen.

Onepamopa commented 6 months ago

@HAPSagan @ahmetem @Mpgod80 you all are running on native x86-64 hardware? What hardware are you using?

@Onepamopa can you open a new issue along with information of your virtualization environment? It also seems yours behaved different as the old boot slot still worked (unlike the case OP reported).

My environment is a bit weird... Proxmox 6.x (EOL), LVM storage only (no lvm-thin or disk storage). I had to convert the qcow2 image into a "raw" LVM disk (which I also resized from 32G to 60G).

    qm importdisk VMID THE_DOWNLOADED_DISK.qcow2 Target_LVM_Storage --format raw
    qm resize VMID virtio1 +28G

(where virtio1 is the imported disk)

After this - the VM booted, HA utilized the whole storage perfectly.

Restored my backup, all good. All updates work fine, apart from the OS 12.2 one...

Here's the VM config

agent: 1,type=virtio
balloon: 0
bios: ovmf
boot: order=ide2;virtio1;net0
cores: 4
cpu: host,flags=+aes
cpuunits: 150000
efidisk0: NVME_SMS:vm-106-disk-1,size=4M
ide2: none,media=cdrom
machine: q35
memory: 4096
name: HAOS
net0: virtio=16:42:50:80:BF:E5,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=cf66a7bc-1960-4d89-b3c9-b08c01a9ddf4
sockets: 1
startup: order=0,up=30
usb0: host=3-1,usb3=1
virtio1: NVME_SMS:vm-106-disk-2,iothread=1,size=60G
vmgenid: 4cb1066f-7cdd-4604-8c63-f57c4042eb20

henrikcaesar commented 6 months ago

Same here 😱 Also running an old Intel atom with cpu integrated on the motherboard. Can’t get any info out of the system. If it would help I can get a live Linux usb.

ahmetem commented 6 months ago

I booted HA with a Super Grub2 Disk image on a USB disk. It showed all the boot menu entries, and I selected Slot A. It booted normally, but I don't know how to make this permanent or how to fix it. I think reinstalling GRUB, or maybe downgrading to 12.1, would fix it.

coc commented 6 months ago

Same problem on a ThinkCentre m93p.

SellSan commented 6 months ago

I have an HP T610 and the same issue. The only thing I can do is run commands from this window and edit the parameters of these 4 options. 20240413_085029

Onepamopa commented 6 months ago

I Have HP T630 and the same issue. The only I can do I can run command from this window and can edit these 4 options parameters. 20240413_085029

I have the same screen but when I select "Slot A" it boots into 12.1. Have you tried?

Mpgod80 commented 6 months ago

I Have HP T630 and the same issue. The only I can do I can run command from this window and can edit these 4 options parameters. 20240413_085029

I have the same screen but when I select "Slot A" it boots into 12.1. Have you tried?

Tried that but that wont work for me :/

SellSan commented 6 months ago

I Have HP T630 and the same issue. The only I can do I can run command from this window and can edit these 4 options parameters. 20240413_085029

I have the same screen but when I select "Slot A" it boots into 12.1. Have you tried?

I tried, none of the options worked...

JitteryDoodle commented 6 months ago

I have an Intel NUC D54250WYK and also experienced this issue when upgrading to 12.2. Luckily, selecting Slot A booted back into 12.1.

Omri-Kleynhans commented 6 months ago

Even downloading haos_generic-x86-64-12.2.img and creating a new installation does not work. Boot is not possible. Same error as after the upgrade from 12.1 to 12.2.

Onepamopa commented 6 months ago

So... they screwed up this update, bad... :) Let's see who'll take the blame. Hopefully this isn't a push to "buy HA-approved hardware" ...

agners commented 6 months ago

It sounds like a GRUB2 issue of some kind. That got updated from GRUB 2.06 to 2.12 as far as I can tell.

FWIW, the rc (and release) versions have been tested and are working fine on various Intel NUC systems. It seems a particular hardware/BIOS which causes issue.

It kinda reminds me of that bug fix we added in 8.0.rc4 (https://github.com/home-assistant/operating-system/issues/1830#issuecomment-1119718325), but it seems that is applied upstream in GRUB 2.12 today, so that should no longer be the problem :thinking:

Can you replace the GRUB binary (on the first partition of the disk at /EFI/BOOT/bootx64.efi) with the version from HAOS 12.1 and see if this boots?

Omri-Kleynhans commented 6 months ago

It sounds like a GRUB2 issue of some kind. That got updated from GRUB 2.06 to 2.12 as far as I can tell.

FWIW, the rc (and release) versions have been tested and are working fine on various Intel NUC systems. It seems a particular hardware/BIOS which causes issue.

It kinda reminds me of that bug fix we added in 8.0.rc4 (#1830 (comment)), but it seems that is applied upstream in GRUB 2.12 today, so that should no longer be the problem 🤔

Can you replace the GRUB binary (on the first partition of the disk at /EFI/BOOT/bootx64.efi) with the version from HAOS 12.1 and see if this boots?

Yes, I have done that and replaced all files at `/EFI/BOOT/`. It does work. The server can boot again and HA is back up. Thanks.

SellSan commented 6 months ago

It sounds like a GRUB2 issue of some kind. That got updated from GRUB 2.06 to 2.12 as far as I can tell.

FWIW, the rc (and release) versions have been tested and are working fine on various Intel NUC systems. It seems a particular hardware/BIOS which causes issue.

It kinda reminds me of that bug fix we added in 8.0.rc4 (#1830 (comment)), but it seems that is applied upstream in GRUB 2.12 today, so that should no longer be the problem 🤔

Can you replace the GRUB binary (on the first partition of the disk at /EFI/BOOT/bootx64.efi) with the version from HAOS 12.1 and see if this boots?

Yes, I have done that and replaced all files at `/EFI/BOOT/`. It does work. The server can boot again and HA is back up. Thanks.

Hi, could you share where I can download that file from?

pierrepaap commented 6 months ago

Same issue. Hardware: Intel D525MW motherboard (BIOS from 2010). A bit more history here: https://community.home-assistant.io/t/os-12-2-upgrade-left-ha-on-grub-menu-unbootable/715974/17

I'm surprised though that the 'backup' entry also does not start. I would assume that the EFI file would still be from 12.1 ?

Onepamopa commented 6 months ago

So... any viable solution to update to 12.2 ? Or we should wait for 12.3..

pierrepaap commented 6 months ago

So... any viable solution to update to 12.2 ? Or we should wait for 12.3..

Well... after replacing the GRUB EFI file with the one from the 12.1 image, I am now on 12.2. Obviously a hassle, so I'm going to stay like this until this bug is fixed.

 Core 2024.4.2
Supervisor 2024.04.0
Operating System 12.2
Frontend 20240404.1

EDIT: corrected typo 'I am now in 12.2'

Onepamopa commented 6 months ago

So... any viable solution to update to 12.2 ? Or we should wait for 12.3..

well... after replacing the uboot efi with the one of the 12.1 image I am not in 12.2 Obivously a hassle. So 'im going to stay like this until this bug is fixed

 Core 2024.4.2
Supervisor 2024.04.0
Operating System 12.2
Frontend 20240404.1

Ugh.... it says you're on 12.2 ?

pierrepaap commented 6 months ago

So... any viable solution to update to 12.2 ? Or we should wait for 12.3..

well... after replacing the uboot efi with the one of the 12.1 image I am not in 12.2 Obivously a hassle. So 'im going to stay like this until this bug is fixed

 Core 2024.4.2
Supervisor 2024.04.0
Operating System 12.2
Frontend 20240404.1

Ugh.... it says you're on 12.2 ?

typo... I am NOW in 12.2

agners commented 6 months ago

I'm surprised though that the 'backup' entry also does not start. I would assume that the EFI file would still be from 12.1 ?

We only ship a single boot loader. In a way, we rely on the bootloader to implement the backup boot method. But if that new bootloader fails in some ways, it is essentially game over :cry:

Hi, could you share where from I can download that file?

These are the two GRUB boot loader files from HAOS 12.1 (replace the existing ones in /EFI/BOOT on the first boot partition with these files): grub-2.06-haos-12.1.zip

Are those affected UEFI BIOS'es maybe 32-bit BIOSes? E.g. can someone check if only replacing bootia32.efi helps (so maybe this is related to #1752)?
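One way to answer the 32-bit question, as a minimal sketch: a Linux system that was booted through UEFI exposes the firmware word size in `/sys/firmware/efi/fw_platform_size`. The function name `uefi_bitness` and its optional path argument are hypothetical, added only so the check can be tried without real firmware. It assumes a live Linux USB that actually boots in UEFI mode on the affected machine, which may not be possible on every 32-bit UEFI board.

```shell
#!/bin/sh
# Report whether the running firmware is 32-bit or 64-bit UEFI.
# The optional path argument is an assumption for testing; the default is the real sysfs file.
uefi_bitness() {
    size_file="${1:-/sys/firmware/efi/fw_platform_size}"
    if [ -r "$size_file" ]; then
        # The kernel writes "32" or "64" here when it was started through UEFI.
        printf '%s-bit UEFI\n' "$(cat "$size_file")"
    else
        # File absent: legacy BIOS/CSM boot, or sysfs is not available.
        printf 'not booted via UEFI\n'
        return 1
    fi
}

# Usage from a live system:
#   uefi_bitness
```

If the result is 32-bit, the firmware would be expected to load bootia32.efi rather than bootx64.efi.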

agners commented 6 months ago

Same here... When the VM reboots after the update - it selects "Slot B" which has "unknown filesystem". I have to manually select "Slot A" to boot - it boots 12.1 w/o an update.

@Onepamopa your case is different. Everyone else here can't boot either slot. In your case it is not the boot loader failing, but somehow that new boot slot did not get written properly. Also yours is a VM (ova image), whereas everyone else is running the generic-x86-64 image on native hardware. Again, please open a new issue for your case. It is very hard for us to track bugs when folks mix independent issues in a single issue report.

Onepamopa commented 6 months ago

Same here... When the VM reboots after the update - it selects "Slot B" which has "unknown filesystem". I have to manually select "Slot A" to boot - it boots 12.1 w/o an update.

@Onepamopa your case is different. Anyone else here can't boot either slot. In your case it is not the boot loader failing, but somehow that new boot slot did not get written properly. Also your's is a VM (ova image), whereas everyone else is runnning the generic-x86-64 image on native hardware. Again, please open a new issue for your case. It is very hard for us to track bugs when folks mix independent issue in a single issue report.

Yes, I downloaded haos_ova-12.1.qcow2.xz however the symptoms look like they come from the exact same issue - something in the bootloader of 12.2 caused this, don't you agree?

Also, considering maybe I'm the only one experiencing this all-so-similar issue with a VM, that wouldn't generate much attention now, would it?

agners commented 6 months ago

Yes, I downloaded haos_ova-12.1.qcow2.xz however the symptoms look like they come from the exact same issue - something in the bootloader of 12.2 caused this, don't you agree?

It points to 12.2, I am not sure about the bootloader part: The new boot loader seems to be still able to boot the old boot slot. So the boot loader per-se is capable of booting a HAOS. This is fundamentally different from the other reports: The bootloader seems not to be able to boot even the old boot slot.

Also, when you read the other messages, people who replace just the boot loader are able to boot into 12.2 then. So for them 12.2 works, it is just the bootloader which makes troubles.

Also, considering maybe I'm the only one experiencing this all-so-similar issue with a VM, that wouldn't generate much attention now, would it?

I do have ideas how to debug your case, but cluttering this issue with logs from a different case is just not the way it works. So I won't ask you here. This is about a different issue. If you want me to seriously look into your case, then you'll have to open a new issue.

Shad0wguy commented 6 months ago

Same issue here. Both Slot A and B show error: cannot load image. Also running on an Atom based system as others have reported.

Shad0wguy commented 6 months ago

I'm surprised though that the 'backup' entry also does not start. I would assume that the EFI file would still be from 12.1 ?

We only ship a single boot loader. In a way, we rely on the bootloader to implement the backup boot method. But if that new bootloader fails in some ways, it is essentially game over 😢

Hi, could you share where from I can download that file?

These are the two GRUB boot loader files from HAOS 12.1 (replace the existing ones in /EFI/BOOT on the first boot partition with these files): grub-2.06-haos-12.1.zip

Are those affected UEFI BIOS'es maybe 32-bit BIOSes? E.g. can someone check if only replacing bootia32.efi helps (so maybe this is related to #1752)?

I believe my system (Atom D525) does have a 32-bit bios even though the cpu is 64-bit. Could that be the cause?

Shad0wguy commented 6 months ago

I'm surprised though that the 'backup' entry also does not start. I would assume that the EFI file would still be from 12.1 ?

We only ship a single boot loader. In a way, we rely on the bootloader to implement the backup boot method. But if that new bootloader fails in some ways, it is essentially game over 😢

Hi, could you share where from I can download that file?

These are the two GRUB boot loader files from HAOS 12.1 (replace the existing ones in /EFI/BOOT on the first boot partition with these files): grub-2.06-haos-12.1.zip

Are those affected UEFI BIOS'es maybe 32-bit BIOSes? E.g. can someone check if only replacing bootia32.efi helps (so maybe this is related to #1752)?

I booted into a live image on my HAOS pc and mounted the disk. None of the partitions seem to have a /EFI/BOOT path.

sairon commented 6 months ago

I believe my system (Atom D525) does have a 32-bit bios even though the cpu is 64-bit. Could that be the cause?

The discrepancy itself isn't really problematic but in this particular case yes - it seems that this issue only affects systems that only have 32bit UEFI support - thus the request to check if replacing just the bootia32.efi helps (while bootx64.efi shouldn't be needed at all). 32bit UEFI isn't (thankfully) that common and there are probably more factors involved that caused the issue to be discovered only now.

I booted into a live image on my HAOS pc and mounted the disk. None of the partitions seem to have a /EFI/BOOT path.

The partition might be hidden from file managers because of the flags it has set, but it should always be present - it's always the first 32 MiB partition and you should be able to mount it with e.g. sudo mount /dev/sdb1 /mnt/boot (you might need to replace sdb with the disk's identifier and /mnt/boot with an empty/different folder).
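As a rough sketch of locating that first partition from a live system: the helper `first_partition` below is hypothetical and only encodes the usual Linux naming rule (disks whose name ends in a digit, such as nvme0n1 or mmcblk0, take a `p` before the partition number). The device `/dev/sda` and the mount point `/mnt/boot` in the usage comments are assumptions; substitute whatever `lsblk` shows for the HAOS disk.

```shell
#!/bin/sh
# Print the device node of the first partition on a given disk (hypothetical helper).
first_partition() {
    disk="$1"
    case "$disk" in
        # Names ending in a digit (nvme0n1, mmcblk0) use a "p" separator.
        *[0-9]) printf '%sp1\n' "$disk" ;;
        # Names ending in a letter (sda, vdb) append the number directly.
        *)      printf '%s1\n' "$disk" ;;
    esac
}

# Usage as root from a live system (device and mount point are assumptions):
#   lsblk -o NAME,SIZE,FSTYPE /dev/sda     # the boot partition is the first one, about 32 MiB
#   mkdir -p /mnt/boot
#   mount "$(first_partition /dev/sda)" /mnt/boot
#   ls /mnt/boot/EFI/BOOT
```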

Shad0wguy commented 6 months ago

I believe my system (Atom D525) does have a 32-bit bios even though the cpu is 64-bit. Could that be the cause?

The discrepancy itself isn't really problematic but in this particular case yes - it seems that this issue only affects systems that only have 32bit UEFI support - thus the request to check if replacing just the bootia32.efi helps (while bootx64.efi shouldn't be needed at all). 32bit UEFI isn't (thankfully) that common and there are probably more factors involved that caused the issue to be discovered only now.

I booted into a live image on my HAOS pc and mounted the disk. None of the partitions seem to have a /EFI/BOOT path.

The partition might be hidden from file managers because of the flags it has set, but it should always be present - it's always the first 32 MiB partition and you should be able to mount it with e.g. sudo mount /dev/sdb1 /mnt/boot (you might need to replace sdb with the disk's identifier and /mnt/boot with an empty/different folder).

You were right. I had to mount sda1. I tried copying just the bootia32.efi but that gave the same result. I then added bootx64.efi as well and it booted. Though I am back on 12.1.

agners commented 6 months ago

You were right. I had to mount sda1. I tried copying just the bootia32.efi but that gave the same result. I then added bootx64.efi as well and it booted. Though I am back on 12.1.

Hm, so it seems to be the 64-bit GRUB in your case then :thinking:

You can probably boot into 12.2 by manually selecting the other slot at bootup.

SellSan commented 6 months ago

I'm surprised though that the 'backup' entry also does not start. I would assume that the EFI file would still be from 12.1 ?

We only ship a single boot loader. In a way, we rely on the bootloader to implement the backup boot method. But if that new bootloader fails in some ways, it is essentially game over 😢

Hi, could you share where from I can download that file?

These are the two GRUB boot loader files from HAOS 12.1 (replace the existing ones in /EFI/BOOT on the first boot partition with these files): grub-2.06-haos-12.1.zip

Are those affected UEFI BIOS'es maybe 32-bit BIOSes? E.g. can someone check if only replacing bootia32.efi helps (so maybe this is related to #1752)?

I replaced the bootx64.efi file and the system started. Now I am on version 12.1.

sairon commented 6 months ago

We're trying to gather as much information as possible and get a device where the issue can be reproduced, as this one will be harder to debug remotely and simply reverting the upgrade is not viable in the long term. From the reports here, these need some clarification:

@HAPSagan @Mpgod80 - you only mentioned Intel Atom CPU D2500 - can you please clarify what board/device is it on?

@henrikcaesar - as well, you only said it's an "old Intel Atom motherboard" - could you please be more specific?

@Omri-Kleynhans - you shared no details on the hardware used, could you do so, please?

@SellSan - your device is an odd-ball here, as it shows the same symptoms but has a different CPU than all those Intel Atom-based devices (AMD GX-420GI). Would it be possible to check whether a fresh install of HAOS 12.2 really doesn't boot? Also, can you check what BIOS version it is running? I remember seeing more people running these, so I wonder why there are no more reports around. Anyone else running the same device can also chime in.

@coc - you stated ThinkCentre m93p which should be an Intel i5 4th Gen. Does it indeed show the same symptoms and only replacing the EFI files helped?

@JitteryDoodle - same as the above - it's an Intel i5 and booting into the other slot should not be possible if the cause is the same as discussed in the issue. Ideally please try performing another upgrade, and if it fails again, open a new issue with logs from the previous boot (ha host logs -b-1 -n 10000).

henrikcaesar commented 6 months ago

@sairon found the 📦 D525MW board with Intel Atom D525, from 2010 it seems.

IMG_8564

Shad0wguy commented 6 months ago

@sairon found the 📦 D525MW board with Intel Atom D525, from 2010 it seems.

This is the same board I have with this issue.

aman-sandhu commented 6 months ago

Sorry to barge in. I am having the same problem with my hp thin client t620. Nothing would work so I had to downgrade.

sairon commented 6 months ago

@aman-sandhu That is strange, because another user in #3313 is using T620 and update to 12.2 "only" broke his WiFi.

@bearhntr, sorry for the tag here, but maybe you could check with @aman-sandhu if there are any differences between your setups? Especially the BIOS version might have the biggest impact. Or maybe the boot device used?

Mpgod80 commented 6 months ago

@sairon I have found this about the motherboard: Intel D2500CC AAG81477-401, from OCT 2013

bearhntr commented 6 months ago

@aman-sandhu That is strange, because another user in #3313 is using T620 and update to 12.2 "only" broke his WiFi.

@bearhntr, sorry for the tag here, but maybe you could check with @aman-sandhu if there are any differences between your setups? Especially the BIOS version might have the biggest impact. Or maybe the boot device used?

The 'default' card in the HP T620 is not the Intel AX210 it is a Broadcom card (Realtek chipset) that only supports WiFi 2.4 and 5 and has no Bluetooth. I replaced my card with the AX210 - which works great and it has WiFi 6E and BT 5.3. It was immediately detected, and booting into HA - immediately showed me that BT was there - and the WiFi was showing as well.

When I upgraded to OS v12.2 from v12.1 and rebooted, the BT was still there but the WiFi was gone. I dropped the OS back to v12.1 from the CLI (ha os update --version 12.1) and the WiFi came back.

sairon commented 6 months ago

@bearhntr

I replaced my card with the AX210 ...

That should not make a difference here. In short - this issue is that for some people the HP T620 (and HP T630, which seems quite similar in specs) fails to boot completely. This appears to be UEFI firmware issue, that's why I am curious if we could find any differences in the BIOS firmware version or in the setup (e.g. booting from the internal vs. external drive) that would explain it.

SellSan commented 6 months ago

@sairon I set up my HAOS and really don't want to play with it. I like it when everything is fine, and I don't have time to play with it when it breaks, so I'm glad I could replace only one file and make it work. I have my setup done and expect it all to work fine without touching it. If it helps, I took photos of the BIOS settings; maybe they will be of some help to you. I'm not a Linux specialist at all and I don't want to run into serious issues with HAOS.

[16 photos of the BIOS settings attached; the last one shows the System Information screen]

bearhntr commented 6 months ago

@bearhntr

I replaced my card with the AX210 ...

That should not make a difference here. In short - this issue is that for some people the HP T620 (and HP T630, which seems quite similar in specs) fails to boot completely. This appears to be UEFI firmware issue, that's why I am curious if we could find any differences in the BIOS firmware version or in the setup (e.g. booting from the internal vs. external drive) that would explain it.

When I installed it, I went into the BIOS (I have the latest installed: sp115305.exe [version L40_0219]). I restored Factory Defaults, and then disabled SERIAL, PRINTER, the SATA port not used by the M.2 SSD in there, and the internal USB ports. I also disabled anything related to LEGACY, cleared the TPM stuff, and disabled SECURE BOOT.

I installed it by using RUFUS to write the image to the SSD (while it was out of the machine). I then plugged it in and booted it - took about 20 mins to complete and then I got the web page. Been running this way for nearly 2 years.

aman-sandhu commented 6 months ago

@aman-sandhu That is strange, because another user in #3313 is using T620 and update to 12.2 "only" broke his WiFi.

@bearhntr, sorry for the tag here, but maybe you could check with @aman-sandhu if there are any differences between your setups? Especially the BIOS version might have the biggest impact. Or maybe the boot device used?

I don't use Wi-Fi, and I don't have a Wi-Fi/BT card installed. I have an Ethernet connection. After updating to 12.2, it's giving me the same behavior as the OP describes: I see GNU GRUB with 4 options - Slot A, Slot B, Slot A rescue shell, Slot B rescue shell. Selecting any of them results in a message that it's unable to boot. I don't get any CLI options.

sairon commented 6 months ago

@SellSan Well, the last image shows it is not HP T630 but HP T610, with an entirely different CPU. Quite a difference (EDIT: sorry, I notice you corrected your original post few days later - but I still remembered it was T630). However, it also shows the BIOS is at the latest version available from HP, so it's unlikely an update will fix that. The issue is that even though you replaced the file, another OS update will break it again, unless we find the root cause of the problem.

@aman-sandhu Can you also check the BIOS version in the System Information (i.e. as on last picture above)? The version should be 00.02.19 (or similar). If not, try updating to the latest BIOS first, as this one seems to work correctly.