home-assistant / operating-system

:beginner: Home Assistant Operating System
Apache License 2.0

GRUB failing to load kernel on Intel Atom boards (Intel NM10 chipset) #3305

Closed HAPSagan closed 2 months ago

HAPSagan commented 6 months ago

Describe the issue you are experiencing

I see GNU GRUB with 4 options - Slot A, Slot B, Slot A rescue shell, Slot B rescue shell. Selecting any of them results in a message that it's unable to boot. I don't get any CLI options. I used Linux Reader to download the backups from the disk and then tried a fresh installation with OS 12.2. The result is the same as after the update - unable to boot from any slot. After that I did a fresh install with OS 12.1 and everything started fine again.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

12.1

Did you upgrade the Operating System.

Yes

Steps to reproduce the issue

  1. Do a fresh install with OS 12.2 or update to OS 12.2 and it's unable to boot.
  2. ...

Anything in the Supervisor logs that might be useful for us?

Can't get a log.

Anything in the Host logs that might be useful for us?

Can't get a log.

System information

No response

Additional information

No response

ChernyaevAN commented 6 months ago

I have the same problem with this motherboard and processor: DN2800MT_TechProdSpec05.pdf. I use an mSATA SSD, but a USB flash drive with HAOS gives the same error. The BIOS version is the latest.

mwevromans commented 6 months ago

Same issue here. HP T620 with an AMD GX-415GA processor.

Solution (that worked for me)

I booted my thin client from a USB stick with Ubuntu (search online for how to make a bootable USB). At first I could not find the drive that held the /EFI/BOOT/boot* files; I had to mount /dev/sda1 (it could be different on your machine). First create a directory on the local drive and mount the partition there:
cd /
sudo mkdir /mount01
sudo mount /dev/sda1 /mount01
Then I could go to /mount01, found the files there and replaced them with the old ones (you can download these above in one of the posts). I copied them to a USB stick first and from there to the new mount.

After a reboot it worked.

I hope this explanation helps other users with this issue... I had to search a bit on how to do this.
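
For anyone unsure whether they mounted the right partition, the steps above could be sketched roughly like this. It is only an illustration: /dev/sda1 and /mount01 are the names from the comment and may well differ on your machine, and `has_boot_files` is a made-up helper name, not part of any tool.

```shell
#!/bin/sh
# Sketch only: meant for a live Linux USB system. Device names are assumptions.

# has_boot_files DIR: succeed if DIR contains EFI/BOOT/boot* files.
# Names are matched case-insensitively because the folder may show up
# as EFI/BOOT or efi/boot.
has_boot_files() {
    find "$1" -maxdepth 3 -type f -ipath '*/efi/boot/boot*' 2>/dev/null | grep -q .
}

# Typical use on the live system (commented out, needs root and the real disk):
#   sudo mkdir /mount01
#   sudo mount /dev/sda1 /mount01
#   has_boot_files /mount01 && echo "boot files found under /mount01"
#   sudo umount /mount01
```

If the check fails, the partition is probably not the boot partition; `lsblk` lists the other candidates.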

dunka commented 6 months ago

I have the same problem on an Atom D2500, probably also GRUB. I rolled back to 11.5 and things are fine.

My board is an Intel D2500CCE Mini-ITX motherboard; the D2500 is a 64-bit CPU.

aman-sandhu commented 6 months ago

> @aman-sandhu Can you also check the BIOS version in the System Information (i.e. as on last picture above)? The version should be 00.02.19 (or similar). If not, try updating to the latest BIOS first, as this one seems to work correctly.

I have updated the BIOS on my HP thin client T620 to version L40 00.02.19. It's all working fine now. Thank you for your help.

feinadam commented 6 months ago

Same issue with Intel® Desktop Board D2700DC.

AlexanderSmi commented 6 months ago

Same problem. Intel D2500HN motherboard, SSD - tried different ones. The haos_generic-x86-64-12.2.img image was written by balenaEtcher. BIOS - I also tried all the versions that I could find. BIOS settings - checked all reasonable ones. The result, unfortunately, is negative.

MrBasque commented 6 months ago

> @aman-sandhu Can you also check the BIOS version in the System Information (i.e. as on last picture above)? The version should be 00.02.19 (or similar). If not, try updating to the latest BIOS first, as this one seems to work correctly.
>
> I have updated the BIOS on my hp thin client t620 to the version L40 00.02.19. It's all working fine now. Thank you for your help.

Thank you @sairon and @aman-sandhu, updating the BIOS (L40 0219) and resetting the BIOS settings to default did the trick for my HP T620.

SellSan commented 6 months ago

> @SellSan Well, the last image shows it is not HP T630 but HP T610, with an entirely different CPU. Quite a difference (EDIT: sorry, I notice you corrected your original post a few days later - but I still remembered it was T630). However, it also shows the BIOS is at the latest version available from HP, so it's unlikely an update will fix that. The issue is that even though you replaced the file, another OS update will break it again, unless we find the root cause of the problem.
>
> @aman-sandhu Can you also check the BIOS version in the System Information (i.e. as on last picture above)? The version should be 00.02.19 (or similar). If not, try updating to the latest BIOS first, as this one seems to work correctly.

Sorry about that, I made a mistake: I have an HP T610. I'm totally green and don't know much, but maybe it is something related to the BIOS settings explained in this YT video.

pierrepaap commented 6 months ago

Any news for the Intel D525MW board? Are we aiming at an upstream fix for those?

IMDvevey commented 6 months ago

> We're trying to gather as much information as possible and get a device where the issue can be reproduced, as this one will be harder to debug remotely and simply reverting the upgrade is not viable in the long term. From the reports here, these need some clarification:
>
> @HAPSagan @Mpgod80 - you only mentioned Intel Atom CPU D2500 - can you please clarify what board/device it is on?
>
> @henrikcaesar - as well, you only said it's an "old Intel Atom motherboard" - could you please be more specific?
>
> @Omri-Kleynhans - you shared no details on the hardware used, could you do so, please?
>
> @SellSan - your device is an odd-ball here, as it shows the same symptoms but has a different CPU than all those Intel Atom-based devices (AMD GX-420GI). Would it be possible to check if a fresh install of HAOS 12.2 really doesn't boot? Also, can you check what BIOS version it is running? I remember seeing more people running these, so I wonder why there are no more reports around. Also anyone else running the same device can chime in.
>
> @coc - you stated ThinkCentre m93p which should be an Intel i5 4th Gen. Does it indeed show the same symptoms and only replacing the EFI files helped?
>
> @JitteryDoodle - same as the above - it's an Intel i5 and booting into the other slot should not be possible if the cause is the same as discussed in the issue. Ideally please try performing another upgrade, and if it fails again, open a new issue with logs from the previous boot (ha host logs -b-1 -n 10000).

We're having the same issue with a similar Atom board with integrated components. The exact board was taken from a Fibaro HomeCenter 2 (It's an Intel DN2800MT Desktop Board).

After updating from 12.1 to 12.2, Home Assistant was no longer accessible. On inspection it showed the same issue described above: none of the four boot options (Slot A, Slot B, rescue A and rescue B) would boot. They either reported "No bootable image" or seemed to start booting but stopped at a black screen with a fixed cursor, with no way to type anything.

I'm not really great with Linux, so I tried copying the folders within the "Supervisor" folder and restoring them to a fresh install of 12.2 (the fresh install works great), at which point the Home Assistant UI is accessible but unusable (see the log file of all components freaking out): home-assistant_2024-04-23T12-15-39.892Z.log

Any help would be appreciated if you have any ideas.

agners commented 6 months ago

@IMDvevey you can recover the device by just overwriting the boot loader on the first partition (see https://github.com/home-assistant/operating-system/issues/3305#issuecomment-2055918375).

sairon commented 6 months ago

The common factor of (almost) all those boards is the Intel NM10 chipset. I managed to get a D525MW board for testing and found a commit that is breaking it. Unfortunately, there is no BIOS update that could resolve the issue in the firmware, so until it is fixed upstream, the only option is reverting the patch. Thankfully, it doesn't seem to be that important for any other functionality, so reverting it might be a way to run GRUB 2.12 on the problematic boards.

Here's a patched version of the bootx64.efi, please give it a try by replacing the current one in /EFI/BOOT/ in the first boot partition: grub_2.12-patched.zip

@SellSan @mwevromans Could you please check whether the patched GRUB also loads the system on your HP T610/T620? Since it's a different chipset, there is a possibility it might be a different bug, but let's hope not. Please note that without verifying this, another OS update will break your installation again, so it's worth checking that we know what needs to be fixed.


By the way, when testing the board I also measured its power consumption, which is around 16 W when the device is sitting idle :exploding_head: That is an order of magnitude higher than an RPi, HA Green, or any modern SBC, which also have roughly 3-4 times the computing power. Keep it in mind when calculating the total operating costs of running such old hardware :)
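
To put the 16 W figure into operating-cost terms, here is a rough back-of-the-envelope sketch. The function names and the 0.30 per kWh price are assumptions made up for the example; plug in your own tariff.

```shell
#!/bin/sh
# annual_kwh WATTS: energy used in one year at a constant draw of WATTS.
annual_kwh() {
    awk -v w="$1" 'BEGIN { printf "%.2f\n", w * 24 * 365 / 1000 }'
}

# annual_cost WATTS PRICE_PER_KWH: yearly cost at an assumed tariff.
annual_cost() {
    awk -v w="$1" -v p="$2" 'BEGIN { printf "%.2f\n", w * 24 * 365 / 1000 * p }'
}

annual_kwh 16         # idle draw measured above -> 140.16 (kWh per year)
annual_cost 16 0.30   # hypothetical price of 0.30 per kWh -> 42.05
```

This treats the idle draw as constant all year, so it is a lower bound for a machine that does any real work.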

Shad0wguy commented 6 months ago

@sairon So the bug is in GRUB 2.12? Is there a bug report there that we can follow?

As for power, I used this just because I had it sitting around already doing nothing. Definitely will get something more power efficient when it dies.

ahmetem commented 6 months ago

I don't know my MacBook Air's chipset or BIOS settings.

MacBook Air (11-inch, Mid 2011) a1370 64GB flash storage 1.6GHz dual-core Intel Core i5 2GB of 1333MHz DDR3 onboard memory Advanced Intel HD Graphics 3000.

sairon commented 6 months ago

@Shad0wguy Yes, it is a GRUB bug. I will report it to the mailing list once I gather some more feedback here. However, since it needs to be picked up by us in HAOS anyway, you can simply follow this issue to see when it's resolved.

@ahmetem Can you please also check if it boots fine with the patched GRUB binary I posted above?

asjp commented 5 months ago

I did the HAOS update (from 12.0 to 12.2) this morning on my Raspberry Pi 4 and it now doesn't boot. Which sounds like the same issue... In the boot console I see the errors "Bad cluster number 0" "Firmware not found"

I read through all the comments above, but have been unable to get it to boot again. I tried replacing u-boot.bin with the bootx64.efi file from 12.1. Probably the wrong thing to do? That didn't seem to have any effect.

I can see a fix has been recently merged. Is there a way to get my system booting again in the meantime?

AlexanderSmi commented 5 months ago

"Here is the corrected version of the bootx64.efi file. Please try it by replacing the current version in /EFI/BOOT/ in the first boot partition: grub_2.12-patched.zip"

I replaced the file and it really works. I used another SSD for the test. But I have a question: if I make the same replacement on my working (configured) SSD, will all the HA settings be kept, or will I have to configure everything again? Or will I need to restore a backup?

sairon commented 5 months ago

@asjp That's all wrong from the very beginning and irrelevant to this issue. The error you saw is most probably caused by data corruption on the SD card. Raspberry Pi doesn't use the GRUB loader, so replacing the U-Boot binary with that will brick the device for sure. While this is not an OS bug per se, please open another issue with the description, I can suggest some recovery steps there so it's easier for anyone in the future who encounters it to follow.

@AlexanderSmi Thanks for checking! The files in the boot partition have no impact on your HA settings which are retained in a different partition. Of course, when you're doing something you're not sure about, backup is highly recommended, but in this case if you replace only the single file in the boot partition, you should be fine even without that.

AgentFire commented 5 months ago

Same issue for me. Everything died after HA decided to update.

> @ahmetem Can you please also check if it boots fine with the patched GRUB binary I posted above?

Hello. How do I replace the binaries? I don't know anything about GRUB.

agners commented 5 months ago

> Same issue for me. Everything died after HA decided to update.

What do you mean? If HA updated itself, then it was most likely a Supervisor update. Operating System updates are always user triggered.

Are you sure you encountered the same issue? What hardware are you running on?

AgentFire commented 5 months ago

Well yeah, I click on those "update supervisor, update ha core" etc, since you guys seem to be releasing some new stuff.

Yeah, the same issue, right after some otherwise typical update. image

This interface doesn't seem to support the cp command or cat with the > symbol, so there is no way to copy those two files mentioned before into the /efi/boot/ folder, although I can see them there with ls /efi/boot.

agners commented 5 months ago

Yeah this is the GRUB bootloader. I don't think it is possible to copy/write files to the boot partition from the GRUB2 console :cry:

So none of these options work? Then this seems indeed the same issue as documented here :cry: . What type of HW is this?

The process to replace the boot loader is to use a Linux live system (e.g. a bootable USB flash drive with Ubuntu) and access the first partition of the internal HAOS disk from there. You can find the previous bootloader version, along with the paths of the files to replace, in https://github.com/home-assistant/operating-system/issues/3305#issuecomment-2055918375.
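
As a rough sketch of that replacement step, something like the following could be run from the live system. It only handles bootx64.efi (the file named earlier in this thread); `replace_bootloader`, /dev/sda1 and /mnt/boot are illustrative assumptions, and the linked comment remains the authority on which files to replace.

```shell
#!/bin/sh
# replace_bootloader NEW_FILE BOOT_DIR
# Back up the existing bootx64.efi in BOOT_DIR, then copy NEW_FILE over it.
replace_bootloader() {
    new_file=$1
    boot_dir=$2
    target="$boot_dir/bootx64.efi"

    [ -f "$new_file" ] || { echo "missing replacement: $new_file" >&2; return 1; }
    [ -d "$boot_dir" ] || { echo "not a directory: $boot_dir" >&2; return 1; }

    # Keep the current loader so the change can be undone.
    if [ -f "$target" ]; then
        cp "$target" "$target.bak" || return 1
    fi
    cp "$new_file" "$target"
}

# On the live system (commented out; device and mount point are assumptions):
#   sudo mount /dev/sda1 /mnt/boot
#   replace_bootloader ./bootx64.efi /mnt/boot/EFI/BOOT   # run as root
#   sudo umount /mnt/boot
```

Keeping the .bak copy means a failed attempt can be reverted by copying it back the same way.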

AgentFire commented 5 months ago

Yeah, I've managed to boot a live Ubuntu from USB and replace those two files, although they were in the 5th partition (sda5). After that the OS started loading correctly. My hardware is some old mini-PC, with the generic-x86-64 HAOS build.

RenEdi commented 4 months ago

I had the same problem. What helped me was running "ha os update --version 12.1" and "ha core update --version 2024.6.0" in the Terminal.

The Fujitsu Esprimo Q920 works again, as before.

dsoveen commented 4 months ago

Am I right that there are two options to resolve this issue?

  1. Downgrade to OS 12.1
  2. Replace the bootloader with the file provided in a previous post by @sairon

I managed to downgrade my T620 to 12.1 using the CLI and that worked out fine.

I spent hours of online research trying to find out how to update the BIOS (and failed), but apparently that wouldn't have solved the issue after all.

Is there a solution that we can expect in a future release of the OS?

Grateful for your work.

sairon commented 3 months ago

Since we'd like to come up with a more robust solution that also doesn't affect other platforms, the patch that fixed this issue will be reverted and replaced by a different one that will eventually be adopted by upstream GRUB. Currently one of the GRUB maintainers asked for more information about the boards affected. To get that, a few commands need to be run on the device, but hopefully it's not too complicated.

Here's a dmidecode.gz which can be used to fetch the required information. Connect keyboard and display directly to the board (or use developer SSH access at port 22222; using the SSH/terminal add-on is not possible), in the HA shell enter login and then run the following commands:

curl -L https://github.com/user-attachments/files/16088348/dmidecode.gz | zcat > /tmp/dmidecode
chmod +x /tmp/dmidecode
mkdir /mnt/data/supervisor/homeassistant/www && ha core restart
/tmp/dmidecode > /mnt/data/supervisor/homeassistant/www/dmidecode.txt

The output can be then downloaded from http://[IP_OF_YOUR_INSTANCE]:8123/local/dmidecode.txt (adjust accordingly if you're using HTTPS, different port, etc.).

@feinadam @AlexanderSmi @HAPSagan @Mpgod80 @dunka @ChernyaevAN You all have different boards, so getting information from yours would be great, or even needed to ensure future HAOS won't break your install again. Thanks in advance for your help!
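
Before uploading, a quick sanity check that the file is the tool's text output (and not the binary) might look like this. It assumes the usual dmidecode behaviour of printing a "# dmidecode <version>" banner as its first line; the function name is made up for illustration.

```shell
#!/bin/sh
# looks_like_dmidecode_text FILE: succeed if FILE starts with the
# "# dmidecode <version>" banner that the tool normally prints first.
looks_like_dmidecode_text() {
    head -n 1 "$1" 2>/dev/null | grep -q '^# dmidecode'
}

# Example use on the device (path taken from the instructions above):
#   looks_like_dmidecode_text /mnt/data/supervisor/homeassistant/www/dmidecode.txt \
#       || echo "this does not look like dmidecode output"
```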

MohMah commented 3 months ago

@sairon this issue happens on my board too, I've attached the dmidecode file from my system dmidecode.txt

ChernyaevAN commented 3 months ago

mkdir: can't create directory '/mnt/data/supervisor/homeassistant/www': No such file or directory

sairon commented 3 months ago

@MohMah That's the dmidecode binary itself - we need the output of the binary, i.e. what the very last line does. Maybe you prepended the command with cat?

@ChernyaevAN Are you running the commands directly on the device, not through any of the SSH/Terminal addons? In that case /mnt/data/supervisor/homeassistant should be always there :thinking:
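
One way to tell whether you are in the right place is to test for that directory before running anything else. This is just a sketch; `on_haos_host` is a hypothetical helper, and the optional argument exists only so it can be tried against another path.

```shell
#!/bin/sh
# on_haos_host [CONFIG_DIR]: succeed if the Home Assistant configuration
# directory is visible, which is expected only in the host root shell.
on_haos_host() {
    [ -d "${1:-/mnt/data/supervisor/homeassistant}" ]
}

if on_haos_host; then
    echo "config directory found, looks like the host shell"
else
    echo "config directory missing, probably an add-on terminal or the HA CLI"
fi
```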

RenEdi commented 3 months ago

Can the command be run in the "Terminal" in HA?
I also get "mkdir: can't create directory '/mnt/data/supervisor/homeassistant/www': No such file or directory". Please, what should the command look like?

feinadam commented 3 months ago

Hello!

It seems I can check it locally in 3 weeks at the earliest. If there is an option that can be run via web SSH, let me know and I'll run it immediately.

sairon commented 3 months ago

@RenEdi No, once again, you need to run it directly on the device, i.e. with keyboard and display connected. The only other option is to use the developer SSH access at port 22222. No other option is viable, both standard and advanced SSH terminal addons do not have direct access to the device's memory/hardware.
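
For reference, a connection to the developer SSH port typically looks like the command printed by this sketch. It assumes the `root` user and that key-based access has already been set up as described in the HAOS developer documentation; the host name is a placeholder and `dev_ssh_command` is a made-up helper.

```shell
#!/bin/sh
# dev_ssh_command HOST: print the ssh invocation for the developer port.
dev_ssh_command() {
    printf 'ssh -p 22222 root@%s\n' "$1"
}

dev_ssh_command homeassistant.local   # placeholder host name
```

From a PuTTY session the equivalent is the same host with port 22222 instead of 22.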

(I've updated the previous post to make this a bit more clear)

RenEdi commented 3 months ago

Thank you for your help and the procedure. I have HA 12.1 installed on the SSD in my Fujitsu Esprimo Q920. So should I run your commands after starting HA 12.1 (with keyboard and display connected), or do I need to boot the Esprimo Q920 into some Linux distribution (Ubuntu, ...)?

Edit: or do I have to have HA 12.3 installed, the version I have problems with? I downgraded to HA 12.1, which works.

Once the "dmidecode.txt" file is created, can I access it using PuTTY so that I can post it here?

dunka commented 3 months ago

Turning mine on to get this. Also, your instructions should use mkdir -p, since the data directory isn't always there, so it'll need to create all the subdirectories.
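
The difference between the two forms can be seen in a throwaway directory. This sketch uses a temporary path rather than the real /mnt/data tree.

```shell
#!/bin/sh
# Demonstrate mkdir vs. mkdir -p in a temporary directory.
base=$(mktemp -d)

# Plain mkdir fails when the parent directories are missing...
if mkdir "$base/supervisor/homeassistant/www" 2>/dev/null; then
    plain_result=created
else
    plain_result=failed
fi

# ...while -p creates every missing component (and also succeeds if the
# directory already exists).
mkdir -p "$base/supervisor/homeassistant/www" && p_result=created

echo "mkdir: $plain_result, mkdir -p: $p_result"   # mkdir: failed, mkdir -p: created
rm -rf "$base"
```

One trade-off: with -p the command no longer fails when the parent is missing, so it will not warn you if you happen to be in the wrong shell.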

dunka commented 3 months ago

dmidecode.txt

output attached

sairon commented 3 months ago

@dunka Thanks, that seems correct now. However, using mkdir -p shouldn't be needed. If executed in proper environment (i.e. on the host directly), /mnt/data/supervisor/homeassistant must exist, because that's where the configuration folder of Home Assistant is - it can't be empty or non-existing.

@RenEdi This call is for users of Intel Atom boards. Esprimo Q920 should hopefully be fixed by removing the patch that fixed these boards. Anyway, maybe it could come in handy at some point as well. In that case you can also use any other Linux distribution, install dmidecode there and share the output. But it's not necessary - if you follow the above instructions step by step, you can get the output from HAOS too.

ChernyaevAN commented 3 months ago

I do not know what I'm doing wrong. 123

sairon commented 3 months ago

@ChernyaevAN You're still in the HA CLI, you need to switch to the root shell:

> in the HA shell enter login and then run the following commands:

ChernyaevAN commented 3 months ago

Here it is. dmidecode.txt

feinadam commented 3 months ago

Here is mine: dmidecode (1).txt

HAPSagan commented 3 months ago

Hello, I apologize for the delay, but here is the file I created. I hope it contains the information you are looking for. The same information also applies to Mpgod80, as we run on the same equipment. Thanks for your work. /Peter (HAPSagan)

dmidecode.txt

sairon commented 2 months ago

The latest dev build contains a patch that should resolve the issue without any side effects on other x86 boards (which was unfortunately the case with the previous solution, reverted 3 weeks ago). I have tested it on a D525MW, but if anyone with another board could test it (e.g. flash the latest dev build to a USB drive and boot from that), we'll be sure the next release doesn't brick a bunch of installations again.

sairon commented 2 months ago

Closing as resolved, updated patch is available in beta release 13.0.rc1.

RenEdi commented 2 months ago

After today's update to HA 13.0, my Fujitsu Esprimo Q920 is working fine again. Thank you all.

ChernyaevAN commented 2 months ago

After today's update to HA 13.0, my DN2800MT fails to boot. What can I do to restore it?

sairon commented 2 months ago

@ChernyaevAN In what way? When you connect a display to it, does it get stuck right after selecting a boot entry in the GRUB menu? If so, enter the GRUB command line (press c before it starts to boot automatically) and post the output of smbios --type 4 --get-qword 8 here. However, the patch includes the value from the output you posted here previously, so it sounds like something different.

ChernyaevAN commented 2 months ago

> @ChernyaevAN In what way? When you connect a display to it, does it get stuck right after selecting boot entry in GRUB menu? If so, enter GRUB command line (press c before it starts to boot automatically) and post the output of smbios --type 4 --get-qword 8 here. However, the patch includes the value from the output you posted here previously, so it sounds like something different.

13829424153406670433
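
For readers wondering what that number means: as far as I understand the SMBIOS specification, type 4 is the Processor Information structure and the 8 bytes at offset 8 are the Processor ID, which on x86 holds the EAX and EDX results of CPUID leaf 1. The sketch below only splits the decimal value into those two halves. How the HAOS patch actually uses the value is not shown in this thread, so treat the labels as one reading of the spec, not as the patch logic; `decode_processor_id` is a made-up name.

```shell
#!/bin/sh
# Decode the decimal qword printed by GRUB into its two 32-bit halves.
# printf's %x conversion is unsigned (at least in bash), so it copes with
# values above 2^63, where signed $(( )) arithmetic would overflow.
decode_processor_id() {
    hex=$(printf '%016x' "$1")
    low=$(printf '%s' "$hex" | cut -c9-16)    # low dword: CPUID EAX (signature)
    high=$(printf '%s' "$hex" | cut -c1-8)    # high dword: CPUID EDX (feature flags)
    echo "qword: 0x$hex"
    echo "low dword: 0x$low"
    echo "high dword: 0x$high"
}

decode_processor_id 13829424153406670433
# qword: 0xbfebfbff00030661
# low dword: 0x00030661
# high dword: 0xbfebfbff
```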

sairon commented 2 months ago

@ChernyaevAN Right, I know what's wrong :facepalm: I will create a patch for that, however, that means the original patch only works on D525 :grimacing: Unfortunately, no one with the affected boards answered my call for testing or tried the RC builds of 13.0.

ChernyaevAN commented 2 months ago

> @ChernyaevAN Right, I know what's wrong 🤦 I will create a patch for that, however, that means the original patch only works on D525 😬 Unfortunately, no one with the affected boards answered my call for testing or tried the RC builds of 13.0.

Sorry, but I'm not advanced enough to try RCs. Do I understand correctly that I just need to wait for a patch and instructions?

sairon commented 2 months ago

@ChernyaevAN It's a bit more complicated, since the current GRUB installed on your machine is faulty. You will need to connect the drive to a different PC, or boot any live USB distro on your Atom, and copy the files from the following archive to overwrite those in the /EFI/BOOT folder in the boot partition of HAOS.

grub2-nm10-fixed.zip

Sorry for the complications :cry: