home-assistant / operating-system

:beginner: Home Assistant Operating System

Unable to boot after upgrade to 11.0 on ODROID-M1 #2822

Closed a1j closed 11 months ago

a1j commented 11 months ago

Describe the issue you are experiencing

There are multiple reports of this issue, please read here.

I

What operating system image do you use?

odroid-m1 (Hardkernel ODROID-M1)

What version of Home Assistant Operating System is installed?

11.0

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Have 10.5 installed on an M1 that uses a dedicated SSD for data.
  2. Upgrade to 11.0 using the HASS UI.
  3. The system comes up with the root filesystem mounted read-only and the HASS data partition not mounted at all. If you log in at the system prompt (hass-cli does not work because containerd is waiting for the filesystem mount), you can mount the data directory manually. After that, homeassistant starts, but with the old OS.

If you reboot the system, you have to repeat the mount procedure.

I looked at the journalctl logs, and the only error I found is that the "hass filesystem mount by label" job times out and restarts.

BUT this may not mean anything, since this is a boot of the old hass version (the upgrade failed at some point).

When the UI comes up, it asks you to update to version 11.0 again.

It would be nice to know how to fix hass from this half-broken state.

Anything in the Supervisor logs that might be useful for us?

I cannot generate logs because the upgrade did not complete and the system now runs in a semi-broken state.

Anything in the Host logs that might be useful for us?

I looked at journalctl, and it says it "times out" while executing the "mount hass by label" job, and then it reports that a dependency is broken. But this can be an artifact of the incomplete upgrade; I cannot find logs of the upgrade anywhere.

System information

Odroid M1, SD card, dedicated data disk (SSD). Upgrade from 10.5 to 11.0 using the Hass Web UI.

Additional information

No response

prblase commented 11 months ago

The same happened to me. After a clean install I realized that HAOS cannot migrate to the SSD data disk, because the disk is not mounted on restart.

matjahs commented 11 months ago

Same issue here. The machine boots and says it's waiting for the HA CLI to start. Upon further inspection, it seems to be expecting a disk partition with the label hassos-data. However, the partition that should have that label is labeled hassos-data-dis. When I manually re-label the partition, it goes off and does its thing, sort of. After a reboot it still doesn't appear to be happy with my data disk. I ended up doing a clean install and restoring my most recent backup.

regan-a commented 11 months ago

I've hit the same issue. Noticed the same behaviour as @matjahs with the partition label mismatch. Happy to collect logs from my system if that helps get to the root of this, just let me know what to grab.

escoand commented 11 months ago

I took this manual workaround:

  1. switch to the second console with Ctrl+Alt+F2
  2. wait for the prompt
  3. enter root
  4. mount the data disk: mount /dev/nvme0n1p1 /mnt/data
  5. switch back to the first console with Ctrl+Alt+F1
  6. it should enter emergency mode
  7. enter login

With this I was able to get the system running again, but the update was not installed...
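
Before repeating the manual mount from step 4, it can help to check whether the data partition is already mounted. The following is a minimal sketch, not part of HAOS: the helper name is_mounted is made up for illustration, and it assumes a Linux-style mounts table such as /proc/mounts where the second field of each line is the mount point.

```shell
#!/bin/sh
# Hypothetical helper (not an HAOS tool): succeed if a mount point is listed
# in a mounts table.
#   $1: mount point to look for, e.g. /mnt/data
#   $2: mounts table to read; defaults to /proc/mounts
is_mounted() {
    mountpoint=$1
    table=${2:-/proc/mounts}
    # Field 2 of every /proc/mounts line is the mount point.
    awk -v mp="$mountpoint" '$2 == mp { found = 1 } END { exit !found }' "$table"
}
```

On an affected system this could be used as is_mounted /mnt/data || mount /dev/nvme0n1p1 /mnt/data, where the device name is the one from the workaround above and may differ on other setups.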

agners commented 11 months ago

Have you all been using the data disk feature when this happened on upgrade?

The relabel to hassos-data-dis should only happen when multiple disks are attached with the same label. It seems that this misfired in your cases (initiated by this script).

The fact that this happened even after restart @matjahs seems to indicate that this is reproducible. I am trying to reproduce it here now.

agners commented 11 months ago

I've installed HAOS 10.5 from scratch, added an NVMe, and then upgraded to 11.0. In my case things worked out fine.

However, I do have a suspicion what the problem is: The script above is triggered by the haos-data-disk-detach.service service. This service should only be run once on very first boot (as mandated by ConditionFirstBoot=yes).

Now for some reason, in your cases, the system thought the OS is booting for the first time :thinking:

The first boot is determined by the U-Boot boot loader on startup, if the machine id is not set. It is currently unclear to me how this could fail in some situations. Ideally I'd need boot logs captured via a serial console. @regan-a is that something you have the tools for?

Also, to all, do you use an eMMC or an SD card device?

regan-a commented 11 months ago

Hey @agners, thanks for looking into this! I'm running an SD + NVMe for data. Here is a dump of journalctl after boot. Let me know if you need anything else.

boot.txt

prblase commented 11 months ago

Hello @agners! I also use an Odroid M1 with an SD card and an NVMe SSD. I've already rolled back to 10.5. I couldn't do the data migration after a clean 11.0 install; it only worked for me on 10.5. I saw the same mislabeling issue (hassos-data-dis).

agners commented 11 months ago

Hey @agners, thanks for looking into this! I'm running an SD + NVMe for data. Here is a dump of journalctl after boot. Let me know if you need anything else.

From the logs I can confirm my suspicion: Your boot loader triggers first boot mode.

... systemd.machine_id= fsck.repair=yes systemd.condition-first-boot=true ..

The question is why exactly. Unfortunately the boot loader output isn't available through logs, only through a serial console.
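
The check on the kernel command line quoted above can be scripted on a running system. This is only a sketch: the two function names are hypothetical, and the only assumptions are that /proc/cmdline holds the running kernel's command line and that the two parameters appear exactly as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a kernel command line.
# Each takes the command line as $1 and falls back to /proc/cmdline.

# Succeed if the boot loader asked systemd to treat this boot as first boot.
first_boot_requested() {
    cmdline=${1:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" systemd.condition-first-boot=true "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if systemd.machine_id= is present but has no value.
machine_id_empty() {
    cmdline=${1:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" systemd.machine_id= "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Running first_boot_requested && echo "first boot mode" without arguments reads the live command line; passing a string instead makes it possible to check a line copied from a log.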

Can you run the following command on the OS shell?

fw_printenv

Also can you check the SHA256 of the boot script:

sha256sum /mnt/boot/boot.scr

If anyone has access to the serial console of that board which is suffering from the problem, the capture of the boot loader phase would be helpful.

regan-a commented 11 months ago

My apologies, I misunderstood the request. I've attached the outputs of the two commands, plus a serial dump of the boot.

fw_printenv.txt boot.scr.sha256.txt serial boot.txt

Victorsueca commented 11 months ago

I'm having this issue too, OS installed on SD Card and data on NVMe. Everything was working perfectly in 10.5, but as soon as I installed 11.0 it didn't even come back from the mandatory reboot. Now every time I reboot I have to do e2label /dev/nvme0n1p1 hassos-data and systemctl start hassos-supervisor. So I guess my headless setup just turned into a headache.
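
The per-boot relabel described here can be wrapped in a small dry-run helper. This is a sketch under assumptions: relabel_hint is a made-up name, it only prints the e2label command instead of running it, and it assumes the labels show up as symlinks in /dev/disk/by-label as on a typical udev-based system.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the e2label command that would restore
# the data label, without changing anything on disk.
#   $1: directory holding the by-label symlinks; defaults to /dev/disk/by-label
relabel_hint() {
    dir=${1:-/dev/disk/by-label}
    if [ -e "$dir/hassos-data" ] || [ -L "$dir/hassos-data" ]; then
        echo "hassos-data is present, nothing to do"
        return 0
    fi
    if [ -L "$dir/hassos-data-dis" ]; then
        # Resolve the symlink to the underlying block device path.
        dev=$(readlink -f "$dir/hassos-data-dis")
        echo "e2label $dev hassos-data"
        return 0
    fi
    echo "no hassos-data or hassos-data-dis label found" >&2
    return 1
}
```

After reviewing the printed command and running it by hand, the supervisor still has to be started with systemctl start hassos-supervisor, as described in the comment above.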

agners commented 11 months ago

fw_printenv.txt

Hm, it seems the system did write a new machine id, and at least the running OS is able to read the U-Boot environment :thinking:

boot.scr.sha256.txt

This is the correct hash of the boot script in HAOS 11.0, so your boot script doesn't seem corrupted or anything.

serial boot.txt

This does show the problem really:

** Booting bootflow 'mmc@fe2b0000.bootdev.part_2' with script
loading env...
Card did not respond to voltage select! : -110
## Error: bad CRC, import failed
0 bytes read in 1 ms (0 Bytes/s)
0 bytes read in 1 ms (0 Bytes/s)
Loading standard device tree rk3568-odroid-m1.dtb
116634 bytes read in 13 ms (8.6 MiB/s)
Working FDT set to a100000
Trying to boot slot A, 2 attempts remaining. Loading kernel ...
29901312 bytes read in 1564 ms (18.2 MiB/s)
storing env...
Card did not respond to voltage select! : -110
Starting kernel
Moving Image from 0x2080000 to 0x2200000, end=3f30000
## Flattened Device Tree blob at 0a100000
   Booting using the fdt blob at 0xa100000
Working FDT set to a100000
   Loading Device Tree to 00000000ede61000, end 00000000edee5fff ... OK
Working FDT set to ede61000

It seems that the U-Boot bootloader is not able to read the environment. :cry:
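
For anyone comparing their own serial capture against the one above, the two tell-tale messages can be searched for with grep. This is a minimal sketch: uboot_env_errors is a hypothetical name, and the patterns are simply the two error strings from the log above, so other U-Boot versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) the lines of a serial capture
# that indicate U-Boot could not load its environment.
#   $1: path to the captured serial log
# Exit status is 0 if at least one such line was found, non-zero otherwise.
uboot_env_errors() {
    grep -n -E 'bad CRC, import failed|Card did not respond to voltage select' "$1"
}
```

For example, uboot_env_errors "serial boot.txt" would list the matching lines of a capture saved under that name.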

JirikP commented 11 months ago

I am having this exact problem. Upgraded to HAOS 11.0 and it never came back. I had a backup on Google Drive, so I installed HAOS 11.0 on the SD card and restored the backup; running from the SD card, everything is fine. I just cannot move the data disk to the SSD (SATA). If I try to move it, it gets stuck waiting for the CLI to get ready.

dbpickles commented 11 months ago

Exact same problem too.

cryptoluks commented 11 months ago

For me, the Web UI Upgrade process from 10.5 to 11 did not result in a reboot (tried it several times). However, after resetting the device days later I had the same issues as described here.

On each boot now, the M.2 SSD partition is disabled and booting is only possible when re-labeling it and restarting the docker containers.

However, it seems that I am still on 10.5 according to the web interface.

EddyBurnett commented 11 months ago

Sadly same issue also. Switched for now to a VM version in Proxmox and restored a backup. Waiting for a future fix or I will stay on the VM and use the Odroid for other purpose.

peterf9 commented 11 months ago

same issue here... cannot move data to NVMe.

gutarin commented 11 months ago

I ran into the same issue on my Odroid-M1

matjahs commented 11 months ago

For me, the Web UI Upgrade process from 10.5 to 11 did not result in a reboot (tried it several times). However, after resetting the device days later I had the same issues as described here.

On each boot now, the M.2 SSD partition is disabled and booting is only possible when re-labeling it and restarting the docker containers.

Same thing here. After each system reboot, I will have a partition labeled hassos-data-old on my SD card and one labeled hassos-data-dis on the NVMe SSD. If I then relabel the disabled one to hassos-data, it continues booting as I would normally expect.

The Settings > About page reports the following:

Home Assistant 2023.10.3
Supervisor 2023.10.0
Operating System 11.0
Frontend 20231005.0 - latest

cryptoluks commented 11 months ago

Could anybody with a build setup provide a boot.scr with a commented-out first-boot check? So we could at least use the m.2 for data without systemd invoking the rename script, right? Correct, @agners? :-)

Thank you very much.

agners commented 11 months ago

@cryptoluks that would be a possible work around, I should be able to create such a script.

Are you using an SD card?

Ki-csen commented 11 months ago

Same issue here. I just didn't notice any problem after the upgrade until a hard reboot caused by a power outage. Since then it gets stuck waiting for the CLI to get ready forever. Mine boots from SD and the data is on an SSD.

cryptoluks commented 11 months ago

@agners I just created a new boot.scr using mkimage with your change from https://github.com/home-assistant/operating-system/pull/2856.

I can now observe on the hypervisor:

# journalctl -b | grep -i first
Oct 23 18:48:41 homeassistant systemd[1]: HAOS data disk detach was skipped because of an unmet condition check (ConditionFirstBoot=yes).
Oct 23 18:48:43 homeassistant systemd[1]: First Boot Complete was skipped because of an unmet condition check (ConditionFirstBoot=yes).
# dmesg | grep -i "kernel command line:"
[    0.000000] Kernel command line: zram.enabled=1 zram.num_devices=3 systemd.machine_id=[snip] fsck.repair=yes  root=PARTUUID=[snip] rootfstype=squashfs ro rootwait rauc.slot=A

I did not yet move to the m.2 again for data, but it seems to work now after adding mmc dev ${devnum}. The machine_id in the kernel command line was populated and therefore the systemd units were skipped. Awesome!

Here are my steps:

apt-get -y install u-boot-tools
mkimage -T script -A arm64 -C none -n 'Fixed Boot' -d uboot-boot.ush boot.scr

The new binary boot.scr of uboot-boot.ush has to be placed in /mnt/boot/boot.scr.

I also created a base64 encoded version from this here for the adventurous. Simply decode it with curl -s https://gist.githubusercontent.com/cryptoluks/82e2b1c3105c85d91e1c225b8938eca0/raw/0fcfeb5afcf14b8c55fc9ede796d321a40bdea00/uboot-odroid-m1-ha11-fixed.scr.base64 | base64 -d > boot.scr.

# md5sum /mnt/boot/boot.scr
bc45f05698d4fefecbba711da0a71052  /mnt/boot/boot.scr

Update: Works now also with data on the m.2 SSD without any issues.
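
To confirm that the file in /mnt/boot is the rebuilt script and not the original one, its digest can be compared with the value shown above. This is a small sketch: md5_matches is a made-up helper name, and it assumes md5sum is available on the system, as in the transcript above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a file's MD5 digest equals the expected one.
#   $1: file to check
#   $2: expected digest as lowercase hex
md5_matches() {
    # md5sum prints "<digest>  <file>"; keep only the digest.
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}
```

With the digest from this comment, the check would read md5_matches /mnt/boot/boot.scr bc45f05698d4fefecbba711da0a71052 && echo "patched boot.scr in place".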

agners commented 11 months ago

This will be addressed with HAOS 11.1. You can test it already by updating to 11.1.rc1 on the beta channel.

Ki-csen commented 11 months ago

So, I have managed to replace boot.scr and upgrade to HAOS 11.0, but sda1 is still not auto-mounted. Any idea?

cryptoluks commented 11 months ago

So, I have managed to replace boot.scr and upgrade to HAOS 11.0, but sda1 is still not auto-mounted. Any idea?

Was your disk renamed before to the disabled label? If yes, you probably have to manually rename it back.

Or try the latest Beta with the fixes, maybe this works better for you.

Edit: Ah, if you replaced it and then upgraded, I think the boot.scr was simply replaced with the bugged one from 11.0.

Ki-csen commented 11 months ago

Edit: Ah, if you replaced it and then upgraded, I think the boot.scr was simply replaced with the bugged one from 11.0.

No, the md5sum looks right for the new boot.scr.

Yes, it had the label hassos-data-dis. In the end I solved the issue with: e2label /dev/sda1 hassos-data

qwazerty commented 11 months ago

Tested HAOS 11.1.rc1; I can confirm that it fixed the issue and upgraded successfully on an ODROID-M1 with an SSD on NVMe.

Before the upgrade to 11.1.rc1, when I tried to upgrade to 11.0, I had to change the label after each reboot:

System information
  OS Version:               Home Assistant OS 10.5
  Home Assistant Core:      2023.10.3

~ # ls -l /dev/disk/by-label/
total 0
lrwxrwxrwx    1 root     root            15 Oct 26 01:09 hassos-boot -> ../../mmcblk1p2
lrwxrwxrwx    1 root     root            15 Oct 26 01:12 hassos-data-dis -> ../../nvme0n1p1
lrwxrwxrwx    1 root     root            15 Oct 26 01:09 hassos-data-old -> ../../mmcblk1p9
lrwxrwxrwx    1 root     root            15 Oct 26 01:09 hassos-overlay -> ../../mmcblk1p8
~ # e2label /dev/nvme0n1p1 hassos-data
~ # ls -l /dev/disk/by-label/
total 0
lrwxrwxrwx    1 root     root            15 Oct 26 01:09 hassos-boot -> ../../mmcblk1p2
lrwxrwxrwx    1 root     root            15 Oct 26 01:12 hassos-data -> ../../nvme0n1p1
lrwxrwxrwx    1 root     root            15 Oct 26 01:09 hassos-data-old -> ../../mmcblk1p9
lrwxrwxrwx    1 root     root            15 Oct 26 01:09 hassos-overlay -> ../../mmcblk1p8

After the upgrade

System information
  OS Version:               Home Assistant OS 11.1.rc1
  Home Assistant Core:      2023.10.3

~ # ls -l /dev/disk/by-label/
total 0
lrwxrwxrwx    1 root     root            15 Oct 26 01:29 hassos-boot -> ../../mmcblk1p2
lrwxrwxrwx    1 root     root            15 Oct 26 01:29 hassos-data -> ../../nvme0n1p1
lrwxrwxrwx    1 root     root            15 Oct 26 01:29 hassos-data-old -> ../../mmcblk1p9
lrwxrwxrwx    1 root     root            15 Oct 26 01:29 hassos-overlay -> ../../mmcblk1p8

arunderwood commented 10 months ago

In case it helps anyone else, here are the (unoptimized) steps I took to fix my install. I strongly suspect there are some unnecessary steps in here, but this fixed mine, so I can't easily go back and try to shorten them. Normally my setup is headless, so I had to hook up a keyboard, monitor, and Ethernet:

After the reboot everything worked like a charm.

rklasen commented 10 months ago

Just to make sure, does that already mean HassOS can be booted from nvme directly, without an sd card or an emmc module?

The official docs still say it's impossible, but this repo seems to have gotten it working.