home-assistant / operating-system


Upgrade to OS 5.9 on OVA caused corrupted EFI folder #1125

Closed: mj-sakellaropoulos closed this issue 2 years ago

mj-sakellaropoulos commented 3 years ago

Just updated to OS 5.9 via the UI; the VM no longer boots. The VM is on Proxmox 6.2, OVA, UEFI OVMF.

Upon investigation in Ubuntu, garbage data was found in the EFI folder: (screenshot)

I will dd the boot partition from the image on the release page and report back; I suspect the update process is broken somehow?

( as first reported here: https://github.com/whiskerz007/proxmox_hassos_install/issues/96 )

mj-sakellaropoulos commented 3 years ago

After repairing the boot partition, the EFI file system seemed intact, but I am stuck at the barebox bootloader with 100% CPU usage: (screenshot)

Update: booting system1 manually via the GRUB command line reveals the system is completely broken; the update never completed (os-release still says 5.8). docker, homeassistant, networkmanager, and other services do not start.

agners commented 3 years ago

There are two system partitions (A/B update system); you might have booted the old 5.8 release.

Did you by chance have to reset/force power off the VM? Can you reproduce the issue? Yours is not the only report along those lines, see #1092. I use libvirt (which uses KVM underneath) and did a bunch of updates using the OVA; I wasn't able to reproduce this issue.

agners commented 3 years ago

Which version did you upgrade from?

mj-sakellaropoulos commented 3 years ago

5.8 to 5.9 via UI

I booted system0 and system1 via GRUB; let me know if there are other procedures to follow for booting specific versions.

The VM was not forced off by me; it did the update, corrupted the EFI, and rebooted. When I looked at the VNC console, it was saying it could not find a boot entry.

mj-sakellaropoulos commented 3 years ago

Just to clarify, from my perspective there are the following multiple failures:

If there are any log files I can provide, let me know.

I will try to repro this issue to extract some more data.

I should also mention that the initial EFI corruption broke OVMF disk detection on Proxmox 6.2; the disk had to be migrated to a new VM to be detected, even with the repaired EFI.

~~I have updated Proxmox to the latest version (6.3) and installed 5.8 by importing the qcow2 into Proxmox; the barebox bootloader is still broken.~~

mj-sakellaropoulos commented 3 years ago

MAJOR UPDATE:

The ONLY issue was the EFI corruption, although the cause remains unknown. Some hints: the directory listing of the corrupted EFI contains strings like "Attempt 7", which are found in NvVars and are also part of the barebox boot process (?).

HassOS EFI Recovery Guide

If your EFI is corrupted (you get a message like "cannot find QEMU HARDDISK", etc.), this procedure may help:

The dangerous step is next: double-check the partition sizes and BACK UP your disk FIRST!

```
fdisk -l
dd if=/dev/nbd0p1 of=/dev/sda1
```

- Now mount the repaired /dev/sda1 to verify its contents:

```
mkdir /mnt/hass-boot
mount /dev/sda1 /mnt/hass-boot
ls -al /mnt/hass-boot/EFI
```

- If everything looks good, unmount and shut down:

```
umount /mnt/hass-boot
qemu-nbd --disconnect /dev/nbd0
shutdown now
```


- IMPORTANT: Remove the ISO and IDE DVD from the VM before rebooting
- The VM should boot normally; you may need to `systemctl start docker` to get the hassio CLI working
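
For reference, here is a consolidated sketch of the whole procedure as I understand it. It assumes the broken VM is booted from an Ubuntu live ISO, that the official release qcow2 from the HAOS release page has been downloaded into the live session (the file name below is a placeholder), and that the VM's own disk shows up as /dev/sda with the ESP as /dev/sda1; device names will differ on your system.

```
# Run inside the Ubuntu live session attached to the broken VM.
# ASSUMPTIONS: VM disk = /dev/sda (ESP = /dev/sda1), downloaded image = haos_ova.qcow2
sudo apt-get update && sudo apt-get install -y qemu-utils

sudo modprobe nbd max_part=8                        # load the network block device driver
sudo qemu-nbd --connect /dev/nbd0 haos_ova.qcow2    # expose the fresh image as /dev/nbd0

sudo fdisk -l /dev/nbd0 /dev/sda                    # compare the partition layouts before copying!
sudo dd if=/dev/nbd0p1 of=/dev/sda1 bs=1M           # overwrite the corrupted ESP with the fresh one

# Verify the repaired ESP
sudo mkdir -p /mnt/hass-boot
sudo mount /dev/sda1 /mnt/hass-boot
ls -al /mnt/hass-boot/EFI
sudo umount /mnt/hass-boot

# Clean up and shut down
sudo qemu-nbd --disconnect /dev/nbd0
sudo shutdown now
```
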
agners commented 3 years ago

Thanks for posting the update and instructions on how to recover the boot partition!

I changed the code to use the sync mount option when mounting the boot partition, hoping things get written out immediately after the update (see #1101). Although, since you did not force off/force reboot, there must have been something else causing the corruption. Maybe it is some sort of kernel bug. If it is that, then I hope the latest Linux kernel stable update (part of 5.9) fixes it. But if you have hints/ideas what could have caused the corruption in the first place (or if you have a process to reproduce it), I would be very interested to hear.
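
(For illustration only, not the exact change in #1101: mounting the ESP with the sync option amounts to something like the following, using the hassos-boot label and /mnt/boot mount point mentioned later in this thread.)

```
# illustrative only - mount the boot/EFI partition with synchronous writes
mount -o sync LABEL=hassos-boot /mnt/boot
```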

mj-sakellaropoulos commented 3 years ago

I am unable to reproduce the issue. The only thing I could suggest is to implement a sanity check after writing to the boot partition: look for the bootx64.efi file and fail the update if the write did not stick (?), something like the sketch below.
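
Purely a sketch of that idea (the partition label and loader path are assumptions based on what is mentioned elsewhere in this thread, not something HAOS ships today):

```
# hypothetical post-update sanity check
mkdir -p /tmp/boot-check
mount -o ro LABEL=hassos-boot /tmp/boot-check
if [ ! -f /tmp/boot-check/EFI/BOOT/bootx64.efi ]; then
    echo "boot partition verification failed, aborting reboot" >&2
    umount /tmp/boot-check
    exit 1
fi
umount /tmp/boot-check
```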

I did a clean install of 5.8 and updated via the CLI to 5.9; it worked normally on the latest Proxmox. Very strange. Could it be some incompatibility between specific old versions of Proxmox and barebox, maybe?

tumd commented 3 years ago

Same issue here running on a libvirt VM. I seem to have been able to restore the boot partition with help from @mj-sakellaropoulos's informative post.

markkamp commented 3 years ago

Same thing happened to me running HassOS as a VM on an Unraid server (6.8.3). The fix from @mj-sakellaropoulos worked like a charm (thank you!), so my hass-boot partition was also corrupted.

When reverting to a backup image, I did end up reproducing the problem. This was when I updated OS 3.12 to 5.9. All seemed fine while updating, but after power cycling the VM, nothing; it wouldn't boot any more. So it could be, as @mj-sakellaropoulos suggested, a compatibility issue between older versions?

lexathon commented 3 years ago

I had the same issue upgrading from 4.17 running on an ESXi VM. I didn't bother recovering the EFI and simply rolled my hard drive image back to the backup.

lexathon commented 3 years ago

Same again on 5.10 (as iopenguin mentioned already). Interestingly, this time the machine booted fine after the update and was stable, but after a power cut it failed to come back online. I guess the EFI wasn't needed for a soft reboot after the update. I used @mj-sakellaropoulos's workaround (with 5.10) to recover the EFI on this occasion, as I'd made some changes I wanted to keep - thanks for that, by the way.

RubenKelevra commented 3 years ago

I don't think that's an fsck issue, but a failure to properly sync the data from the write cache of the VM to the storage while the VM shuts down.

I've seen this before with libvirt and fixed it in my setups by running a sync as root on the machines after anything has been updated.

And by waiting a while with sleep after an update has completed.

Else the system partition might look like this:

(screenshots of the corrupted partition)

ahknight commented 3 years ago

I'm running a VM on Proxmox and this presents as clearing the MBR but leaving GPT intact. From a Proxmox shell I can use fdisk on the zvol that hosts the VM and it will rebuild the MBR, which lets the VM boot. But I have to do this after every HA OS upgrade.

GJT commented 3 years ago

Having the same issue under Proxmox. Happens every other week. @ahknight, what command exactly do you run to fix it?

ahknight commented 3 years ago
```
$ gdisk /dev/zd##
```

Then just write out the MBR again and try again.
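
For anyone who needs the exact steps, the session looks roughly like this (the zvol name below is only an example; pick the device that backs your VM disk):

```
# example device name only - find your zvol with: ls -l /dev/zvol/<pool>/
gdisk /dev/zd0
# gdisk's partition table scan will report something like "MBR: not present" / "GPT: present"
# at the "Command (? for help):" prompt, type:
#   w   (write table to disk - this regenerates the protective MBR from the GPT)
#   y   (confirm)
```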

agners commented 3 years ago

@RubenKelevra

I don't think that's an fsck issue, but a failure to properly sync the data from the write cache of the VM to the storage while the VM shuts down.

That is what I thought as well, but we also see it on Intel NUCs. Also, a proper sync should be done on reboot anyway, and at least some people claimed they did a proper reboot but still experienced the issue...

I've seen this before with libvirt and fixed it in my setups by running a sync as root on the machines after anything has been updated.

This is essentially what https://github.com/home-assistant/operating-system/pull/1101 does, by mounting the whole partition sync. That went into 5.9, but the issue still appeared afterwards.

@ahknight just to clarify: the images use UEFI to boot; there is no "MBR". The MBR is a DOS partition table/BIOS concept. In UEFI, there is just a FAT partition called the EFI System Partition (ESP), which has to have the right files in the right place. The UEFI BIOS then picks up the boot loader from there. No "magic" master boot record (MBR) is needed. I guess you are referring to the ESP here.

@GJT to fix a qcow2 image, you can follow the instructions in https://github.com/home-assistant/operating-system/issues/1125#issuecomment-750457231.

RubenKelevra commented 3 years ago

@agners interesting.

We might experience two separate issues here:

Consumer-grade SSDs have a write cache which is not protected by a battery backup.

If the shutdown process is (basically) too fast, we might write this to the write cache and cut the power to the device before the SSD has had time to flush it to permanent storage.

There are some Intel SSDs which have a power backup built-in to avoid this.

They call this "enhanced power-loss data protection".

It's probably pretty racy in most setups, so we might have this issue everywhere, but it only shows symptoms with a very small chance.

We could debug this by writing a file unsynced when the shutdown is initiated. If it's gone when we start up, we know something fishy is going on.

If it's still there, we delete it.

Anyway, I think we could mitigate this issue if we use a hook on shutdown after the FSes are unmounted. If we just add a (say, 5 second) sleep afterwards, even the slowest SSD should have plenty of time to write everything from its write cache to the disk.
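
A rough sketch of the canary idea (the path and hook points are assumptions for illustration, not anything HAOS ships):

```
# hypothetical shutdown hook: drop an unsynced canary on the boot partition
echo "canary $(date +%s)" > /mnt/boot/.shutdown-canary
# deliberately no sync/fsync here - that is the point of the test

# hypothetical boot-time check: if the canary is missing, the write cache was lost
# (on the very first boot it will trivially be missing)
if [ ! -f /mnt/boot/.shutdown-canary ]; then
    echo "shutdown canary missing - writes were lost on last shutdown" | logger -t boot-check
else
    rm /mnt/boot/.shutdown-canary
fi
```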

ahknight commented 3 years ago

@agners I know how the startup process is supposed to work. However, I'm explaining what I did. Proxmox got stuck in an EFI boot prompt loop until I SSHed into the PX host, ran gdisk on the zvol, read the error message that said the MBR was missing, wrote out the MBR that it recovered from the GPT tables, and then started the VM again. Suddenly it worked.

We can argue about what it should do forever, but that did fix it. Repeatedly.

GJT commented 3 years ago

For me this issue occurs every 2-4 weeks: the system becomes unresponsive out of the blue and I'm greeted with the corrupt EFI on a reset. I usually roll back to my working snapshot (OS 5.12), which I can reboot as much as I like. But after some time it gets corrupted again without any updates or changes to the system.

It even occurs on different Proxmox cluster nodes that use different storage systems.

hcooper commented 3 years ago

HassOS EFI Recovery Guide

Thanks @mj-sakellaropoulos, your recovery instructions worked well, and I managed to recover from a botched upgrade.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

RubenKelevra commented 3 years ago

This issue isn't stale.

mj-sakellaropoulos commented 3 years ago

Hah. The infamous stale bot is at it again.

I don't think this is solved and I don't think we know what causes it...

Reports have slowed, though; is this still happening frequently?

agners commented 3 years ago

Please report if you experience this with updates from 6.0.

Eeems commented 3 years ago

I've just experienced this, and I last upgraded from 6.0 to 6.1. I'm not convinced that an upgrade is causing this for me, as I successfully rebooted the VM multiple times after the upgrade with no issues. What I did notice yesterday is that the filesystem had gone read-only and required a reboot. After the reboot it seemed to run fine, but when I woke up this morning the VM was powered off and required rebuilding the partition table in order to boot again.

This seems to be a semi-weekly occurrence for me.

github-actions[bot] commented 3 years ago

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

meichthys commented 2 years ago

I'm late to the party, but I still experience this regularly when rebooting the host machine. My temporary fix is:

```
## Make sure the VM is disabled:
ha-manager set vm:<VMID> --state disabled
## Open gdisk to modify the disk partition map
gdisk /dev/zvol/rpool/vm-<VMID>-disk-<DISK#>
## Once gdisk opens, just use the `w` command to re-write the partition map
## Re-enable (start) the VM to verify that it boots from the disk
ha-manager set vm:<VMID> --state enabled
```

agners commented 2 years ago

Did this happen with a recent OS version?

meichthys commented 2 years ago

Did this happen with a recent OS version?

Yes: (screenshot)

For me it happened somewhat regularly after the host machine was rebooted, but last night I noticed it happened even without a host machine reboot.

I did notice this: (screenshot showing /dev/root at 100% usage)

agners commented 2 years ago

100% for /dev/root is normal since we use a read-only squashfs as root file system.

github-actions[bot] commented 2 years ago

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

RubenKelevra commented 2 years ago

@agners wrote:

This is essentially what #1101 does, by mounting the whole partition sync. That went into 5.9, but the issue still appeared afterwards.

Yeah. The issue is outside of the VM. The write request gets cached for "performance reasons", including the fsync. When the machine is turned off, the cache doesn't get flushed but is instead just discarded.

I've seen this a lot of times with KVM; I'm not sure if that's a Linux kernel bug or a bug somewhere in the disk emulation layer itself.

The last time I saw this was about 3 years ago. I just keep the machines running for several minutes before rebooting, which fixed this for me.

agners commented 2 years ago

The issue is outside of the VM. The write request gets cached for "performance reasons" including the fsync.

"gets cached" by whom?

If it's hardware, then it's broken hardware. The OS needs to be able to rely on flushes reaching the underlying non-volatile storage; otherwise the whole house of cards falls apart (journaling file systems won't be able to implement consistency guarantees, databases' ACID guarantees break).

If it's the VM's virtual disk driver, then that VM disk driver is buggy or reckless. Granted, you might want such an option so you can trade reliability for performance if you really don't need any reliability (e.g. for testing). But it shouldn't be the default, and it should not be configured for Home Assistant OS :)

KVM/QEMU has quite a few tunables in that domain. SUSE seems to have a nice write-up about the options. I highly doubt, though, that "non-safe" options are used in Proxmox by default...
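
To make that concrete, the knob in question is the per-disk cache mode; the VM ID and volume name below are placeholders, and "unsafe" is the one mode that explicitly ignores flushes:

```
# Proxmox: set an explicit, flush-honouring cache mode on the VM disk
# (VM ID and volume name are placeholders)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=none
# the same knob in plain QEMU is the -drive ...,cache=none|writethrough|directsync|writeback|unsafe option
```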

RubenKelevra commented 2 years ago

"gets cached" by whom?

So my assumption is that the virtual hard drive does write-back caching for performance reasons and does not fully flush the cache before the virtual hard drive is destroyed.

Linux also writes out something like

```
sd 0:0:0:0: [sda] No Caching mode page found
sd 0:0:0:0: [sda] Assuming drive cache: write through
```

on certain hardware, which is just not true for SD cards, which do some kind of write-back caching.

KVM/QEMU has quite a few tunables in that domain. SUSE seems to have a nice write-up about the options. I highly doubt, though, that "non-safe" options are used in Proxmox by default...

Yeah, not by intention but because of a bug somewhere. Making sure that writes are atomic without data journaling / copy-on-write is kinda hard.

I stopped using ext4 on LVM for this reason and switched to ZFS, and the issue went away.

RubenKelevra commented 2 years ago

Btw, ext4 has a mount option to fix some application issues: auto_da_alloc. But I don't think this will cover block-based replacements.

```
Many broken applications don't use fsync() when replacing existing files via patterns such as

    fd = open("foo.new")/write(fd,...)/close(fd)/ rename("foo.new", "foo")

or worse yet

    fd = open("foo", O_TRUNC)/write(fd,...)/close(fd).

If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and
force that any delayed allocation blocks are allocated such that at the next journal commit, in the default
data=ordered mode, the data blocks of the new file are forced to disk before the rename() operation is
committed. This provides roughly the same level of guarantees as ext3, and avoids the "zero-length" problem
that can happen when a system crashes before the delayed allocation blocks are forced to disk.
```
agners commented 2 years ago

So my assumption is that the virtual hard drive does write-back caching for performance reasons and does not fully flush the cache before the virtual hard drive is destroyed.

Yeah, that would explain it, but it would be a big fat bug IMHO. I mean, just throwing away caches when the VM gets destroyed seems a major oversight. I doubt that this is what is going on.

on certain hardware, which is just not true for SD cards, which do some kind of write-back caching.

This issue is about virtual machines though. Also, SD cards are exposed as mmcblk. I don't think that the kernel makes such assumptions for those types of devices.

I stopped using ext4 on LVM for this reason and switched to ZFS, and the issue went away.

Keep in mind that the boot partition is FAT. Also, it is mounted sync now, so writes should go out immediately these days.

With OS 8.x we switched to the GRUB2 boot loader and to the latest Linux kernel; let's see if reports still appear with that combination.

Sesshoumaru-sama commented 2 years ago

I tried to update from HassOS 8.2 to 8.4 today (Proxmox VM). The system did not boot after that, landing in the EFI shell. I had to restore a previously made snapshot and now it's up again. This issue is really worrisome and persistent.

meichthys commented 2 years ago

I haven't noticed this issue recently, but the following has always worked when falling into the EFI shell: https://github.com/home-assistant/operating-system/issues/1125#issuecomment-990611098

Sesshoumaru-sama commented 2 years ago

I haven't noticed this issue recently, but the following has always worked when falling into the EFI shell: #1125 (comment)

I have no folder /dev/zvol/rpool/. Is that just the path to the VM disks? I have them on LVM, so would it be this: /dev/pve/vm-100-disk-0 (mapped to ../dm-10)?

Do I need to do it for the data disk or also for the EFI disk (disk-1)?

meichthys commented 2 years ago

I've only ever done it on the data disk, but my disk was ZFS. Be sure to take a backup of the VM before trying it on your LVM disk.

Sesshoumaru-sama commented 2 years ago

Odd that nobody else with LVM had this issue and could give a hint. I will try to restore the VM on another Proxmox instance and see what happens - really frustrating to have such low-level issues that stuff does not boot...

GJT commented 2 years ago

Just had the issue again, after a long time without problems, when upgrading from 8.2 to 8.4. Proxmox/ZFS.

agners commented 2 years ago

We really don't do anything special with that partition other than writing some files to it right before rebooting. Rebooting should properly unmount the disk, which should cause all buffers to be properly flushed. Can you check whether the file system checks were all good before the upgrade, e.g. using the following commands in the console/HAOS SSH shell:

```
journalctl -u "systemd-fsck@dev-disk-by\x2dlabel-hassos\x2dboot.service"
journalctl -u mnt-boot.mount
```

GJT commented 2 years ago

Unfortunately both only contain entries after the upgrade

(screenshot)

Going to check next time before an upgrade.

sylarevan commented 1 year ago

Hi there. I must report that this bug still seems to be present. After a host (Proxmox 7.3-6) reboot, my Home Assistant VM was not able to boot anymore. I get the message: `BdsDxe: failed to load Boot0001 "UEFI QEMU HARDDISK QM00005" from PciRoot(0x0)/Pci(0x7,0x0)/Sata(0x0,0xFFFF,0x0): Not Found`

The solution, as mentioned there, was to check the disk partition table with `gdisk /dev/pve/vm-101-disk-1` and then simply `w` to write it back out.

After that, the VM was able to boot again.

agners commented 1 year ago

After a host (proxmox 7.3-6) reboot

Was that a graceful reboot or a power cut?

If the former, can you reproduce this with each reboot?

sylarevan commented 1 year ago

This was a graceful host reboot. I have not rebooted since (I'm a bit afraid of not being able to properly recover the VM this time), but I will test. FYI, this is the first time I have had this problem in about 3 years and many, many HA updates.