home-assistant / operating-system

:beginner: Home Assistant Operating System
Apache License 2.0

HaOS fails to boot with Network Share #3517

Closed tybo611 closed 1 month ago

tybo611 commented 1 month ago

Describe the issue you are experiencing

I upgraded from 12.4 to the beta, and once installed it fails to boot. It only gives the error "system not ready state: setup", along with multiple errors for the network share that I had set up for backups. I managed to recover from an old backup and remove the network share, after which the upgrade succeeds. After adding the network share back and restarting, all the errors return: I cannot start Supervisor and cannot see any of the major logs on the CLI; the only error is "system not ready state: setup". I'm aware it could be something wrong with my setup specifically, but I would like to figure out which logs I can access to see what's going on.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

13.0RC2

Did the problem occur after upgrading the Operating System?

Yes

Hardware details

Proxmox, VM

Steps to reproduce the issue

1. Upgrade from 12.4 to 13 with network share attached

Anything in the Supervisor logs that might be useful for us?

Can't access until downgrade and then nothing useful shows up

Anything in the Host logs that might be useful for us?

Can't access until downgrade and then nothing useful shows up

System information

No response

Additional information

No response

sairon commented 1 month ago

Can you please try reproducing it again and sharing logs from Supervisor and the host? Once the system boots to CLI, type login and run the following commands:

ha host logs -n1000 -b0 >> /mnt/data/supervisor/homeassistant/haos-3517.txt
ha su logs -n1000 -b0 >> /mnt/data/supervisor/homeassistant/haos-3517.txt

Once you boot back to the working system, your config folder (i.e. the folder that contains configuration.yaml) should contain the file haos-3517.txt. Please check it for errors or share it here as a whole.
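A quick way to skim the collected file for obvious problems is a case-insensitive keyword search. This is only a minimal sketch: `scan_log` is a hypothetical helper name, and the keyword list is an assumption rather than an official checklist.

```shell
# Hypothetical helper: print numbered lines that mention common trouble keywords.
# The keyword list is an assumption; adjust it for your setup.
scan_log() {
  grep -inE 'error|fail|cifs|nfs|timeout' "$1"
}

# Example (path as seen from the host shell):
# scan_log /mnt/data/supervisor/homeassistant/haos-3517.txt
```

Lines that match nothing in the list can still matter, so sharing the whole file remains the safer option.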

Alternatively, you can switch between the two OS versions without running the update by using ha os boot-slot other (just check with ha os info that the boot slots contain the version you want).
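The "check first, then switch" advice could be wrapped in a small helper like the one below. It is a sketch only: `switch_slot` is a hypothetical name, the confirmation prompt is my addition, and it assumes `ha os info` works on the currently booted slot.

```shell
# Hypothetical helper: show slot information, then switch only after confirmation.
switch_slot() {
  # Print the boot slots and their versions; stop if the CLI cannot answer.
  ha os info || return 1
  printf 'Boot the other slot now? [y/N] '
  read -r answer
  if [ "$answer" = "y" ]; then
    ha os boot-slot other
  else
    echo "Left the boot slot unchanged."
  fi
}

# Example: switch_slot
```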

tybo611 commented 1 month ago

I don't think there's anything useful. Do I need to change the log level? It's now set up so I can boot back and forth between the two releases. It's also a non-production instance, so I'm happy to test anything I can to help.

haos-3517.txt ![Screenshot 2024-08-12 191325](https://github.com/user-attachments/assets/5b991c8b-b9aa-4c70-85e1-8fafe1c70835) Screenshot 2024-08-12 191531 Screenshot 2024-08-12 191726

tybo611 commented 1 month ago

Screenshot 2024-08-12 191325

This is the message that stays up when starting 13.0 and eventually times out. It won't allow any HA commands, but I can log in for shell commands.

sairon commented 1 month ago

Sorry, I hadn't realized that without Supervisor fully started you wouldn't be able to run ha logs .... Please run the following instead (after the login command):

rm -f /mnt/data/supervisor/homeassistant/haos-3517.txt
dmesg >> /mnt/data/supervisor/homeassistant/haos-3517.txt
docker logs -n1000 hassio_supervisor >> /mnt/data/supervisor/homeassistant/haos-3517.txt

Also, can you ping 192.168.20.125?
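The two sets of commands suggested so far can be combined into one helper that tries the Supervisor-backed logs first and falls back to `dmesg` and `docker logs`. This is a rough sketch: `collect_logs` is a hypothetical name, and it assumes the `ha` CLI exits with a non-zero status (rather than hanging) when Supervisor is not ready.

```shell
# Hypothetical helper: write diagnostics to a file, with a fallback path.
# The default output path is the one used in this thread.
collect_logs() {
  out=${1:-/mnt/data/supervisor/homeassistant/haos-3517.txt}
  rm -f "$out"
  # Try the Supervisor-backed commands first.
  if ha host logs -n1000 -b0 >> "$out" 2>&1 &&
     ha su logs -n1000 -b0 >> "$out" 2>&1; then
    return 0
  fi
  # Assumption: a failing ha command means Supervisor is not up,
  # so read the kernel log and the Supervisor container log directly.
  dmesg >> "$out" 2>&1
  docker logs -n1000 hassio_supervisor >> "$out" 2>&1
}

# Example: collect_logs
```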

tybo611 commented 1 month ago

ping shows the client is alive. haos-3517.txt

Attached are the logs from those commands.

mazzy89 commented 1 month ago

Experiencing the same issue. Unfortunately, the Network Share connected to my Synology NAS broke:

× mnt-data-supervisor-mounts-smb_backups.mount - Supervisor cifs mount: smb_backups
     Loaded: loaded (/run/systemd/transient/mnt-data-supervisor-mounts-smb_backups.mount; transient)
  Transient: yes
     Active: failed (Result: exit-code) since Wed 2024-08-14 14:21:24 UTC; 1min 45s ago
      Where: /mnt/data/supervisor/mounts/smb_backups
       What: //ad6.zbraslav.lan/homeassistant
        CPU: 14ms

Aug 14 14:21:23 homeassistant systemd[1]: Mounting Supervisor cifs mount: smb_backups...
Aug 14 14:21:24 homeassistant mount[23620]: mount error(128): Key has been revoked
Aug 14 14:21:24 homeassistant mount[23620]: Refer to the mount.cifs(8) manual page (e.g. man mount.cifs) and kernel log messages (dmesg)
Aug 14 14:21:24 homeassistant systemd[1]: mnt-data-supervisor-mounts-smb_backups.mount: Mount process exited, code=exited, status=32/n/a
Aug 14 14:21:24 homeassistant systemd[1]: mnt-data-supervisor-mounts-smb_backups.mount: Failed with result 'exit-code'.
Aug 14 14:21:24 homeassistant systemd[1]: Failed to mount Supervisor cifs mount: smb_backups.

sairon commented 1 month ago

@mazzy89 That doesn't seem related; it looks more like an ACL issue: mount error(128): Key has been revoked. The dmesg output or the NAS logs might give you more details. It doesn't look like a network problem at this point.
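To pull just the error code and message out of journal output like the lines quoted above, something along these lines may help. It is a small sketch: `mount_errors` is a hypothetical helper, and the pattern assumes the `mount error(<code>): <message>` format shown in that log.

```shell
# Hypothetical helper: print "<code> <message>" for each mount error on stdin.
mount_errors() {
  sed -n 's/.*mount error(\([0-9][0-9]*\)): \(.*\)$/\1 \2/p'
}

# Example, using the unit name from the status output above:
# journalctl -b -u mnt-data-supervisor-mounts-smb_backups.mount | mount_errors
```

On Linux, error 128 corresponds to EKEYREVOKED, which points toward credentials or key handling rather than reachability.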

@tybo611 I wonder if it could be this issue: https://bugzilla.kernel.org/show_bug.cgi?id=219129 On what OS/machine is the SMB server running? In any case, today's dev build will contain a fix for that (it was a stable kernel regression fixed in 6.6.46, released literally a few minutes ago).

mazzy89 commented 1 month ago

@sairon it started exactly after the upgrade. Never had such issues before and the NAS is perfectly up and running.

sairon commented 1 month ago

@mazzy89 Please check whether it is indeed a regression and 12.4 works correctly. You can use ha os boot-slot other to swap back and forth between versions.

Anyway, I'm not saying the issue is not there, it just manifests in a different way (OP's SMB server is simply unreachable), so please don't mix it up here and open another issue.

jdesai61 commented 1 month ago

I have the same issue - I am running HAOS in a VM on Proxmox on Intel NUC with some NFS mounts (from QNAP NAS). Updating to this OS broke HA starting and symptoms are same as reported in first post here. I managed to fix it by reboot and typing "ha banner" (weird - but it works https://community.home-assistant.io/t/error-returned-from-supervisor-system-is-not-ready-with-state-setup/413084/124). However, HA is showing HA OS 13.0 available for update. I am going to wait and watch this issue here.

tybo611 commented 1 month ago

> I have the same issue - I am running HAOS in a VM on Proxmox on Intel NUC with some NFS mounts (from QNAP NAS). Updating to this OS broke HA starting and symptoms are same as reported in first post here. I managed to fix it by reboot and typing "ha banner" (weird - but it works https://community.home-assistant.io/t/error-returned-from-supervisor-system-is-not-ready-with-state-setup/413084/124). However, HA is showing HA OS 13.0 available for update. I am going to wait and watch this issue here.

I'm guessing you actually booted into the other partition, as @sairon mentioned and as I've been doing to check. After the CLI is loaded and you're on the main screen, type os info. Does yours look similar to this, where you have a bad boot partition and have booted from the 12.4 partition? Screenshot 2024-08-12 191726

tybo611 commented 1 month ago

> @mazzy89 That doesn't seem related, it rather looks like an ACL issue: mount error(128): Key has been revoked The dmesg output or checking the NAS logs might give you more details. It doesn't look like a problem with network at this point.
>
> @tybo611 I wonder if it could be this issue: https://bugzilla.kernel.org/show_bug.cgi?id=219129 On what OS/machine is the SMB server running? Eventually, today's dev will contain a fix for that (it was a stable kernel regression fixed in 6.6.46 released literally few minutes ago).

Similar to @jdesai61, I'm also running a VM in Proxmox, on an 8th-gen Intel; additional details are below. I'm using a Windows 11 laptop to access HaOS. Screenshot 2024-08-14 165123

I can try to upgrade the bad partition to the new dev release and see what happens. I'll get to it tonight and report back with either error logs or a positive outcome.

larry-glz commented 1 month ago

@tybo611 Anxious to see if this is successful. I had the same issue as you preventing the Supervisor from starting: [ 194.057161] CIFS: VFS: \\192.168.20.125 has not responded in 180 seconds. Reconnecting... with a different IP, though. Unfortunately, I did not know about ha os boot-slot other and restored a day-old VM backup, so I basically lost a day's worth of data.
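To check whether a system shows the same symptom, the kernel log can be filtered for that CIFS message. This is a sketch under assumptions: `cifs_unresponsive_servers` is a hypothetical helper, and the pattern assumes the server is written with two leading backslashes, as CIFS kernel messages normally do.

```shell
# Hypothetical helper: list servers the CIFS client reported as unresponsive.
# Reads kernel log text on stdin and prints each server name once.
cifs_unresponsive_servers() {
  grep -E 'CIFS:.*has not responded' |
    sed -n 's/.*\\\\\([^ ]*\) has not responded.*/\1/p' |
    sort -u
}

# Example: dmesg | cifs_unresponsive_servers
```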

tybo611 commented 1 month ago

Unless I'm missing an easier method (aside from waiting for the release), I'll have to change some settings and self-sign the dev build. Sound right?

sairon commented 1 month ago

@tybo611 Thanks for the effort - a dev release should have been available for a couple of hours already. However, something's wrong at Cloudflare and it simply refuses to serve the raucb image which is needed for OTA. With that, it would be a matter of a single HA CLI command to update to that version (ha os update --version 13.1.dev20240814). Most likely it's some caching issue I cannot resolve myself :cold_sweat:

If you want to go down the rabbit hole, there's a way to build your own OS build. But as it's a VM, it might be easier to create a new one from the latest dev image, set up a share and see if it fails there as well. If it doesn't, run ha os update --version 13.0 to downgrade to 13.0 and confirm it was indeed a kernel regression fixed by today's Linux release.

sairon commented 1 month ago

Alternatively, you can also try downgrading to older dev builds - use the valid versions listed in the dropdown at the artifacts page. I'm particularly interested in whether 13.0.dev20240802 contains the issue or not (for this you can simply run ha os update --version 13.0.dev20240802). It should have kernel 6.6.43, which doesn't contain the backported commit net: missing check virtio yet.
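Since a mistyped version string is easy to produce here, a thin wrapper around the update command can reject obviously malformed input first. This is only a sketch: `os_update_to` is a hypothetical helper, and the accepted pattern is an assumption based on the version strings that appear in this thread (for example 12.4, 13.0 and 13.0.dev20240802).

```shell
# Hypothetical helper: sanity-check the version string, then run the update.
os_update_to() {
  version=$1
  # Assumed pattern: MAJOR.MINOR, optionally followed by .devYYYYMMDD
  if ! printf '%s\n' "$version" | grep -Eq '^[0-9]+\.[0-9]+(\.dev[0-9]{8})?$'; then
    echo "unexpected version string: $version" >&2
    return 2
  fi
  ha os update --version "$version"
}

# Example: os_update_to 13.0.dev20240802
```

The check only catches typos in the format; it cannot tell whether that build actually exists on the artifacts page.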

tybo611 commented 1 month ago

Now that git has recovered, I'm able to download the newest dev environment.

@sairon, I appreciate you walking me through these. I haven't used dev builds before, but I'm enjoying learning about them. I updated the OS to the Aug 14 build and it booted successfully, and the network share is still attached and accessible. Screenshot 2024-08-14 204938

Is it worth going to the Aug 02 build to check that kernel, or are we good with the new 6.6.46? Admittedly, I can't remember exactly which build I started having issues with: I upgraded to one of the pre-releases quickly, noticed the issue, rolled back, and started trying to see if it was something in my setup specifically.

sairon commented 1 month ago

@tybo611 Thank you for checking! That's good news: it means the kernel bump helped, and it was probably the GSO issue above that caused the problems. Checking the Aug 02 build would just help to confirm that the regression was introduced in 6.6.44. It could give us some assurance, but it's not really needed.
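The kernel window discussed in this thread (regression suspected from 6.6.44, fixed in 6.6.46) can be compared against a running kernel with a small version check. This is a sketch: both function names are hypothetical, the lower bound is only the suspicion stated above, and the comparison assumes three numeric fields with an optional suffix after a hyphen.

```shell
# Hypothetical helper: succeed if dotted version $1 is >= $2.
# Runs in a subshell so the IFS change does not leak to the caller.
version_ge() (
  a=${1%%-*}; b=${2%%-*}      # drop any suffix such as "-haos"
  IFS=.
  set -- $a $b                # split both versions into six fields
  [ "$#" -eq 6 ] || exit 2    # expect exactly MAJOR.MINOR.PATCH twice
  [ "$1" -gt "$4" ] && exit 0
  [ "$1" -lt "$4" ] && exit 1
  [ "$2" -gt "$5" ] && exit 0
  [ "$2" -lt "$5" ] && exit 1
  [ "$3" -ge "$6" ]
)

# Succeed if the kernel is inside the suspected window: 6.6.44 <= v < 6.6.46.
in_suspect_range() {
  version_ge "$1" 6.6.44 && ! version_ge "$1" 6.6.46
}

# Example: in_suspect_range "$(uname -r)" && echo "kernel is in the suspected range"
```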

jdesai61 commented 1 month ago

> I have the same issue - I am running HAOS in a VM on Proxmox on Intel NUC with some NFS mounts (from QNAP NAS). Updating to this OS broke HA starting and symptoms are same as reported in first post here. I managed to fix it by reboot and typing "ha banner" (weird - but it works https://community.home-assistant.io/t/error-returned-from-supervisor-system-is-not-ready-with-state-setup/413084/124). However, HA is showing HA OS 13.0 available for update. I am going to wait and watch this issue here.

> I'm guessing you actually booted into the other partition as @sairon mentioned and I've been doing to check. after the CLI is loaded and on the main screen type os info, does yours look similar to this where you have a bad boot partition and you have booted from the 12.4 partition.

Mine boots - it just couldn't start HA core for some reason - with the same error message as the OP mentioned. However, I can log in to it using the console in the Proxmox web GUI. So perhaps I have a slightly different problem.

Screenshot 2024-08-15 142153

tybo611 commented 1 month ago

@sairon, I can confirm the 02 Aug build also works. I now have the VM with the 02 Aug and 14 Aug builds, booting back and forth between them; both start immediately and have no issues with the network share.

jdesai61 commented 1 month ago

By accident, I issued "ha os update" command, which ended up re-installing 13.0 and now I can't get HA to startup. Even "ha os info" just hangs. How do I try rebooting from Slot A vs Slot B?

tybo611 commented 1 month ago

You're running Proxmox, right? Shut down the VM; it'll fail a few things but eventually shut down. Then, from the Proxmox console web page for the VM, start it up and hit a key while it's in the loading phase. That will stop the process at the boot menu and you'll be able to select the other slot. It helps if you know which one you were booting from, but you should be able to base it off the "tries" number.

jdesai61 commented 1 month ago

Ok I managed to reboot into right slot with "ha os boot-slot B" command and now it gets further. But after boot, HA core won't start (I get "Error: System is not ready with state: setup"). However, if I type "ha banner" - then all is well and core starts. How can I get it to start core automagically?

tybo611 commented 1 month ago

My thoughts, though I'm not an expert: if you're booted now, why not run the update command with a specific version to downgrade to the 12.4 release that (I assume) you had no issues with?

I think it'd be something like the ha os update --version 12.4 command. That should overwrite the bad boot slot as well, and you will boot into that. Once the new 13.1 build comes out, you can push that via the normal OTA. You could just hit skip on the 13.0 build if you don't want to see the message all the time.

jdesai61 commented 1 month ago

Thanks - will do

Taomyn commented 1 month ago

I have the same issue since restarting HA this morning, but was able to boot to the other slot to get working again. How do I clean up the broken slot so I don't end up with it remaining after eventually upgrading to a fixed version? I wasn't even able to run 'os info' or anything else when booted into 13.0. just got the 'state: setup' message

image

tybo611 commented 1 month ago

HA won't boot to the other slot again until it's told to, either during an update or by command. If you want two working slots, you can run the update command to version 12.4 (command above) or one of the dev versions confirmed to be working (command also above); the update will write to the inactive slot. Otherwise, wait until the new release is made and it will override the bad slot then.

Taomyn commented 1 month ago

> If you want two working slots, you can run the update command to version 12.4(command above) or one of the dev versions confirmed to be working(command also above); the update will write to the inactive slot.

Yes I got that but how? It won't let me:

image

tybo611 commented 1 month ago

Use this instead: ha os update --version 13.0.dev20240802. It's a dev build, but I believe it was mentioned that it isn't much different from the stable build. Once it's booted, you can issue the command to boot to the other slot again and you'll be back on 12.4. Then, when the update happens, it will override the dev build.

dpgh947 commented 1 month ago

There are comments in this other issue regarding installing a dev build, you can't do it while set to the stable channel - https://github.com/home-assistant/operating-system/issues/3528

I just noticed in that pic of trying to get 12.4 in both slots, apparently failed but it now says "version_latest: 12.4" instead of 13.0.......?? And "update_available: false" instead of true...

Uschinator commented 1 month ago

I have the same problem and opened a bug report for it. Maybe someone can merge it into this bug (https://github.com/home-assistant/operating-system/issues/3524)?

I solved my Supervisor-not-starting issue with "supervisor reload" after 1 or 2 minutes. Then I removed the Network Folder that I used for backups, and after another reboot the system is running and starting normally.

After adding the network folder (an NFS share on a QNAP NAS) again, the Supervisor did not start and had to be reloaded manually.
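Waiting a minute or two and then retrying, as described above, can be expressed as a generic retry loop. This is a sketch only: `retry_until_ok` is a hypothetical helper, and using it with `ha os info` assumes that command exits with a non-zero status while the system is still in the setup state (it will not help if the command hangs instead).

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_until_ok <attempts> <delay-seconds> <command...>
retry_until_ok() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0          # stop as soon as the command succeeds
    sleep "$delay"
    i=$((i + 1))
  done
  return 1                    # every attempt failed
}

# Example: try for about two minutes, checking every 10 seconds
# retry_until_ok 12 10 ha os info
```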

mxr commented 1 month ago

With 13.1 released are you still seeing the issue?

Uschinator commented 1 month ago

I tested it and the problem seems to be gone. I did two restarts and the supervisor starts in less than a second. Thanks 👍

Taomyn commented 1 month ago

I can confirm 13.1 is working well for me too, and that it also replaced the broken slot, kept the working 12.4 and switched:

image

sairon commented 1 month ago

The root cause of the issue has been resolved by the kernel update in 13.1, released yesterday. If anyone is having issues that look similar to this one, it's likely something else; please open a new issue with a complete description in that case.