home-assistant / operating-system

System completely hangs after upgrading from 11.5 to 12.0 #3206

Closed gjobin closed 4 months ago

gjobin commented 8 months ago

Describe the issue you are experiencing

I followed this tutorial to initially install HAOS on TrueNAS SCALE as a VM.

Host :

Symptoms :

Add-ons :

Integrations (Other than default) :

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

11.5

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Upgrade
  2. Wait
  3. System Hanging

Anything in the Supervisor logs that might be useful for us?

I can't really do that once upgraded, because the system is unresponsive, including the VM console.

Anything in the Host logs that might be useful for us?

I can't really do that once upgraded, because the system is unresponsive, including the VM console.

System information

version core-2024.2.4
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.12.1
os_name Linux
os_version 6.1.74-haos
arch x86_64
timezone America/New_York
config_dir /config
Home Assistant Community Store

GitHub API | ok
GitHub Content | ok
GitHub Web | ok
GitHub API Calls Remaining | 5000
Installed Version | 1.34.0
Stage | running
Available Repositories | 1407
Downloaded Repositories | 6
HACS Data | ok

Home Assistant Cloud

logged_in | false
can_reach_cert_server | ok
can_reach_cloud_auth | ok
can_reach_cloud | ok

Home Assistant Supervisor

host_os | Home Assistant OS 11.5
update_channel | stable
supervisor_version | supervisor-2024.02.0
agent_version | 1.6.0
docker_version | 24.0.7
disk_total | 48.5 GB
disk_used | 6.0 GB
healthy | true
supported | true
board | ova
supervisor_api | ok
version_api | ok
installed_addons | Studio Code Server (5.15.0), Advanced SSH & Web Terminal (17.1.1), Cloudflared (5.1.4), Home Assistant Google Drive Backup (0.112.1)

Dashboards

dashboards | 1
resources | 0
mode | auto-gen

Recorder

oldest_recorder_run | February 23, 2024 at 1:45 AM
current_recorder_run | February 26, 2024 at 6:22 PM
estimated_db_size | 37.74 MiB
database_engine | sqlite
database_version | 3.44.2

Additional information

No response

fwartner commented 8 months ago

Can confirm. It results in random crashes.

jmcollin78 commented 8 months ago

I seem to have the same issue on an RPi 4. After installing HAOS 12.0, HA doesn't boot anymore. I have a Pi 4 which boots from an SSD.

EDIT: A hard reset solves the issue.

sairon commented 8 months ago

> I tried installing 12.0 from scratch then restore my backup (partial or complete) with the same result after a bit.

Does it mean it only starts to happen after you restore the configuration, but the vanilla OS doesn't show these issues?

Can you share any details of the HW the TrueNAS OS is running on?

The symptoms are similar to out-of-memory issues. Do you have any insight into the memory usage of the VM?

Silther commented 8 months ago

> I seem to have the same issue on an RPi 4. After installing HAOS 12.0, HA doesn't boot anymore. I have a Pi 4 which boots from an SSD.
>
> EDIT: A hard reset solves the issue.

Do you mean you reinstalled Home Assistant?

gjobin commented 8 months ago

> > I tried installing 12.0 from scratch then restore my backup (partial or complete) with the same result after a bit.
>
> Does it mean it only starts to happen after you restore the configuration, but the vanilla OS doesn't show these issues?
>
> Can you share any details of the HW the TrueNAS OS is running on?
>
> The symptoms are similar to out-of-memory issues, do you have any insights about the memory usage of the VM?

I have not tried to create a fresh config on the fresh installation, but it did boot up and allow me to restore the configurations, yes. I also did not wait longer to validate if it would fail after a while, sitting waiting for initialization.

This is my host machine, currently running all my apps and HAOS 11.5:

image

My current 11.5 VM is configured this way:

image

With the 12.0 OS crashing, I did try to bump both Minimum Memory Size and Memory Size to 6 GiB, without success.

EDIT: This is what the /config/hardware page shows in 11.5:

image

Silther commented 8 months ago

Thanks, looks like this isn't the problem with my device. For me, all add-ons seem to be broken, and when I tried to access it via a proxy manager, no connection could be established.

sairon commented 8 months ago

@gjobin This really looks like the HA VM is running out of memory. The Memory graph in HA shows the actual memory consumption (without buffers/caches), so if it's hovering around 98%, the VM is running out of memory and probably swapping heavily, which would produce the symptoms you describe. Here's the memory usage of my instance, which runs way more custom integrations and add-ons than yours:

image
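
For readers wondering what "memory consumption without buffers/caches" means in practice, here is a rough Python sketch that derives such a percentage from `/proc/meminfo`-style text. It is only an illustration: the function names and the sample numbers are made up, and how Home Assistant actually computes its Memory graph is not confirmed anywhere in this thread.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into a {field: KiB} mapping."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip lines without a "Name: <number>" shape.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_percent(fields: dict[str, int]) -> float:
    """Percentage of RAM in use, not counting buffers and page cache."""
    total = fields["MemTotal"]
    reclaimable = fields.get("Buffers", 0) + fields.get("Cached", 0)
    used = total - fields["MemFree"] - reclaimable
    return 100.0 * used / total


# Made-up numbers (hypothetical guest under memory pressure), not from this issue.
SAMPLE = """\
MemTotal:        4000000 kB
MemFree:           40000 kB
Buffers:           20000 kB
Cached:            20000 kB
"""

print(f"{used_percent(parse_meminfo(SAMPLE)):.1f}% used")  # prints "98.0% used"
```

On a Linux guest, the same functions could be fed the contents of `/proc/meminfo` instead of `SAMPLE`.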

It can't be ruled out that the OS update triggered something to misbehave. For a start, I would restart HA in safe mode to check whether any custom integration is to blame. But most likely the memory consumption was always on the edge even in 11.5, and with some of the recent changes it just went too high.

I also recommend setting the "Minimum memory size" and actual "Memory size" to the same value for the VM. I expect this to disable memory ballooning, i.e. the hypervisor will allocate the fixed amount of RAM instead of increasing it on demand. This can also rule out some lower-level issues.

teijosantala commented 8 months ago

I have similar issues: the OS crashes randomly (less than once per day). Here is the call stack of the latest crash:

VirtualBox_HA_04_03_2024_16_39_34

It seems to be related to USB. I took out my Bluetooth adapter to see if that is the cause.

gjobin commented 8 months ago

> It can't be ruled out that the OS update triggered something to misbehave. For a start, I would restart HA in safe mode to check whether any custom integration is to blame. But most likely the memory consumption was always on the edge even in 11.5, and with some of the recent changes it just went too high.
>
> I also recommend setting the "Minimum memory size" and actual "Memory size" to the same value for the VM. I expect this to disable memory ballooning, i.e. the hypervisor will allocate the fixed amount of RAM instead of increasing it on demand. This can also rule out some lower-level issues.

It makes sense to allocate static memory, so I set a fixed amount of RAM, 8 GiB (for now), and here is what it looks like now on 11.5. Interestingly, I have added more integrations/plugins on 11.5 than when I started this thread. I am pretty new to HA and was still adding integrations.

image

How do I start in safe mode after updating, once it is crashing and unresponsive?

gjobin commented 8 months ago

I just redid the update to 12 with 8 GiB and it seems stable so far.

Here is current usage

image

Do you think it's okay to bring it back to 4 GiB?

gjobin commented 8 months ago

Also, since my initial report of the issue, there have been both a Core and a Supervisor update. I wonder if they might have fixed a potential memory issue.

Edit : changed "Core and a Supervisor issue" to "Core and a Supervisor update"

sairon commented 8 months ago

> I just redid the update to 12 with 8GiB and it seems stable so far.

Hmm, that looks good indeed. I wonder if there isn't something wrong with the ballooning driver in the newer kernel :thinking: If you're willing to do some more tests, could you set the "minimum memory size" to 512M again and check if it starts to eat the RAM again?

> Would you think it's Okay to bring it back to 4 GiB ?

My guess is that it should be okay to do so. I'd say that many people run it on systems with that (or even lower) amount of RAM.

> Also, in between my initial report of the issue, there has been both, a Core and a Supervisor issue. I wonder if they might have fixed any potential Memory issue.

I am not aware of any recent issues in Core or Supervisor causing memory to leak, so likely not.

gjobin commented 8 months ago

I changed it back to 512 MiB/4 GiB, and the system is hanging again. I then moved to 1 GiB/8 GiB, and this is what I see on the hardware page:

image

It seems to me that your assumption is right.

And to further prove it, I set it back to 4GiB/4GiB without any issues.

image

Glad it's working for me now. At least my issue seems to be reproducible, which is always good news.

kevtuning commented 8 months ago

Thank you for your investigation!

I have had the same issue with Proxmox since 12.0. I will definitely check my VM memory config when I'm back home. (I know that there's 4 GB allocated, but I am not sure about the minimum, and I don't have access to it from the office.)

redzoro01 commented 8 months ago

Same issue when running 12.0 on Virtual Machine Manager on a Synology NAS. There is definitely a memory problem, and several processes are killed by the kernel's Out Of Memory Killer. You can see that in the console messages. These are not always the same processes. Sometimes it is even impossible to get a console connection, and a complete power cycle of the virtual machine is required. After a downgrade to 11.1 or 11.5, everything works fine.
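
If the console or host log can be captured as text, a small Python sketch like the one below can list which processes the OOM killer took out. It assumes the "Out of memory: Killed process PID (name)" wording printed by recent Linux kernels; the sample log lines and the function name are invented for illustration and are not taken from any machine in this thread.

```python
import re

# Kernel OOM-killer report, e.g.
# "Out of memory: Killed process 1234 (python3) total-vm:..."
OOM_LINE = re.compile(r"Out of memory: Killed process (\d+) \(([^)]+)\)")


def killed_processes(log_text: str) -> list[tuple[int, str]]:
    """Return (pid, process name) for every OOM kill found in the log text."""
    return [(int(pid), name) for pid, name in OOM_LINE.findall(log_text)]


# Invented sample, only to show the shape of the input.
SAMPLE_LOG = """\
[ 7301.112233] Out of memory: Killed process 812 (python3) total-vm:1048576kB, anon-rss:524288kB
[ 7302.000001] some unrelated kernel message
[ 7355.445566] Out of memory: Killed process 1290 (node) total-vm:2097152kB, anon-rss:262144kB
"""

print(killed_processes(SAMPLE_LOG))  # prints [(812, 'python3'), (1290, 'node')]
```

Seeing different process names on each run would match the observation above that it is not always the same processes being killed.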

vuisme commented 7 months ago

Same problem. My Pi 3+ has been giving me trouble since the 12.0 update: random rebooting.

xoatrash commented 7 months ago

Same here. I'm running on Synology VMM and have been getting random freezes every few hours since the latest update. It seems like it runs out of memory, because I once got that message in the console.

IMG_8754

And then I got problems like:

ha > [ 7413.438309] CIFS: VFS: server 192.168.20.2 does not advertise interfaces
[ 7413.440727] CIFS: VFS: server 192.168.20.2 does not advertise

or

ha > [30798.303328] systemd[1]: systemd-resolved.service: Watchdog timeout (limit 3min)!
[30899.577419] systemd-coredump[39466]: Process 109 (systemd-journal) of user 0 dumped core.

And sometimes there is no message, because the console is frozen.

This also happens with 12.1.

github-actions[bot] commented 4 months ago

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.