home-assistant / operating-system

:beginner: Home Assistant Operating System
Apache License 2.0

After upgrading to 12.0 the system hangs #3197

Closed daviddesmet closed 1 month ago

daviddesmet commented 7 months ago

Describe the issue you are experiencing

I just upgraded to 12.0 and noticed the frontend refused to load. I plugged the mini PC directly into a monitor and rebooted, which got me into the Home Assistant CLI. From there I was able to issue some commands and check the frontend (everything loaded), but after a couple of minutes the system hangs: it doesn't respond to any keyboard input and the frontend is also unresponsive (doesn't load).

I've been running HA on this mini PC for quite some time with no issues until I upgraded. CPU usage was normally 2-3%, RAM 2.5 GB of 32 GB, and storage 12%.

There's nothing in home-assistant.log.1 that points to an issue. I wonder if I can roll back, since I'm able to get into the terminal and the HA CLI before it hangs.

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

12.0

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Upgrade
  2. Wait
  3. Game over

Anything in the Supervisor logs that might be useful for us?

Only a warning about no valid ingress session.

Anything in the Host logs that might be useful for us?

Nothing.

System information

System Information

version core-2024.2.4
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.12.1
os_name Linux
os_version 6.6.16-haos
arch x86_64
timezone America/Mexico_City
config_dir /config

Home Assistant Community Store

GitHub API | ok
GitHub Content | ok
GitHub Web | ok
GitHub API Calls Remaining | 4983
Installed Version | 1.34.0
Stage | running
Available Repositories | 1410
Downloaded Repositories | 28
HACS Data | ok

AccuWeather

can_reach_server | ok
remaining_requests | 44

Home Assistant Cloud

logged_in | true
subscription_expiration | March 3, 2024 at 18:00
relayer_connected | true
relayer_region | us-east-1
remote_enabled | true
remote_connected | true
alexa_enabled | true
google_enabled | true
remote_server |
certificate_status | ready
instance_id |
can_reach_cert_server | ok
can_reach_cloud_auth | ok
can_reach_cloud | ok

Home Assistant Supervisor

host_os | Home Assistant OS 12.0
update_channel | stable
supervisor_version | supervisor-2024.02.0
agent_version | 1.6.0
docker_version | 24.0.7
disk_total | 234.0 GB
disk_used | 28.7 GB
healthy | true
supported | true
board | generic-x86-64
supervisor_api | ok
version_api | ok
installed_addons | MariaDB (2.6.1), Studio Code Server (5.15.0), File editor (5.8.0), Advanced SSH & Web Terminal (17.1.1), Node-RED (17.0.7), Home Assistant Google Drive Backup (0.112.1), Mosquitto broker (6.4.0), Nginx Proxy Manager (1.0.1), AdGuard Home (5.0.3), Cloudflared (5.1.4), InfluxDB (5.0.0), Grafana (9.1.3), Glances (0.21.0), Zigbee2MQTT (1.35.3-1), Grott stable branch (2.7) (0.1.7), Frigate (0.13.2), Uptime Kuma (0.12.0)

Dashboards

dashboards | 4
resources | 13
views | 30
mode | storage

Recorder

oldest_recorder_run | February 19, 2024 at 15:56
current_recorder_run | February 26, 2024 at 15:34
estimated_db_size | 346.70 MiB
database_engine | mysql
database_version | 10.6.12

Additional information

No response

daviddesmet commented 7 months ago

I just experimented with stopping some add-ons and noticed the system no longer hangs when I turn off Frigate. I've been running Frigate for a while with an Edge TPU and the resource usage is very low, so I find it strange that it is now somehow crashing the host. Will dig a bit more...

agners commented 7 months ago

Can you maybe check Host logs when enabling Frigate?

We did update the kernel, and Edge TPU needs a custom driver which got updated as well. But maybe that new version is buggy 🤔

daviddesmet commented 7 months ago

Hmmm, this is interesting...

I started the Frigate add-on and watched the host logs, but they didn't show anything new. However, the Frigate logs showed a lot of errors trying to read frames from the cameras until the add-on crashed. This time I had Watchdog disabled, so the add-on wasn't started again. After a couple of minutes I noticed HA had crashed, so I had to do a manual reboot.

I've reproduced this several times and no useful logs show up for the host. I checked from the Terminal add-on and also from the host itself (monitor and keyboard connected directly).

On the last tries, I noticed the system was not hung but very slow: each typed character appeared around 15-20 seconds later. Still, no useful logs.

So it seems your assumption about the driver is correct, since the OS became unresponsive before Frigate stopped, so it's not related to the Frigate process itself.

daviddesmet commented 7 months ago

Some additional information:

I use the M.2 Accelerator (A+E key); I swapped the WiFi PCIe card for it.

It only needs Frigate to be started once. I haven't started the add-on since then and the system is so far stable.

jesson20121020 commented 7 months ago

I have similar symptoms, but I don't run Frigate and don't know how to troubleshoot!

agners commented 7 months ago

@jesson20121020 this issue is clearly Frigate/Edge TPU accelerator related, please open a new issue for your case along with all information (detailed symptom description as well as the type of system you are using).

agners commented 7 months ago

@daviddesmet that is a very interesting observation. Sounds as if the new Linux 6.6 kernel in combination with Edge TPU and the particular PCIe port triggers it? :thinking:

Is the accelerator still used, or is Frigate maybe not using it since the port change? :thinking:

Also, is Frigate without the accelerator on HAOS 12.0 stable otherwise?

In the misbehaving setting, do you see increased memory or CPU usage?
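One way to answer the memory question even when the host ends up frozen is to log memory samples to disk while Frigate starts. The following is a minimal sketch, not an official HAOS tool: it assumes Python is available wherever it runs (for example inside a terminal add-on), that `/proc/meminfo` is readable there, and the function names and log path are hypothetical.

```python
import os
import time
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Return /proc/meminfo fields as {name: integer value} (sizes are in kiB)."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_percent(fields: Dict[str, int]) -> float:
    """Share of RAM that is not available for new allocations, in percent."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def log_memory(log_path: str, interval_s: float = 5.0, samples: int = 120) -> None:
    """Append one timestamped sample per interval and sync each line to disk."""
    with open(log_path, "a") as log:
        for _ in range(samples):
            with open("/proc/meminfo") as meminfo:
                fields = parse_meminfo(meminfo.read())
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            log.write(
                f"{stamp} used={used_percent(fields):.1f}% "
                f"available_kib={fields['MemAvailable']}\n"
            )
            log.flush()
            os.fsync(log.fileno())  # keep the last samples if a hard reset follows
            time.sleep(interval_s)
```

Calling `log_memory("/share/meminfo.log")` (a hypothetical path) just before starting Frigate should leave a trail of samples that can be read after the reboot. Syncing after every line is slow but deliberate, since the interesting samples are the ones written right before the hang.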

gjobin commented 7 months ago

I am pretty new to HAOS, but experienced the same freezing issue after upgrading from 11.5 to 12.0. I am running it in a VM on TrueNAS SCALE.

Symptoms :

Hopefully that helps. I have now reverted to a snapshot of my VM to restore things.

Add-ons :

Integrations (Other than default) :

agners commented 7 months ago

@gjobin it seems you are not using an Edge TPU or the Frigate add-on, so this is unlikely to be related to this issue. Please open a new issue so we can investigate separately.

daviddesmet commented 7 months ago

@agners I got some good and bad news.

The good news is that it doesn't seem to be related to the TPU. The bad news is that I disabled the TPU, used the CPU instead, and experienced the same issue.

In the graph below, you can see a spike in RAM usage when starting Frigate with the TPU enabled. As soon as it made the system unstable, I rebooted, disabled the TPU and started Frigate again, with the same spike in RAM:

image

image

TPU disabled code:

# detectors:
#   coral:
#     type: edgetpu
#     device: pci:0

Frigate version is 0.13.2, it has been running since the update to 12.0.

sairon commented 7 months ago

So obviously something in the Frigate add-on is misbehaving. You can try checking the memory usage of the processes running in the container by running docker exec -ti addon_ccab4aaf_frigate top directly on the host, hopefully that will reveal the process that's responsible.
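If the interactive `top` view is hard to read before the host becomes sluggish, a one-shot process listing can be captured and ranked afterwards. This is a rough sketch under assumptions: it expects output in the shape of `ps -eo pid,rss,comm` (procps style, RSS in kiB), it is not confirmed that the Frigate container ships that `ps`, and the helper names are made up for illustration.

```python
from typing import List, NamedTuple


class ProcMem(NamedTuple):
    pid: int
    rss_kib: int
    command: str


def parse_ps_rss(ps_output: str) -> List[ProcMem]:
    """Parse `ps -eo pid,rss,comm` style output into records."""
    rows: List[ProcMem] = []
    for line in ps_output.strip().splitlines():
        parts = line.split(None, 2)
        # Skip the header row and any line that does not start with two integers.
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        rows.append(ProcMem(int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows


def top_by_rss(ps_output: str, limit: int = 5) -> List[ProcMem]:
    """Return the `limit` processes with the largest resident set size."""
    rows = parse_ps_rss(ps_output)
    return sorted(rows, key=lambda proc: proc.rss_kib, reverse=True)[:limit]
```

The listing could be saved on the host with something like `docker exec addon_ccab4aaf_frigate ps -eo pid,rss,comm > frigate_ps.txt` (assuming that `ps` exists in the container) and the file contents passed to `top_by_rss` wherever Python is available.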

TomK commented 6 months ago

I'll try to gather some evidence on this too. I'm having the same problems, which at first I thought were disk corruption, but after disabling Frigate for a while I found the crashes stopped. The issues only started after updating the OS, though. The Frigate version remained the same.

TomK commented 6 months ago

Possibly related to #3206. No crashes after spending the last week with the Frigate add-on disabled. I re-enabled it yesterday and it crashed within a few minutes.

After a bit of tinkering I managed to resolve my system crashes by switching away from the "full access" version of Frigate, effectively reinstating "protected mode" in the add-on.

daviddesmet commented 6 months ago

That's interesting. I don't use the Frigate (Full Access) add-on; I've been using the one just called Frigate and have left it disabled since the issue appeared. As soon as I re-enable it, the system crashes and I have to reboot manually. There's no "protection mode" toggle on the one I have installed.

I've tried with every update of HA to see if it gets fixed, but so far, it behaves the same for me.

jarkastr commented 4 months ago

I believe I am having a similar problem to this.

I'm running HAOS on an old Intel i7 laptop with the Wi-Fi card replaced by a Google Coral TPU, and running Frigate on it in unprotected mode.

Every so often (sometimes once a day, most often about once a week) I can't access Home Assistant. I can see the CLI but it is frozen. The only way to get back to Home Assistant is to hard reset the laptop. There is nothing in the Home Assistant or Frigate logs (set to debug) that could be causing this.

I can't seem to figure out how to get to the host log after a crash. If someone can point me to some documentation on how to do this, I would be happy to do some digging/monitoring to help get it resolved.

github-actions[bot] commented 1 month ago

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.