Open rrooggiieerr opened 1 month ago
I just hit this. Pi4 with SSD. I updated from 12.4 to 13.0 and HA never came up. I had to retrieve the Pi from where it lives on top of a cupboard and plug it into a screen and keyboard to see what it was doing. It was booting to the CLI, then after a couple of minutes I was getting the same messages as shown above, and then it rebooted. Left alone, it just kept doing this and HA never came up.
I got into the CLI and then discovered via the command "ha os info" that there are 2 boot slots that it flip-flops between when an update is applied. If it fails to boot 3 times it reverts to the other slot, but this problem is obviously letting the OS come up far enough to prevent that from triggering. I then used the command "ha os boot-slot other" to force it to the other slot, which had the previous 12.4 in it. The system then came up OK.
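For anyone who wants to check the active slot from a script before switching, here is a minimal sketch. It assumes `ha os info` prints a `boot:` line like the output pasted in this thread; the helper name `booted_slot` is made up for illustration and is not part of the `ha` CLI.

```shell
#!/bin/sh
# Hypothetical helper: print the currently booted slot (A or B)
# from `ha os info` output supplied on stdin.
# Matches only the top-level "boot:" key, not "boot_slots:".
booted_slot() {
    awk '$1 == "boot:" { print $2; exit }'
}

# Usage on the HA host, with the commands mentioned in this thread:
#   ha os info | booted_slot
#   ha os boot-slot other    # switch to the other slot
```

Checking the `version` field of the inactive slot first shows what the switch would boot into, although as noted further down that field can be `null`.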
I'm dealing with this too! Crashing every few minutes. Occasionally I'll also see it throw up a stack trace too but I don't know of any way to have those traces be saved. I'll try the boot slot change now...
Before:
# ha os info
board: rpi4-64
boot: B
boot_slots:
  A:
    state: inactive
    status: good
    version: null
  B:
    state: booted
    status: good
    version: "13.0"
data_disk: SN128-0x7e54b116
update_available: false
version: "13.0"
version_latest: "13.0"
After:
# ha os info
board: rpi4-64
boot: A
boot_slots:
  A:
    state: booted
    status: good
    version: null
  B:
    state: inactive
    status: good
    version: "13.0"
data_disk: SN128-0x7e54b116
update_available: true
version: "12.4"
version_latest: "13.0"
Looks like it's back on 12.4 even though the version is null under the boot slot version. Hopefully this fixes it for now...
Edit: Seems to have been reliable back on 12.4 for me. I was experiencing some crashes but I think that's because of another bug in HA Core to do with ESPHome running out of memory and crashing the host. I increased swap and it seems OK.
@dpgh947 Thanks for looking into this, I have just rolled back to 12.4 using ha os update --version 12.4
Using your proposed ha os info, I also see 2 boot slots:
board: rpi4-64
boot: B
boot_slots:
  A:
    state: inactive
    status: good
    version: "13.0"
  B:
    state: booted
    status: good
    version: "12.4"
data_disk: WX64G-0x14ec9103
update_available: true
version: "12.4"
version_latest: "13.0"
Thanks for the tip about reverting to 12.4! I got the same issue on an RPi 3B+ with a USB SSD (also tried with a USB HDD). When the Pi is hanging like that, is it possible to extract some logs? I really want to help on this case.
The error messages basically mean the system is too busy to handle kernel tasks. Can you watch memory and CPU usage at the Hardware page to check whether it's not hitting limits before it becomes unresponsive? The common denominator for this report and @dipseth's in the other issue is RPi with 1GB RAM, which is sufficient only for very simple HA setup.
Mine is a pi4 with 2gb, never had a problem before.
I have a Pi 4 with 4GB and was experiencing this. However, while recording both top from the OS and the info from the Hardware tab, I was unable to get the device to crash for 1 hour and 20 minutes. Perhaps I was only encountering this when I was doing something in Home Assistant?
I understand that the RPi 3B+ is quite old and after 5 years of loyal duty I should consider replacing it, but still, I'm trying to understand why this happened after the 13.0 update and doesn't occur when I run 12.4.
The ram shortage was quite mitigated by my bigger SWAP config.
In order to downgrade to 12.4, I had to stop every add-on and avoid connecting to the UI (otherwise the crash would occur). After I downgraded to 12.4, I re-enabled the add-ons and I have had 0 issues since then.
This reminds me that in between when it was crashing and now, where it is no longer crashing, I increased my swap to prevent ESPHome crashes.
The ram shortage was quite mitigated by my bigger SWAP config
I had done something similar, adding a 600mb SWAP file. This is still required in 12.4 but runs without any issues now.
@sairon Are you saying this as a HA OS developer, or just making a statement? What are your sources that 1 GB is not enough? My current memory usage on HA OS 12.4 is 676.3 MiB/75 % with 53 integrations active, 155 devices and 3 add ons active
@EastArctica you can run ESPHome on your laptop or local PC, this will definitely take some load away from your HA device
@rrooggiieerr
Are you saying this as a HA OS developer, or just making a statement?
I don't know what you mean by either of those, so I will elaborate a bit. On one hand, 1 GB RAM can be fine: of those who opted in to analytics, 7 % are running on RPi 3, which only has 1 GB of RAM. However, your mileage may vary. 3 add-ons and 53 integrations is still something I'd call conservative usage, as the number of add-ons in particular makes a bigger difference. Compared to you, dipseth has 18 add-ons, and many people are using HA as a self-hosting platform, where RAM becomes scarce very quickly. For other platforms, a minimum of 2 GB RAM is generally recommended in the docs, and for the CM4 in Yellow it's recommended as well. From my purely personal experience, only one of the three real HA deployments I manage sits just slightly below 1 GB after a while of usage.
In your case, the remaining 25 % of RAM can become insufficient quite quickly. It also needs to be considered that the system usually performs well if it also has some RAM available for page caches; if not, it can lead to higher I/O, and combined with swapping on (rather) slow media, that can be very detrimental to performance. Which leads me to another thought: the SD card you are using, while a good choice in terms of endurance, might not be the best for this usage overall. Per the description it's optimized for use in security cameras and has no Application Performance Class, so there's no IOPS guarantee. For HAOS it's recommended to use cards of A2 class, which perform better in scenarios like this.
In summary, the issue definitely looks performance related. To see what's going on, having an HDMI display connected, typing login in the HA CLI and checking free and uptime (or just watching top) when the system becomes unstable could help diagnose whether the theory is right or not.
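Since the system tends to freeze before anything can be read off the screen, one option is to record these readings periodically instead of watching them live. A rough sketch, assuming a POSIX shell with `free` and `uptime` available after `login`; the function name, the interval and the log path are arbitrary choices for illustration, not anything HAOS provides.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped sample of memory usage
# and load averages to the log file given as the first argument.
log_sample() {
    logfile=$1
    {
        date '+%Y-%m-%d %H:%M:%S'
        free            # memory and swap usage
        uptime          # load averages
        echo '---'      # separator between samples
    } >> "$logfile"
    sync                # flush to disk so the sample survives a hard freeze
}

# Usage: sample every 10 seconds until interrupted (path is a placeholder).
#   while true; do log_sample /path/on/persistent/storage/mem.log; sleep 10; done
```

The log path should point at storage that survives a reboot, otherwise the last samples before the freeze are lost along with it.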
Thanks for the insight. But, if I may, I still don't understand why it is working very well on 12.4 and not on 13.0, like my workload doesn't change between both. That's puzzling me right now :)
@DeXter3306 It's hard to tell without any details about your deployment. Doing what I suggested in the last paragraph of the previous post could help. Also, the kernel doesn't log all information by default, there might be more in dmesg or host logs (i.e. ha host logs or journalctl -e).
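To test the memory theory against those logs, the output can be filtered for out-of-memory reports. This is only a sketch: the patterns below are common kernel phrasings and not an exhaustive list, the exact messages in this issue may differ, and the helper name `oom_lines` is invented for illustration.

```shell
#!/bin/sh
# Hypothetical filter: keep only lines on stdin that look like kernel
# out-of-memory reports (case-insensitive match).
oom_lines() {
    grep -i -E 'out of memory|oom-kill|invoked oom-killer'
}

# Usage, with the log sources mentioned above:
#   dmesg | oom_lines
#   ha host logs | oom_lines
#   journalctl -e | oom_lines
```

No output does not rule out a resource problem; it only means none of these particular phrases were logged.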
I made a backup of my current system and reverted back to one from when I was experiencing issues before and alas! I am experiencing them once again (didn't think I'd be happy to say that). So far I've been doing most of my monitoring and diag through ssh to the OS, but when the host seems to "die" it tends to kill the network too. Through the terminal on the OS itself (via keyboard + HDMI), how can I run shell commands? I'm sorta just stuck in the home assistant CLI and don't know how to break out of it...
I think I can confirm that it is an issue with something HA Core related as the backup I had previously was ONLY HA Core and not a full backup (remember how I said that I couldn't get it to crash before and now I can, although I did have the swap increase but that shouldn't have persisted)
Edit: Just saw memory spike to 3.6GB right before the entire system froze (unable to type in console, ssh won't connect, webserver won't respond). Same thing again: https://i.imgur.com/ljideho.png Got a picture of the stack trace I was talking about before! (I apologize in advance for the quality, my system is entirely frozen after all so I can't do much) https://i.imgur.com/LrIZ4kN.png
The stack trace seems to be the system failing to exit correctly because it's out of memory which it's trying to exit because it ran out of memory...
After an hour of testing, it's the whisper model. I don't even know why I had it installed but the whisper addon is 100% what was causing my crashes at least. Currently restoring to my pre-testing backup and I'll reinstall the whisper addon to see if it causes crashes there too, then do the same test with a swap increase.
I tried 13.1 this morning, still broken. HA was just starting to come up, then it rebooted, over and over. This time I managed to get in to ssh and issue the boot-slot other command to get back to 12.4 rather than getting the pi out again and plugging in a screen, so I can't confirm the same messages that I saw before, but whatever it is, I still cannot upgrade my 2gb pi4.
Well, on a hunch, I have been playing around turning off "start at boot" for some addons. I turned off esphome, plex server, wireguard, samba and chose to install 13.1 again. It booted ok. I started esphome and samba manually, all ok. I turned on start at boot for esphome and samba, and rebooted, again came up ok. I started wireguard manually, it started ok. I started plex manually, system immediately stopped responding and then rebooted. Came up OK as I hadn't set those to start. I started plex manually first this time, started ok. Started wireguard manually, came up ok. Go figure...
I don't use plex or wireguard at the moment, so leaving start at boot for those turned off seems to have alleviated the problem. Whether this is due to an actual problem in one of those, or it's just some sort of resource issue during a full boot (maybe the order things happen has changed?) - I have no idea.
EDIT - it rebooted after about half an hour, back to 12.4 again.
Anyone else using the onboard serial port of the Raspberry PI? In my case the Phoscon RaspBee II. I'm starting to think this could be one of the causes
Describe the issue you are experiencing
Since I upgraded my Raspberry Pi 4 to the latest HA OS 13.0 the system freezes regularly with the message
Once frozen I can still ping my HA server, but can't SSH into it anymore
What operating system image do you use?
rpi4-64 (Raspberry Pi 4/400 64-bit OS)
What version of Home Assistant Operating System is installed?
13.0
Did the problem occur after upgrading the Operating System?
Yes
Hardware details
Raspberry Pi 4 with 1 GB memory and WD Purple SC QD101 microSDXC 64 GB storage, original power supply and powered USB hub, Phoscon RaspBee II Zigbee module
Steps to reproduce the issue
about 1 in 4 times the system boots up correctly and I can access HA, but then after some hours it crashes again
Anything in the Supervisor logs that might be useful for us?
Anything in the Host logs that might be useful for us?
System information
No response
Additional information
No response