home-assistant / operating-system

Host crashes and goes offline at random times #1336

Closed: rousveiga closed this issue 2 years ago

rousveiga commented 3 years ago

Hardware Environment

Home Assistant OS release:

eiiot commented 3 years ago

I've been having the same problem, along with many other people; see #1119. The only current fix is to roll back to a previous version :/

agners commented 3 years ago

Can you check if downgrading helps for you too? Either install from scratch and restore the snapshot, or run the following command in the Terminal add-on (or on the system terminal, using a keyboard and screen and logging in as root):

ha os update --version 5.4

Also try to monitor memory and CPU usage, e.g. using the system monitor integration.

rousveiga commented 3 years ago

I will try both.

rousveiga commented 3 years ago

The system crashed again yesterday. Here are the history graphs for the memory use, processor use and processor temperature:

[Images: history graphs for memory_percent, processor_percent and processor_temp]

Nothing seems out of the ordinary to me. Those peaks before noon probably have to do with me actively tinkering with the installation (changing the config, rebooting, etc.).

bschatzow commented 3 years ago

Did you roll back to 5.4? It fixed the same issue for me. Also check out the spreadsheet in GitHub issue #1119. Lots of people have the same issue, and nothing in any log was helpful to any of the developers. I am currently trying RPI OS with HA as Supervisor. I'm now at 2-plus days, but only 20 hours continuously due to a lot of tinkering. So far it has not frozen or shown any other problem.

rousveiga commented 3 years ago

I'm currently doing so - executing the ha os update command made my system unbootable 😅 trying to back up what I can and putting out the fire now.

I didn't know about that spreadsheet, thank you for telling me! I will take a look at it.

bschatzow commented 3 years ago

https://docs.google.com/spreadsheets/d/1iHTVvaNlTUqwFUgsUhUNws2Sw115INIx5ChEgTnIfoc/edit#gid=0
https://github.com/home-assistant/operating-system/issues/1119
https://github.com/home-assistant/operating-system/issues/1256

I have run ha os update --version 5.4 more times than I can remember and never had a system fail to boot afterwards. I assume you see SSH lose its connection and then the system doesn't restart? I usually keep another window open that pings my Pi (e.g. ping 192.168.1.48 -t), so I can tell when it loses connection and when it is starting back up. Sometimes it takes a while. Do you have a monitor attached?
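
The ping trick above can also be scripted so the outages get timestamps. This is only a rough sketch and not something anyone in this thread ran: it swaps the ICMP ping for a TCP connection attempt to port 8123 (the default Home Assistant web port), and the names `is_reachable`, `outage_windows` and `watch` are made up for illustration.

```python
import socket
import time
from typing import Iterable, List, Optional, Tuple


def is_reachable(host: str, port: int = 8123, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def outage_windows(
    samples: Iterable[Tuple[float, bool]],
) -> List[Tuple[float, Optional[float]]]:
    """Turn time-ordered (timestamp, reachable) samples into outage windows.

    Each window is (first_failed_timestamp, first_timestamp_back_up).
    The end is None if the host was still down at the last sample.
    """
    windows: List[Tuple[float, Optional[float]]] = []
    down_since: Optional[float] = None
    for timestamp, reachable in samples:
        if not reachable and down_since is None:
            down_since = timestamp
        elif reachable and down_since is not None:
            windows.append((down_since, timestamp))
            down_since = None
    if down_since is not None:
        windows.append((down_since, None))
    return windows


def watch(host: str, interval: float = 5.0, rounds: int = 12):
    """Probe the host `rounds` times and report the outages that were seen."""
    samples = []
    for _ in range(rounds):
        samples.append((time.time(), is_reachable(host)))
        time.sleep(interval)
    return outage_windows(samples)
```

Calling `watch("192.168.1.48")` with the address from the comment above would probe for about a minute. A window that ends in `None` means the host never came back during the run, which is the "never restarts" case.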

rousveiga commented 3 years ago

Thank you for the links!

Yes, it never restarts. The host stays completely offline after the downgrade, and after power-cycling as well. Seems to be the same situation this user has experienced.

bschatzow commented 3 years ago

At least for me, the snapshot restore seems to get you back up fairly easily (other than occasional DB issues). The users here are helpful, and I know that @agners is watching and trying to help. What is sad is that you are using only an SD card, without the SSD / controller setups that most people on #1119 have. Your system should be fully supported. Something changed with the kernel (not sure what) that has been causing issues for many Pi users since Nov 2020. I never saw this in the year prior.

rousveiga commented 3 years ago

Yes, I'm grateful for the help I'm finding in these issues. I apologize if I sounded like I was blaming someone else for my system going down. I just panicked a little when the RasPi was offline after the downgrade.

bschatzow commented 3 years ago

I did not take it that you were blaming anyone. Some days frustration hits over very simple things.

outthereandaway commented 3 years ago

I am experiencing the exact same issues as @rousveiga. I am using Home Assistant OS on a Raspberry Pi 4 (4GB, I believe). HA boots from a 64 GB SD card and freezes completely about once a day (and seems to take down my entire network to a certain degree). Power cycling helps until the next crash. I tried to downgrade to OS 5.4, but HA did not boot anymore (SSH was not possible either). I am now installing HA OS version 5.4 on the SD card and will then try to restore the latest snapshot. Hope this helps.

bschatzow commented 3 years ago

@rousveiga @outthereandaway, both of you are running fully supported installations (unlike those of us using an SSD). It is crazy that you are both having this issue. In November I was able to run versions up to 5.8 on my SD card. I first started having issues when I switched to SSD. I found that downgrading to 5.4 on my SSD worked. Nothing after this worked. I tried the "split system" (SD boot and SSD for everything else) and that did not work either.

I am stable now with two different configurations: SSD on HA OS 5.4 or SSD on RPI OS and HA as Supervisor. Both work for me with no issues.

What EEPROM are you guys on?
I have the March stable version.
What power supply are you using?
Are you on Wi-Fi or wired? Have you tried a different SD card? I had issues in the past where the memory card was bad. Only so many writes! I would try this first.

You may want to ask in Discord, as there are usually very knowledgeable people online. If you are lucky you can get almost live support rather than waiting for an answer on GitHub or the forums.

rousveiga commented 3 years ago

@bschatzow

What EEPROM are you guys on?

Output of vcgencmd:

May 10 2019 19:40:36
version d2402c53cdeb0f072ff05d52987b1b6b6d474691 (release)
timestamp 0
update-time 0
capabilities 0x00000000

What power supply are you using?

A 5V 3A supply from Aukru, this one.

Are you on Wi-Fi or wired?

Wi-Fi. The network topology is complex, with a few access points involved, some of which go down at night. (Not the one the Pi is on).

Have you tried a different SD card? I had issues in the past where the memory card was bad. Only so many writes! I would try this first.

Yes, when I first set it up I used a 32GB card that I had around. I started getting write errors around the beginning of March, and replaced it with my current SD a couple weeks ago.

You may want to ask in Discord, as there are usually very knowledgeable people online. If you are lucky you can get almost live support rather than waiting for an answer on GitHub or the forums.

Thanks for the tip! I will.

I also filled the spreadsheet with my system specifications. Today I will try to set up 5.4 again and hope it works.

bschatzow commented 3 years ago

I would get at least the August 2020 critical EEPROM update. I am surprised at the date you posted, as it is older than the June 2019 Pi 4 release date.

rousveiga commented 3 years ago

I got that output from running docker container exec homeassistant /opt/vc/bin/vcgencmd bootloader_version on the host (SSH on port 22222) - that's the correct way to find that info, right?

bschatzow commented 3 years ago

That should work fine. I have a small RPI boot disk that I was using for updates and I used it.
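
For anyone who wants to compare their own bootloader date the same way, here is a small sketch that pulls the build date out of the `vcgencmd bootloader_version` output posted above. It assumes the first line always has the `Mon DD YYYY HH:MM:SS` shape seen in this thread; the helper name `bootloader_build_date` and the `SAMPLE` constant are just for illustration.

```python
from datetime import datetime

# English month abbreviations as they appear in the output posted above.
_MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

# Output copied from the comment earlier in this thread.
SAMPLE = """May 10 2019 19:40:36
version d2402c53cdeb0f072ff05d52987b1b6b6d474691 (release)
timestamp 0
update-time 0
capabilities 0x00000000
"""


def bootloader_build_date(output: str) -> datetime:
    """Parse the build date from the first line of the bootloader_version output."""
    first_line = output.strip().splitlines()[0]
    month, day, year, clock = first_line.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), _MONTHS[month], int(day), hour, minute, second)


if __name__ == "__main__":
    built = bootloader_build_date(SAMPLE)
    # "June 2019" is the Pi 4 release month mentioned above.
    print(built.date(), "older than June 2019:", built < datetime(2019, 6, 1))
```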

rousveiga commented 3 years ago

I see. @outthereandaway, do you get the same output?

Also, is there anything else I can provide to help diagnose the issue? My system is crashing almost daily, so I can retrieve logs from that; I can also look for a different kind of log if it's available after the crash.

rousveiga commented 3 years ago

Just downgraded to OS 5.3! Will update on the results.

outthereandaway commented 3 years ago

@rousveiga I did not have the time to conduct more analysis. In the meantime I have installed HA-OS 4.2 and it has been running just fine so far. No more issues. Also stats look pretty normal.

[Image: Screenshot 2021-04-29 at 13 41 29]

rousveiga commented 3 years ago

After downgrading to 5.3, for the first time in about a week, the Pi hasn't crashed! 👏 I'm very happy about this.

bschatzow commented 3 years ago

Good to hear. I could use 5.2, 5.3 and 5.4 with no issues. Nothing after those has worked for me. I also have issues with the Debian March version crashing within several hours with HA as supervisor. RPI OS with HA as supervisor seems to work for me as well.

drenergy commented 3 years ago

I had the same issues. Today I downgraded to 5.3 with the command "ha os update --version 5.3". I will soon find out whether this worked, since I previously had to reboot my Pi every day.

eiiot commented 3 years ago

I recently moved over to Docker, but I still filled out the sheet with information from my last backup. Hopefully this gets fixed; it was one of the main reasons I moved...

drenergy commented 3 years ago

Okay, yesterday I downgraded to 5.3 and my Home Assistant on my Raspberry Pi 4 with SSD is still running!

bschatzow commented 3 years ago

@drenergy check out

https://github.com/home-assistant/operating-system/issues/1119

You will see a lot of people with the same issues.

johnny-de commented 3 years ago

Same problem here. I made a watchdog to still have high availability. But now I run into problems with state-based automations. The status is updated with every restart. So I can't trigger automations if an entity has been in the same state for a week or so.

pataar commented 3 years ago

Still having the same problem after upgrading to version 6. Unfortunately I can't seem to downgrade anymore. [image]

@agners Is there an alternative way of downgrading?

rousveiga commented 3 years ago

Still having the same problem after upgrading to version 6. Unfortunately I can't seem to downgrade anymore. [image]

@agners Is there an alternative way of downgrading?

@pataar, I found this in the blog post about the beta:

OS 6 can only be downgraded to the latest OS 5 release 5.13. However, from OS 5.13 it should still be possible to downgrade to older releases.

Downgrading to 5.13 worked fine (from a new 6.0 installation); however, when I tried to downgrade further to 5.3, I got the same error.

agners commented 3 years ago

@rousveiga @pataar due to the OS rename, downgrading to anything lower than 5.13 currently doesn't work (since the URL is part of the version files we download).

However, you can downgrade by downloading the raucb file from the GitHub release page and storing it on a USB flash drive named CONFIG. Upon import it will downgrade to that particular version. See: https://github.com/home-assistant/operating-system/blob/rel-6/Documentation/configuration.md

pataar commented 3 years ago

@agners Thanks, will try that!

bschatzow commented 3 years ago

@agners, I am volunteering to help test. Mine and other systems have this same issue, and I believe it has nothing to do with the SSD. There is something different on some of the Pi boards that is affecting timing (which I believe is the issue). My system with SD / SSD has the same issue as @rousveiga. Her system is SD only, so it is not a controller / SSD issue but an issue with how HA OS was changed after 5.4. My system has been stable with Debian for over 6 weeks with the exact same hardware, the same HA add-ons and the same system setup using supervisor.

If there is something I can help with I would be glad to test and pass on the information. Thanks.

agners commented 3 years ago

I am trying to get a clearer picture of the reports about Raspberry Pi crashes. It seems that release 6.0 does not particularly improve the situation.

@rousveiga @outthereandaway your cases are interesting since they do not seem to involve an SSD.

@rousveiga from your comment above it seems that you tried release 6, but it did not show an improvement?

In general, reading the various reports and trying to make sense of the spreadsheet, I think we are looking at several different problems. I am mostly interested in the systems that ran stably on older versions (5.4 and before). In OS release 5.5 the Raspberry Pi kernel was upgraded from 4.19 to 5.4, and the firmware was upgraded from https://github.com/raspberrypi/firmware/tree/7caead9416f64b2d33361c703fb243b8e157eba4 to https://github.com/raspberrypi/firmware/tree/2ba11f2a07760588546821aed578010252c9ecb3.

What would be interesting is using OS release 5.5+ while still using the old firmware. To try out a different firmware, simply replace start4.elf and fixup4.dat on the first FAT partition of the SD card with the downloaded files from this link.
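
If you would rather script the file swap than drag files around by hand, something along these lines could work. It is only a sketch under assumptions: the function name `replace_firmware` and all three directory arguments are hypothetical, and you have to point them at wherever the first FAT partition is mounted and wherever you saved the downloaded files. It keeps a copy of the current files so the swap can be undone.

```python
import shutil
from pathlib import Path

# The two files named in the comment above.
FIRMWARE_FILES = ("start4.elf", "fixup4.dat")


def replace_firmware(boot_dir, new_dir, backup_dir) -> None:
    """Back up the firmware files in boot_dir, then overwrite them from new_dir.

    boot_dir:   mount point of the first FAT partition of the SD card
    new_dir:    folder holding the downloaded start4.elf and fixup4.dat
    backup_dir: folder that receives copies of the files being replaced
    """
    boot_dir, new_dir, backup_dir = Path(boot_dir), Path(new_dir), Path(backup_dir)

    # Check everything up front so a missing file cannot leave a half-done swap.
    for name in FIRMWARE_FILES:
        for folder in (boot_dir, new_dir):
            if not (folder / name).is_file():
                raise FileNotFoundError(folder / name)

    backup_dir.mkdir(parents=True, exist_ok=True)
    for name in FIRMWARE_FILES:
        shutil.copy2(boot_dir / name, backup_dir / name)  # keep the current file
        shutil.copy2(new_dir / name, boot_dir / name)     # install the other firmware
```

To go back, call the same function again with `backup_dir` as the source folder and a fresh folder for the new backup.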

bschatzow commented 3 years ago

Just to be clear, I tried every firmware that was marked as stable since September 2020. None of them helped me. All versions of HA OS after 5.4 froze after many hours. There were a lot of people with similar issues who fixed them by either going back to 5.4 or below, or by switching to RPI OS or Debian. I read @pvizeli's comments carefully and I don't believe it is a USB-boot issue. None of us who have been reporting since November are having boot issues. Rosa's issue occurs when booting from SD and nothing else. I tried the SD / SSD split and it froze after several hours.

agners commented 3 years ago

The reason I am asking for a test of this old firmware specifically is that someone claimed a firmware update was related to the issues @rousveiga is seeing:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

In Rosa's case these RCU stalls are not normal; they are concerning and likely causing the issues. So I'd like to get to the bottom of them.
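
To check whether your own kernel log contains the same message, a quick scan like the one below may help. It is a sketch, not an official diagnostic: the name `find_rcu_stalls` is made up, and it assumes you already have the kernel log as text (for example saved `dmesg` output). The pattern matches the line quoted above and, as an assumption, also the "self-detected stall" variant of the message.

```python
import re
from typing import Iterable, List, Tuple

# Matches e.g. "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:".
# The optional "self-" prefix is an assumption covering the other variant.
RCU_STALL = re.compile(r"rcu: INFO: \w+ (?:self-)?detected stalls?\b")


def find_rcu_stalls(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) for every line that reports an RCU stall."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if RCU_STALL.search(line)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python find_rcu_stalls.py saved_kernel_log.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for number, line in find_rcu_stalls(log):
            print(f"{number}: {line}")
```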

@bschatzow I think in your case the root cause is a different one.

bschatzow commented 3 years ago

If you can think of anything I can do to help, let me know. I believe many others are willing to help if you can come up with what you need, @TuomasPakkanen, @HumanSkunk, and many others. They have posted detailed information and have figured out how to get a stable HA system.

rousveiga commented 3 years ago

However, you can downgrade by downloading the raucb file from the GitHub release page and storing it on a USB flash drive named CONFIG. Upon import it will downgrade to that particular version. See: https://github.com/home-assistant/operating-system/blob/rel-6/Documentation/configuration.md

@agners I see, thank you for your help!

@rousveiga from your comment above it seems that you tried release 6, but it did not show an improvement?

I'm currently testing it on my spare Pi (the setup is exactly the same as with my "main" Pi). So far it's looking good, it's been running for two days without issues; my intention is to leave it up for a couple more weeks, just in case.

I asked about the downgrade because I wanted to check that I could downgrade to a stable version before upgrading my main installation.

What would be interesting is using OS release 5.5+ while still using the old firmware. To try out a different firmware, simply replace start4.elf and fixup4.dat on the first FAT partition of the SD card with the downloaded files from this link.

I will try this as well!

pataar commented 3 years ago

Downgrading from 6.1 to 5.13 to 5.3 using the CLI also works again :) Thanks for fixing that, @agners
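
For reference, the two-step path described in this thread (OS 6 first to 5.13, then on to an older 5.x release) can be written down as a tiny helper. This is just an illustration of the rule quoted from the blog post, not anything the `ha` CLI does for you; `downgrade_steps` and `as_commands` are made-up names, and the only real command used is the `ha os update --version` call that already appears above.

```python
from typing import List, Tuple

# Per the blog post quoted above, OS 6 can only go down to 5.13 directly.
BRIDGE_VERSION = "5.13"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "5.13" into (5, 13) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def downgrade_steps(current: str, target: str) -> List[str]:
    """List the versions to install, in order, to get from current to target."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt >= cur:
        raise ValueError(f"{target} is not older than {current}")
    steps = []
    # From OS 6 or later, anything below 5.13 needs a stop at 5.13 first.
    if cur[0] >= 6 and tgt < parse_version(BRIDGE_VERSION):
        steps.append(BRIDGE_VERSION)
    steps.append(target)
    return steps


def as_commands(steps: List[str]) -> List[str]:
    """Render each step as the CLI command used earlier in this thread."""
    return [f"ha os update --version {version}" for version in steps]
```

For the case above, `as_commands(downgrade_steps("6.1", "5.3"))` gives the 5.13 command followed by the 5.3 command.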

rousveiga commented 3 years ago

Seems I spoke too soon. I decided it was finally time to upgrade my main Pi to OS 6, and it crashed that same night. Will test it a couple more times before I downgrade.

Cazimbo commented 3 years ago

I have had the same issue. I downgraded to 5.4 two days ago and it runs stable again now. But I'm not risking an upgrade until this problem is sorted out.

aventustudio commented 3 years ago

For people who don't want to wait: Home Assistant Supervised on RPI4 Debian Buster works great. Migration with a Snapshot worked perfectly fine. https://peyanski.com/how-to-install-home-assistant-supervised-official-way/#How_to_Install_Home_Assistant_Supervised

Cazimbo commented 3 years ago

I already did more installs over the past few days than I wanted to. I'll wait 😂. 6.x doesn't bring me anything I urgently need.

bschatzow commented 3 years ago

@aventu90 I agree with your comment. Once I understood the Debian directions I have had zero issues with HA running on it, using the same hardware. I have had issues with HAOS since November. I'm sure that if it crashed for all users the developers could figure out how to fix it. Identical hardware works for some and not for others. It has to be a timing issue on some of the Pi boards.

kds69 commented 3 years ago

Thanks! I had been wondering for weeks which part of my installation was erratically jamming my network and taking down my HA. I had suspected my separate OctoPi for some time, and removed the MQTT plugin + integration in HA, which helped a lot but didn't fix it definitively.

Following the same path: downgrading HA OS to 5.13 (booting 5.4 fails; Pi 4B 4GB, 32 GB SD card).

[Edit] Damn, Supervisor Audio is failing because PulseAudio is failing. This is a known issue when downgrading a supervised HA. Does anyone know how to force a rebuild of the supervisor? update doesn't work, or my command below is wrong: [image]

rousveiga commented 3 years ago

Does anyone know how to force a rebuild of the supervisor? update doesn't work, or my command below is wrong:

@kds69 I've seen ha supervisor repair thrown around, but I'm not sure of what it does exactly.

kds69 commented 3 years ago

Does anyone know how to force a rebuild of the supervisor? update doesn't work, or my command below is wrong:

@kds69 I've seen ha supervisor repair thrown around, but I'm not sure of what it does exactly.

Thanks, I tried this as well, but it didn't help. What did the trick was switching to the BETA channel, which forces the supervisor to rebuild for the newest beta release (stable 2021.06.6 -> beta 2021.06.8). Slightly risky, but much better than an unstoppable restart/fail loop of PulseAudio every minute! That may be sufficient for the time being, until the next release.

rousveiga commented 2 years ago

Hello! I had my installation "on pause" for a while because of a few network issues, and I'm currently in the midst of putting everything back together.

As a part of that, I upgraded to OS 6.3 and then 6.4. It doesn't seem to have crashed ever since. I will pay close attention to it, and if it works, I'll close this issue.

pataar commented 2 years ago

I've replaced my Raspberry Pi 4 with 2 GB of memory with an 8 GB unit. I haven't experienced any problems since.

lishan89uc commented 2 years ago

Can someone with the crashing issue check their CPU temperature with cat /sys/class/thermal/thermal_zone0/temp? My Raspberry Pi is running at 102.9 degrees C according to this...
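
A note for anyone reading that file: the Linux thermal sysfs interface reports the value in millidegrees Celsius, so a reading of 102.9 degrees C corresponds to a raw value of 102900. Below is a minimal sketch for converting and checking it; the function names and the 80 degree default limit are assumptions picked for illustration, not official thresholds.

```python
from pathlib import Path

# Path taken from the command in the comment above.
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")


def millidegrees_to_celsius(raw: str) -> float:
    """Convert the raw sysfs string (millidegrees Celsius) to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_cpu_temp(path: Path = THERMAL_ZONE) -> float:
    """Read the current temperature of thermal zone 0 in degrees Celsius."""
    return millidegrees_to_celsius(path.read_text())


def is_too_hot(celsius: float, limit: float = 80.0) -> bool:
    """Compare against a limit; the default is an illustrative assumption."""
    return celsius >= limit


if __name__ == "__main__":
    temperature = read_cpu_temp()
    print(f"{temperature:.1f} degrees C, too hot: {is_too_hot(temperature)}")
```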

rousveiga commented 2 years ago

HA OS 6.4 has been up for a week with no signs of crashing. It seems to be fixed.