home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
73.43k stars 30.67k forks source link

Crashing everyday #43155

Closed RavD666 closed 3 years ago

RavD666 commented 3 years ago

home assistant crashing everyday# The problem

Hass.io keeps crashing everyday,I can see it on my router. Have disconnect the power to boot it back up. Its only started doing it since the last few updates.

How can I get the logs to find the issue

Environment

Problem-relevant configuration.yaml

Traceback/Error logs

Additional information

eizemazal commented 3 years ago

@frenck thanks for suggestion, and pity that systemd logs are not available to users via HTTP frontend and SSH plugin (why? I set up root access via plugin but did not check whoami though). I was sure that there were no logs retained, thought about bothering to set up remote logging with syslogd. Will set up true SSH access or mount and check my SD card when get back home.

muzzak123 commented 3 years ago

Hi, as I mentioned above in previous posts, for me, there is nothing on the HDMI console. You replied 'That makes no sense, the HDMI will show something. Always.' - but there isn't. There's nothing to share. I think this is the problem that I and others are facing. It's not that we're trying to be difficult, it's that we have no way to diagnose or help resolve the problem. I have tried the solution offered at https://developers.home-assistant.io/docs/operating-system/debugging/#checking-the-logs but the instructions are a little vague. The command "docker logs hassos_supervisor" just returns "Error: No such container: hassos_supervisor" The command "# journalctl -f -u hassos-supervisor.service" gives me : Logs begin at Tue 2021-02-02 21:39:51 UTC. -- Feb 09 08:30:14 hassio hassos-supervisor[3532]: 21-02-09 08:30:14 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete Feb 09 08:30:14 hassio hassos-supervisor[3532]: 21-02-09 08:30:14 INFO (MainThread) [supervisor.host.services] Updating service information Feb 09 08:30:14 hassio hassos-supervisor[3532]: 21-02-09 08:30:14 INFO (MainThread) [supervisor.host.network] Updating local network information Feb 09 08:30:17 hassio hassos-supervisor[3532]: 21-02-09 08:30:17 INFO (MainThread) [supervisor.host.sound] Updating PulseAudio information Feb 09 08:30:17 hassio hassos-supervisor[3532]: 21-02-09 08:30:17 INFO (MainThread) [supervisor.host] Host information reload completed Feb 09 08:30:59 hassio hassos-supervisor[3532]: 21-02-09 08:30:59 INFO (MainThread) [supervisor.api.proxy] Home Assistant WebSocket API request initialize Feb 09 08:30:59 hassio hassos-supervisor[3532]: 21-02-09 08:30:59 INFO (MainThread) [supervisor.api.proxy] WebSocket access from a0d7b954_nodered Feb 09 08:30:59 hassio hassos-supervisor[3532]: 21-02-09 08:30:59 INFO (MainThread) [supervisor.api.proxy] Home Assistant WebSocket API request running Feb 09 08:45:06 hassio hassos-supervisor[3532]: 21-02-09 08:45:06 INFO (MainThread) [supervisor.snapshots] Found 6 snapshot files Feb 09 08:59:46 hassio hassos-supervisor[3532]: 21-02-09 08:59:46 INFO (MainThread) [supervisor.homeassistant.api] Updated Home Assistant API token but where do I go from there ?

Tovrin commented 3 years ago

Prior to a HASS.OS patch late last year, I used to see the console on the screen. After that, all I'd see would be the login prompt. I have no idea why that changed. The console just stopped displaying.

it's kind of all academic now anyway. It seems to be rock solid on the NUC. No crashes. I have saved that link for the future should ever I need it.

frenck commented 3 years ago

@frenck thanks for suggestion, and pity that systemd logs are not available to users via HTTP frontend and SSH plugin (why? I set up root access via plugin but did not check whoami though). I was sure that there were no logs retained, thought about bothering to set up remote logging with syslogd. Will set up true SSH access or mount and check my SD card when get back home.

I think with plugin you mean: add-on? If so, add-ons are separate Docker / isolated containers. What you see in there is a different world, and not the host operating system.

frenck commented 3 years ago

@muzzak123 The command "docker logs hassos_supervisor" just returns "Error: No such container: hassos_supervisor"

Try docker logs hassio_supervisor instead.

Please note, that those docs are examples, you can access any logs via the journalctl. As the cause is unknown at this point, it doesn't have to be the Supervisor. Please check other logs as well.

frenck commented 3 years ago

Additionally, if a system becomes unreachable, one could also try to see what the observer says, which runs at http://homeassistant.local:4357 (or http://<IP ADDRESS OF YOUR DEVICE>:4357)

eizemazal commented 3 years ago

@frenck yes, mean add-on, sorry for messing with the terminology, I am new to the project. It makes things clear to me that I was in a separate container after logging in. For now, I do not understand, why run SSH in a separate container, rendering it essentially useless until filesystem parts are explicitly shared with the host (and logs would be one of the first things to share, or why at all use SSH?). But I will read the docs first rather than asking stupid things here in the comments. Just did not think I would run into such a trouble right out of the box with my default installation. I will try to get the logs from the host filesystem and keep posted, thank you for your help.

muzzak123 commented 3 years ago

Ok, I'm new to the journalctl command so forgive me if I'm not au fait with it all. If others are interested I found a very good reference here https://www.digitalocean.com/community/tutorials/how-to-use-journalctl-to-view-and-manipulate-systemd-logs#journal-filtering-by-time I tried journalctl --list-boots and I got 34 records (boots) since 2021-02-02 21:50:24 so it seems to be happening quite a bit. Still not quite sure what I'm looking for. Any chance someone with far more knowledge than I, could layout some basic process/commands to follow to find the log details prior to the last reboot ?

eizemazal commented 3 years ago

Hi @frenck

Just made some experiments. At first, I felt lazy to remove the SD card, so after logging into SSH addon container, I tried to escalate my privileges instead of following the instruction to obtain root shell in a legitimate way. It was easily done by executing three commands:

➜  / docker pull busybox
➜  / docker create busybox
…md5sum here…
➜  / docker run -it -v /:/host/ busybox chroot /host /bin/sh
/ # whoami
root
/var # cd /var/log/
/var/log # ls
journal  private
/var/log # 

docker executable and socket were available, so I pulled a busybox image and run shell in it with unrestricted access to the entire host file system. Technically, I was root in container but with with full access to host FS, and gaining root execution context in host OS is trivial (say by tampering with crontab), so protection from container for SSH addon is nonexistent - probably need to create a security issue if the project relies on it.

While I was happily researching syslog, my problem reappeared, and threw me out of my root shell back into addon container. I could not access Docker any more to start busybox container. A few minutes later I restarted Pi, repeated my trick with privilege escalation, and made a log dump in busybox root shell:

#journalctl --since "2021-02-09 22:00:35" --until "2021-02-09 22:13:50" > /tmp/crash_log.txt
# docker cp /tmp/crash_log.txt 3833bfe6c19d:/tmp

and fetched it via sftp

Here is it attached for what it's worth, but I can hardly understand anything. A lot of things happened between 22:01:53 and 22:02:02 including messages similar to system reboot and messages quoting other messages with other date.

Do you have any idea?

crash_log.txt

muzzak123 commented 3 years ago

I've managed to get my logs out now. putty.log Doesn't seem to be anything obvious I can make out. There is System clock time unset or jumped backwards, restoring from recorded timestamp: after every reboot but I think that's because the RPI doesn't have battery backup for clock. I also notice this https://github.com/home-assistant/operating-system/issues/896 Not sure if it's related but it started around the time my issue started and there's this https://community.home-assistant.io/t/how-to-troubleshoot-crashing-hass-io-on-pi/55113/11 Interestingly PollieKrismis said it started happening around v2020.12 which agrees with my timeline. Also even though I can look at some historical logs now (thanks @frenck ) I don't seem to be able find any logs that take me right up to the point that the previous crash occurred. Can someone suggest what logs would help ?

Salamandar commented 3 years ago

I'm absolutely not sure, but it seems that the sqlite issue was the culprit… I'll wait a week before claiming victory, but well.

muzzak123 commented 3 years ago

@Salamandar I agree with @frenck in that I doubt its sqlite and more likely the new SD card. I switched to MariaDB back in January to see if it made any difference and the problem still persists.

Salamandar commented 3 years ago

@Salamandar I agree with @frenck in that I doubt its sqlite and more likely the new SD card. I switched to MariaDB back in January to see if it made any difference and the problem still persists.

I changed SD card, and that did not help at all.

muzzak123 commented 3 years ago

Sorry @Salamandar it wasn't clear about the SD card because above you said

Ah, and we changed the sdcard used, the previous one probably was the reason for the corrupted db

Anyway I thought about what you said about sqlite and given I've tried pretty much everything. 2 different PSUs, swapped to an SSD, disabled just about every add-on, I thought "What have I got to lose." So I deleted home-assistant_v2.db and log and rebooted and now it has been running for 12hrs straight. This is a record for me given I was having about 4 reboots a day since this started :smiley:

Also @Salamandar when you say

I removed the sqlite as documented somewhere in the community forum

can you please expand on what you did ?

My home-assistant_v2.db was up around 900M before I deleted it and I think you may be onto something. I suspect that when the file becomes too large, the PI cant update it quick enough and somehow crashes. I always have heaps of free memory (around 2.5G free out of 4), my CPU usually sits below 10% and never peaks over 60%, and temp never goes above 50C. I've added commit_interval: 10 to my recorder config to see if this helps

Salamandar commented 3 years ago

@muzzak123 oh ok no problem :)

can you please expand on what you did ?

We regenerated an empty database as it was suggested on the forum, but I can't find the post anymore, and it seems this regeneration is not required at all.

I suspect that when the file becomes too large, the PI cant update it quick enough and somehow crashes.

two things :

I've added commit_interval: 10 to my recorder config to see if this helps

Well… That's an interesting setting. Tell me if that does not cause an issue.

muzzak123 commented 3 years ago

@Salamandar Well that was short lived. I've had 3 more crashes. in the past 2 days, so it did seem to lessen the crash rate but its still there. I'm going back to my "watchdog power point" system. Although not ideal, at least that keeps my automation alive.

CarlosGS commented 3 years ago

Hi @muzzak123 I had a look at your log. It seems there is a lot of addons and devices, but to better understand the current status of your HASS:

Regarding https://github.com/home-assistant/core/issues/43155#issuecomment-752403137 one thing you could try is to leave a monitor connected and logged in to the main terminal (login as root, no password, then type login) monitoring the log in real time with journalctl --follow. When the system crashes you might be able to find the last lines there (if the monitor is still up...).

muzzak123 commented 3 years ago

Hi @CarlosGS. Thanks for your reply and I appreciate your time and help with this. Re Addons, I tried uninstalling many of them to see if it stopped the crashing issue, but it didn't seem to make any difference, so I put them back.

Addons Screen Addons Note: Frigate NVR, Google Assistant Webserver, SSH & Web Terminal and Terminal & SSH are installed but not started

Integrations Screen Integrations

Supervisor Screen Supervisor

System Screen CPU I included this screen to show what is happening with the system at crash time. I had a crash today at about 8:30 and you can notice the flat period where there are no values for about an hour. There doesn't seem to be any spike in activity in the system prior to this crash.

The setup

CarlosGS commented 3 years ago

Thank you, great idea to share CPU/Temp/RAM statistics prior to the crash. Additionally you could try peaking at the history tab, selecting the interval for those early minutes. Maybe the other statuses/datastreams can give you some insight, it's worth you give them a thorough look, for any devices that go unavailable, value fluctuations, etc.

Also, from the devices that you have, what's the fastest output? I'm asking because when using a Raspi I recall having an issue when 2 ESPhome sensors were feeding a sample every second. That caused HASS to go unresponsive, and delaying them solved it.

Next, I think you are currently using MariaDB, but if not, what's the current .db file size?

Finally, I know it's a hassle, but if you could share the log for the minutes before today's crash, that'd be great. And if you could set up the journalctl --follow command mentioned earlier, that should catch any last second errors that happen before the next crash.

(And worst case/inconvenient scenario, you could try setting up a HASSOS virtual machine to see if it hangs too. Not sure that'd be too helpful. Though I did end up switching from the raspi to a VM due to these sort of problems, reboot corruptions, etc)

muzzak123 commented 3 years ago

Hi @CarlosGS, I have setup the HDMI with journalctl --follow so am awaiting the next crash. I have zoomed in on the history just before the crash and didn't find anything unusual. The wind speed increased by 2kph, maybe that's it :smiley::smiley::smiley::smiley: The fastest devices I have would be the PIRs. I have 7 of them connected to the ESP. They are set to 4 second retrigger windows, so the shortest trigger period would be 4 seconds individually, but with movement in multiple areas of the house at the same time it is possible the trigger intervals on the ESP could be less then 4 seconds. But this would only be a few sensors for a very short period and I would not expect them to overwhelm the PI. I use MariaDB currently. I switched to that in an effort to resolve this problem. It happened on SQLite, and it still happens on MariaDB. My recorder config is currently recorder: auto_purge: true purge_keep_days: 3 commit_interval: 10 and the file size is at 565Mb I just had another crash then so here is the leadup log Crash Log.txt

I'm beginning to suspect, as it sounds like you are, that the problem is the PI4 hardware not able to cope. I have read a lot of other comments on the net of people that have found the same. I am considering rebuilding on more powerful hardware such as a NUC. Do you have any recommendations ?

CarlosGS commented 3 years ago

Let's hope you can catch something at the display :S

I don't have too much experience on MariaDB, so can't really advice with that. 565Mb sounds like a lot for 3 days, but maybe MariaDB takes more storage than SQlite.

I do however with ESPhome. In case it is a concurrency issue, I wouldn't mind having a look at the yaml files (do mask out ssid/keys). From the previous capture it seems there are two ESPs? Also let me know the ESPhome version they're currently running.

Regarding HASS hardware recommendations, I run it on a VirtualBox VM within my computer which is on all the time anyways. It takes 1.5GB of ram and a negligible amount of CPU. However, your Raspberry setup (cooled Rpi4, 4GB, USB 3.0 SSD, eth cable) is pretty much ideal and should work.

CarlosGS commented 3 years ago

It seems there's more people with similar issues (https://github.com/home-assistant/operating-system/issues/1201, https://github.com/home-assistant/operating-system/issues/1096, https://github.com/home-assistant/operating-system/issues/1119, https://www.reddit.com/r/homeassistant/comments/kf41se/ha_server_goes_awol_witout_reason/) No solution yet, let's continue debugging with what we can :angel:

muzzak123 commented 3 years ago

Wow - it is somewhat reassuring to know I'm not alone. It does seem quite wide spread and the timeline on https://github.com/home-assistant/operating-system/issues/1119 seems to fit with my issue occurring around the 2020.12.0 update. I'm wondering if there is someway to consolidate all these threads so others can input to help resolve it as well ? Seems they are recommending a downgrade to HassOS 5.4 as a temporary fix. I might try that.

The ESPHome is interesting. I only have one ESP32-WROOM physical device. In HA I disabled 2 gpios that I had configured in ESPHOME but am not using and it seems the integration has split it into 2 devices. gpio The top one contains the disabled entities and the bottom contains the enabled entities. Yaml is here gpio.txt

CarlosGS commented 3 years ago

That's an efficient use of a single ESP32! (I have 18 ESP8266 :sweat_smile:)

Hmm, I'd probably add debouncing on every pin. But maybe don't change anything until you can catch the error on the display.

    filters:
      - delayed_off: 500ms
      - delayed_on: 500ms

I'm very intrigued: why did you add delayed_off: 1000ms only to one of the PIR sensors?

ben8p commented 3 years ago

I'm also suffering from this bug... It was crashing every other day... But since 2 days it's crashing every hours. Like everyone else, I can not find anything useful in the logs...

Additionally, if a system becomes unreachable, one could also try to see what the observer says, which runs at http://homeassistant.local:4357 (or http://<IP ADDRESS OF YOUR DEVICE>:4357)

Tried this but ERR_CONNECTION_REFUSED... Oddly enough (like other people said), a ping still work... But everything else seems dead...

igosoft commented 3 years ago

Like everyone else I'm also having the same problem. It just started after I upgraded to the version including security patches. Everyday since that date, it crashes more frequently - started from once per two days and today twice per 12h. I noticed that the RP3 box is also warm ☹️

muzzak123 commented 3 years ago

@CarlosGS debouncing every pin is a good idea. I probably should do that. The delayed_off: 1000ms is because one of my sensors was playing up for a while giving a short burst of multiple triggers. I used this to "debounce" it 😃. I've since fixed the PIR but never got around to changing the code. The ESP32 replaces an old home security system so all the sensors were wired back to the same place, making it easy to install the ESP32.

Onto the topic at hand. I downgraded to HASSOS 5.4 last night and the system has now been running for 9 hrs solid without missing a beat. I really think the problem is in the new versions of HASSOS

Toukite commented 3 years ago

@frenck could you please just read https://community.home-assistant.io/t/better-logging/185694 ? When multiple people start saying the same thing, maybe it's simply time to start listening ... just saying :)

As for providing the logs, well ... personnally I just can't plug a screen that I don't have + RPI not easily accessible. And I don't have access to host as I run HassOS so it's either on or off, it's not like it was a VM ...

To put it simply, we're just asking if it could be possible that logs (core and supervisor) were made able to survive a restart. Log retention policy isn't really a new IT thing... Anyway, I don't mean to be abrupt, I know all this is voluntary work, but even so, I just don't see why this is becoming such a big deal.

CarlosGS commented 3 years ago

Hi @Toukite, the historic of full logs is saved internally (journald) and can be accessed via the command line. I agree it would be nice to have an easier way to retrieve them through the graphical interface.

However, I want to ask you to please reconsider your attitude towards the project maintainers. I know that these bugs can get the better of us at times - but please remember that this is a vast open source project. It's people giving their work for free. It can take a while to resolve every single issue that occurs, annoying as that can be. Sorry for the inconvenience.

igosoft commented 3 years ago

Here are the logs I have captured IMG_20210216_111333 IMG_20210216_124310

Just wondering if above could destroy my two SD cards since a month 🥺...

ben8p commented 3 years ago

Here are the logs I have captured IMG_20210216_111333 IMG_20210216_124310

Just wondering if above could destroy ma two SD card since a month 🥺...

Same log as you... I checked the card with "badblocks" and it is fine. 0 bad block found...

EDIT: I tried a fresh install 64bit and 32bit... It all starts ok, then suddenly the errors mentioned by @igosoft are dumped endlessly... Once the error is there, homeassistant is gone... even the observer is gone...

Toukite commented 3 years ago

@CarlosGS fully agreed. I was just reacting to previous frenck answers, which were more like "your need is bs I don't wanna hear it and it will never be considered" Than "okay maybe there is something to do here, we'll see if it's feasible and how to plan it. In the meantime, i can help differently etc..." Anyway, that point being made, no need to escalate :)

Toukite commented 3 years ago

the historic of full logs is saved internally (journald) and can be accessed via the command line

@CarlosGS could you please give more details regarding this ? My terminal says "bash: journald: command not found" Thanks !

CarlosGS commented 3 years ago

No problem! As I see it, in open-source projects there is always an inherent negative-feedback loop where all people ends up stressed - breaking that loop is in our hands.

Oh, I miswrote the command, it should be journalctl :angel: (Note however: this bug is particularly annoying, cause when the system freezes for good, the last minute logs don't make it to disk. yikes!)

Toukite commented 3 years ago

Thank you but I have the same error with journalctl :(

CarlosGS commented 3 years ago

Ah - the SSH/Terminal addon can't provide access to the main system, because it runs on a docker container. The "login" command doesn't work there either, to me at least.

To get to the main terminal, the only way I know is via the OS display (if you use a Raspi, then you'd need an HDMI screen and a keyboard). At "homeassistant login" you shoud use root, and no password. Then you'll get to the ha> terminal. And if you type login, now that gets you to the main terminal.

It does feel a bit like the movie "inception" :sweat_smile:

muzzak123 commented 3 years ago

Hi, @CarlosGS I agree. From everything I have seen on my system this particular bug is not making it to disk logs due to the crash. I have rolled my system back to HASSOS 5.4 and it has been running for over 34hrs without missing a beat. From reading comments here https://github.com/home-assistant/operating-system/issues/1119 and my own experience I'm convinced the problem I'm having has been introduced by changes made to HASSOS since 5.4. Perhaps instead of trying to "catch the problem" in phantom logs that may or may not exist, may I suggest a full review of the changes made since HASSOS 5.4, focussing in particular on those that may affect a RPI ?

Oh and @Toukite you can access an RPI via the Putty program using SSH, without a HDMI screen, but obviously with another computer on the network. Putty can also log all activity so the full logs are captured. Although I used this process many times I never had any joy capturing anything meaningful at the time of the crash.

ben8p commented 3 years ago

@muzzak123 following your idea, I re-installed from scratch a 5.5 using https://github.com/home-assistant/operating-system/releases/tag/5.5

So far, running for 12 hours without crashes...

muzzak123 commented 3 years ago

OK - in case anyone is wanting to try this, you just need to rollback HASSOS to 5.4. Everything else remains the same. You can do this by logging on as root and issuing the command ha os update --version 5.4

ben8p commented 3 years ago

OK - in case anyone is wanting to try this, you just need to rollback HASSOS to 5.4. Everything else remains the same. You can do this by logging on as root and issuing the command ha os update --version 5.4

One little note: I started by reverting to 5.4 but then it was not booting at all... Seems like it was a known issue of 5.4. See https://github.com/home-assistant/operating-system/releases/tag/5.4

muzzak123 commented 3 years ago

Dunno - I was just following here https://github.com/home-assistant/operating-system/issues/1119#issuecomment-777667147

5.5 and above changed the Kernel. Most people with the freeze problem run with no issues on any OS version below 5.5.

djdemer commented 3 years ago

Same experience here too... I attempted directly running the ha os update command above from root and after seeing the successful message appear -- it will no longer boot up on my Pi 3 B+. I guess I will reflash the MicroSD and start all over again.

igosoft commented 3 years ago

Yes, a lot happened in 5.5 release and since 5.5. It looks like sw releases should be more granulated eg 5.5.1 and categorised. Mixing kernel updates, bug fixes and new features in one build , will take more time the HA community to finger point the root cause

RavD666 commented 3 years ago

Working fine after a few updates

ben8p commented 3 years ago

Working fine after a few updates

Can you give more information? What updates did you do? And how long it is running fine?

frenck commented 3 years ago

@frenck could you please just read community.home-assistant.io/t/better-logging/185694 ? When multiple people start saying the same thing, maybe it's simply time to start listening ... just saying :)

Everything is logged. Using Linux can be hard maybe, but we use standard logging and everything is logged.

frenck commented 3 years ago

@RavD666 Closed? What did you upgrade to? And from which version? As in, at which point did these issue disappear for you?

Salamandar commented 3 years ago

Yeah… Recreating the sqlite fixed the issue… for some time. It crashed again.

EDIT : I increased the recorder size to 30 days, maybe that's the root of the issue.

muzzak123 commented 3 years ago

@frenck perhaps you can also ask the contributors to https://github.com/home-assistant/operating-system/issues/1119 to help with the logs as this appears to be exactly the same problem. The latest comment from on this thread says...

I have not heard back from any of the developers on this as of yet.

frenck commented 3 years ago

@muzzak123 I'm sorry to hear people have issues, but that issue you linked is polluted with all kinds of things. There is no way that issues will reach a resolvement, as it covers all kinds of things. An issue should contain a single issue. That issue should be closed IMHO.