Closed RavD666 closed 3 years ago
@frenck thanks for suggestion, and pity that systemd logs are not available to users via HTTP frontend and SSH plugin (why? I set up root access via plugin but did not check whoami though). I was sure that there were no logs retained, thought about bothering to set up remote logging with syslogd. Will set up true SSH access or mount and check my SD card when get back home.
Hi, as I mentioned above in previous posts, for me, there is nothing on the HDMI console. You replied 'That makes no sense, the HDMI will show something. Always.' - but there isn't. There's nothing to share.
I think this is the problem that I and others are facing. It's not that we're trying to be difficult, it's that we have no way to diagnose or help resolve the problem. I have tried the solution offered at https://developers.home-assistant.io/docs/operating-system/debugging/#checking-the-logs but the instructions are a little vague.
The command "docker logs hassos_supervisor" just returns "Error: No such container: hassos_supervisor"
The command "# journalctl -f -u hassos-supervisor.service" gives me :
Logs begin at Tue 2021-02-02 21:39:51 UTC. -- Feb 09 08:30:14 hassio hassos-supervisor[3532]: 21-02-09 08:30:14 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete Feb 09 08:30:14 hassio hassos-supervisor[3532]: 21-02-09 08:30:14 INFO (MainThread) [supervisor.host.services] Updating service information Feb 09 08:30:14 hassio hassos-supervisor[3532]: 21-02-09 08:30:14 INFO (MainThread) [supervisor.host.network] Updating local network information Feb 09 08:30:17 hassio hassos-supervisor[3532]: 21-02-09 08:30:17 INFO (MainThread) [supervisor.host.sound] Updating PulseAudio information Feb 09 08:30:17 hassio hassos-supervisor[3532]: 21-02-09 08:30:17 INFO (MainThread) [supervisor.host] Host information reload completed Feb 09 08:30:59 hassio hassos-supervisor[3532]: 21-02-09 08:30:59 INFO (MainThread) [supervisor.api.proxy] Home Assistant WebSocket API request initialize Feb 09 08:30:59 hassio hassos-supervisor[3532]: 21-02-09 08:30:59 INFO (MainThread) [supervisor.api.proxy] WebSocket access from a0d7b954_nodered Feb 09 08:30:59 hassio hassos-supervisor[3532]: 21-02-09 08:30:59 INFO (MainThread) [supervisor.api.proxy] Home Assistant WebSocket API request running Feb 09 08:45:06 hassio hassos-supervisor[3532]: 21-02-09 08:45:06 INFO (MainThread) [supervisor.snapshots] Found 6 snapshot files Feb 09 08:59:46 hassio hassos-supervisor[3532]: 21-02-09 08:59:46 INFO (MainThread) [supervisor.homeassistant.api] Updated Home Assistant API token
but where do I go from there ?
Prior to a HASS.OS patch late last year, I used to see the console on the screen. After that, all I'd see would be the login prompt. I have no idea why that changed. The console just stopped displaying.
it's kind of all academic now anyway. It seems to be rock solid on the NUC. No crashes. I have saved that link for the future should ever I need it.
@frenck thanks for suggestion, and pity that systemd logs are not available to users via HTTP frontend and SSH plugin (why? I set up root access via plugin but did not check whoami though). I was sure that there were no logs retained, thought about bothering to set up remote logging with syslogd. Will set up true SSH access or mount and check my SD card when get back home.
I think with plugin you mean: add-on? If so, add-ons are separate Docker / isolated containers. What you see in there is a different world, and not the host operating system.
@muzzak123 The command "docker logs hassos_supervisor" just returns "Error: No such container: hassos_supervisor"
Try docker logs hassio_supervisor
instead.
Please note, that those docs are examples, you can access any logs via the journalctl. As the cause is unknown at this point, it doesn't have to be the Supervisor. Please check other logs as well.
Additionally, if a system becomes unreachable, one could also try to see what the observer says, which runs at http://homeassistant.local:4357
(or http://<IP ADDRESS OF YOUR DEVICE>:4357
)
@frenck yes, mean add-on, sorry for messing with the terminology, I am new to the project. It makes things clear to me that I was in a separate container after logging in. For now, I do not understand, why run SSH in a separate container, rendering it essentially useless until filesystem parts are explicitly shared with the host (and logs would be one of the first things to share, or why at all use SSH?). But I will read the docs first rather than asking stupid things here in the comments. Just did not think I would run into such a trouble right out of the box with my default installation. I will try to get the logs from the host filesystem and keep posted, thank you for your help.
Ok, I'm new to the journalctl
command so forgive me if I'm not au fait with it all. If others are interested I found a very good reference here https://www.digitalocean.com/community/tutorials/how-to-use-journalctl-to-view-and-manipulate-systemd-logs#journal-filtering-by-time
I tried journalctl --list-boots
and I got 34 records (boots) since 2021-02-02 21:50:24 so it seems to be happening quite a bit.
Still not quite sure what I'm looking for.
Any chance someone with far more knowledge than I, could layout some basic process/commands to follow to find the log details prior to the last reboot ?
Hi @frenck
Just made some experiments. At first, I felt lazy to remove the SD card, so after logging into SSH addon container, I tried to escalate my privileges instead of following the instruction to obtain root shell in a legitimate way. It was easily done by executing three commands:
➜ / docker pull busybox
➜ / docker create busybox
…md5sum here…
➜ / docker run -it -v /:/host/ busybox chroot /host /bin/sh
/ # whoami
root
/var # cd /var/log/
/var/log # ls
journal private
/var/log #
docker executable and socket were available, so I pulled a busybox image and run shell in it with unrestricted access to the entire host file system. Technically, I was root in container but with with full access to host FS, and gaining root execution context in host OS is trivial (say by tampering with crontab), so protection from container for SSH addon is nonexistent - probably need to create a security issue if the project relies on it.
While I was happily researching syslog, my problem reappeared, and threw me out of my root shell back into addon container. I could not access Docker any more to start busybox container. A few minutes later I restarted Pi, repeated my trick with privilege escalation, and made a log dump in busybox root shell:
#journalctl --since "2021-02-09 22:00:35" --until "2021-02-09 22:13:50" > /tmp/crash_log.txt
# docker cp /tmp/crash_log.txt 3833bfe6c19d:/tmp
and fetched it via sftp
Here is it attached for what it's worth, but I can hardly understand anything. A lot of things happened between 22:01:53 and 22:02:02 including messages similar to system reboot and messages quoting other messages with other date.
Do you have any idea?
I've managed to get my logs out now.
putty.log
Doesn't seem to be anything obvious I can make out.
There is
System clock time unset or jumped backwards, restoring from recorded timestamp:
after every reboot but I think that's because the RPI doesn't have battery backup for clock.
I also notice this https://github.com/home-assistant/operating-system/issues/896
Not sure if it's related but it started around the time my issue started
and there's this
https://community.home-assistant.io/t/how-to-troubleshoot-crashing-hass-io-on-pi/55113/11
Interestingly PollieKrismis said it started happening around v2020.12 which agrees with my timeline.
Also even though I can look at some historical logs now (thanks @frenck ) I don't seem to be able find any logs that take me right up to the point that the previous crash occurred. Can someone suggest what logs would help ?
I'm absolutely not sure, but it seems that the sqlite issue was the culprit… I'll wait a week before claiming victory, but well.
@Salamandar I agree with @frenck in that I doubt its sqlite and more likely the new SD card. I switched to MariaDB back in January to see if it made any difference and the problem still persists.
@Salamandar I agree with @frenck in that I doubt its sqlite and more likely the new SD card. I switched to MariaDB back in January to see if it made any difference and the problem still persists.
I changed SD card, and that did not help at all.
Sorry @Salamandar it wasn't clear about the SD card because above you said
Ah, and we changed the sdcard used, the previous one probably was the reason for the corrupted db
Anyway I thought about what you said about sqlite and given I've tried pretty much everything. 2 different PSUs, swapped to an SSD, disabled just about every add-on, I thought "What have I got to lose." So I deleted home-assistant_v2.db and log and rebooted and now it has been running for 12hrs straight. This is a record for me given I was having about 4 reboots a day since this started :smiley:
Also @Salamandar when you say
I removed the sqlite as documented somewhere in the community forum
can you please expand on what you did ?
My home-assistant_v2.db was up around 900M before I deleted it and I think you may be onto something. I suspect that when the file becomes too large, the PI cant update it quick enough and somehow crashes. I always have heaps of free memory (around 2.5G free out of 4), my CPU usually sits below 10% and never peaks over 60%, and temp never goes above 50C. I've added commit_interval: 10 to my recorder config to see if this helps
@muzzak123 oh ok no problem :)
can you please expand on what you did ?
We regenerated an empty database as it was suggested on the forum, but I can't find the post anymore, and it seems this regeneration is not required at all.
I suspect that when the file becomes too large, the PI cant update it quick enough and somehow crashes.
two things :
I've added commit_interval: 10 to my recorder config to see if this helps
Well… That's an interesting setting. Tell me if that does not cause an issue.
@Salamandar Well that was short lived. I've had 3 more crashes. in the past 2 days, so it did seem to lessen the crash rate but its still there. I'm going back to my "watchdog power point" system. Although not ideal, at least that keeps my automation alive.
Hi @muzzak123 I had a look at your log. It seems there is a lot of addons and devices, but to better understand the current status of your HASS:
Regarding https://github.com/home-assistant/core/issues/43155#issuecomment-752403137 one thing you could try is to leave a monitor connected and logged in to the main terminal (login as root, no password, then type login) monitoring the log in real time with journalctl --follow
. When the system crashes you might be able to find the last lines there (if the monitor is still up...).
Hi @CarlosGS. Thanks for your reply and I appreciate your time and help with this. Re Addons, I tried uninstalling many of them to see if it stopped the crashing issue, but it didn't seem to make any difference, so I put them back.
Addons Screen Note: Frigate NVR, Google Assistant Webserver, SSH & Web Terminal and Terminal & SSH are installed but not started
Integrations Screen
Supervisor Screen
System Screen I included this screen to show what is happening with the system at crash time. I had a crash today at about 8:30 and you can notice the flat period where there are no values for about an hour. There doesn't seem to be any spike in activity in the system prior to this crash.
The setup
Thank you, great idea to share CPU/Temp/RAM statistics prior to the crash. Additionally you could try peaking at the history tab, selecting the interval for those early minutes. Maybe the other statuses/datastreams can give you some insight, it's worth you give them a thorough look, for any devices that go unavailable, value fluctuations, etc.
Also, from the devices that you have, what's the fastest output? I'm asking because when using a Raspi I recall having an issue when 2 ESPhome sensors were feeding a sample every second. That caused HASS to go unresponsive, and delaying them solved it.
Next, I think you are currently using MariaDB, but if not, what's the current .db file size?
Finally, I know it's a hassle, but if you could share the log for the minutes before today's crash, that'd be great. And if you could set up the journalctl --follow
command mentioned earlier, that should catch any last second errors that happen before the next crash.
(And worst case/inconvenient scenario, you could try setting up a HASSOS virtual machine to see if it hangs too. Not sure that'd be too helpful. Though I did end up switching from the raspi to a VM due to these sort of problems, reboot corruptions, etc)
Hi @CarlosGS, I have setup the HDMI with journalctl --follow
so am awaiting the next crash.
I have zoomed in on the history just before the crash and didn't find anything unusual. The wind speed increased by 2kph, maybe that's it :smiley::smiley::smiley::smiley:
The fastest devices I have would be the PIRs. I have 7 of them connected to the ESP. They are set to 4 second retrigger windows, so the shortest trigger period would be 4 seconds individually, but with movement in multiple areas of the house at the same time it is possible the trigger intervals on the ESP could be less then 4 seconds. But this would only be a few sensors for a very short period and I would not expect them to overwhelm the PI.
I use MariaDB currently. I switched to that in an effort to resolve this problem. It happened on SQLite, and it still happens on MariaDB.
My recorder config is currently
recorder: auto_purge: true purge_keep_days: 3 commit_interval: 10
and the file size is at 565Mb
I just had another crash then so here is the leadup log
Crash Log.txt
I'm beginning to suspect, as it sounds like you are, that the problem is the PI4 hardware not able to cope. I have read a lot of other comments on the net of people that have found the same. I am considering rebuilding on more powerful hardware such as a NUC. Do you have any recommendations ?
Let's hope you can catch something at the display :S
I don't have too much experience on MariaDB, so can't really advice with that. 565Mb sounds like a lot for 3 days, but maybe MariaDB takes more storage than SQlite.
I do however with ESPhome. In case it is a concurrency issue, I wouldn't mind having a look at the yaml files (do mask out ssid/keys). From the previous capture it seems there are two ESPs? Also let me know the ESPhome version they're currently running.
Regarding HASS hardware recommendations, I run it on a VirtualBox VM within my computer which is on all the time anyways. It takes 1.5GB of ram and a negligible amount of CPU. However, your Raspberry setup (cooled Rpi4, 4GB, USB 3.0 SSD, eth cable) is pretty much ideal and should work.
It seems there's more people with similar issues (https://github.com/home-assistant/operating-system/issues/1201, https://github.com/home-assistant/operating-system/issues/1096, https://github.com/home-assistant/operating-system/issues/1119, https://www.reddit.com/r/homeassistant/comments/kf41se/ha_server_goes_awol_witout_reason/) No solution yet, let's continue debugging with what we can :angel:
Wow - it is somewhat reassuring to know I'm not alone. It does seem quite wide spread and the timeline on https://github.com/home-assistant/operating-system/issues/1119 seems to fit with my issue occurring around the 2020.12.0 update. I'm wondering if there is someway to consolidate all these threads so others can input to help resolve it as well ? Seems they are recommending a downgrade to HassOS 5.4 as a temporary fix. I might try that.
The ESPHome is interesting. I only have one ESP32-WROOM physical device. In HA I disabled 2 gpios that I had configured in ESPHOME but am not using and it seems the integration has split it into 2 devices. The top one contains the disabled entities and the bottom contains the enabled entities. Yaml is here gpio.txt
That's an efficient use of a single ESP32! (I have 18 ESP8266 :sweat_smile:)
Hmm, I'd probably add debouncing on every pin. But maybe don't change anything until you can catch the error on the display.
filters:
- delayed_off: 500ms
- delayed_on: 500ms
I'm very intrigued: why did you add delayed_off: 1000ms
only to one of the PIR sensors?
I'm also suffering from this bug... It was crashing every other day... But since 2 days it's crashing every hours. Like everyone else, I can not find anything useful in the logs...
Additionally, if a system becomes unreachable, one could also try to see what the observer says, which runs at
http://homeassistant.local:4357
(orhttp://<IP ADDRESS OF YOUR DEVICE>:4357
)
Tried this but ERR_CONNECTION_REFUSED... Oddly enough (like other people said), a ping still work... But everything else seems dead...
Like everyone else I'm also having the same problem. It just started after I upgraded to the version including security patches. Everyday since that date, it crashes more frequently - started from once per two days and today twice per 12h. I noticed that the RP3 box is also warm ☹️
@CarlosGS debouncing every pin is a good idea. I probably should do that. The delayed_off: 1000ms
is because one of my sensors was playing up for a while giving a short burst of multiple triggers. I used this to "debounce" it 😃. I've since fixed the PIR but never got around to changing the code. The ESP32 replaces an old home security system so all the sensors were wired back to the same place, making it easy to install the ESP32.
Onto the topic at hand. I downgraded to HASSOS 5.4 last night and the system has now been running for 9 hrs solid without missing a beat. I really think the problem is in the new versions of HASSOS
@frenck could you please just read https://community.home-assistant.io/t/better-logging/185694 ? When multiple people start saying the same thing, maybe it's simply time to start listening ... just saying :)
As for providing the logs, well ... personnally I just can't plug a screen that I don't have + RPI not easily accessible. And I don't have access to host as I run HassOS so it's either on or off, it's not like it was a VM ...
To put it simply, we're just asking if it could be possible that logs (core and supervisor) were made able to survive a restart. Log retention policy isn't really a new IT thing... Anyway, I don't mean to be abrupt, I know all this is voluntary work, but even so, I just don't see why this is becoming such a big deal.
Hi @Toukite, the historic of full logs is saved internally (journald) and can be accessed via the command line. I agree it would be nice to have an easier way to retrieve them through the graphical interface.
However, I want to ask you to please reconsider your attitude towards the project maintainers. I know that these bugs can get the better of us at times - but please remember that this is a vast open source project. It's people giving their work for free. It can take a while to resolve every single issue that occurs, annoying as that can be. Sorry for the inconvenience.
Here are the logs I have captured
Just wondering if above could destroy my two SD cards since a month 🥺...
Here are the logs I have captured
Just wondering if above could destroy ma two SD card since a month 🥺...
Same log as you... I checked the card with "badblocks" and it is fine. 0 bad block found...
EDIT: I tried a fresh install 64bit and 32bit... It all starts ok, then suddenly the errors mentioned by @igosoft are dumped endlessly... Once the error is there, homeassistant is gone... even the observer is gone...
@CarlosGS fully agreed. I was just reacting to previous frenck answers, which were more like "your need is bs I don't wanna hear it and it will never be considered" Than "okay maybe there is something to do here, we'll see if it's feasible and how to plan it. In the meantime, i can help differently etc..." Anyway, that point being made, no need to escalate :)
the historic of full logs is saved internally (journald) and can be accessed via the command line
@CarlosGS could you please give more details regarding this ? My terminal says "bash: journald: command not found" Thanks !
No problem! As I see it, in open-source projects there is always an inherent negative-feedback loop where all people ends up stressed - breaking that loop is in our hands.
Oh, I miswrote the command, it should be journalctl
:angel: (Note however: this bug is particularly annoying, cause when the system freezes for good, the last minute logs don't make it to disk. yikes!)
Thank you but I have the same error with journalctl :(
Ah - the SSH/Terminal addon can't provide access to the main system, because it runs on a docker container. The "login" command doesn't work there either, to me at least.
To get to the main terminal, the only way I know is via the OS display (if you use a Raspi, then you'd need an HDMI screen and a keyboard). At "homeassistant login" you shoud use root
, and no password. Then you'll get to the ha>
terminal. And if you type login
, now that gets you to the main terminal.
It does feel a bit like the movie "inception" :sweat_smile:
Hi, @CarlosGS I agree. From everything I have seen on my system this particular bug is not making it to disk logs due to the crash. I have rolled my system back to HASSOS 5.4 and it has been running for over 34hrs without missing a beat. From reading comments here https://github.com/home-assistant/operating-system/issues/1119 and my own experience I'm convinced the problem I'm having has been introduced by changes made to HASSOS since 5.4. Perhaps instead of trying to "catch the problem" in phantom logs that may or may not exist, may I suggest a full review of the changes made since HASSOS 5.4, focussing in particular on those that may affect a RPI ?
Oh and @Toukite you can access an RPI via the Putty program using SSH, without a HDMI screen, but obviously with another computer on the network. Putty can also log all activity so the full logs are captured. Although I used this process many times I never had any joy capturing anything meaningful at the time of the crash.
@muzzak123 following your idea, I re-installed from scratch a 5.5 using https://github.com/home-assistant/operating-system/releases/tag/5.5
So far, running for 12 hours without crashes...
OK - in case anyone is wanting to try this, you just need to rollback HASSOS to 5.4. Everything else remains the same.
You can do this by logging on as root and issuing the command ha os update --version 5.4
OK - in case anyone is wanting to try this, you just need to rollback HASSOS to 5.4. Everything else remains the same. You can do this by logging on as root and issuing the command
ha os update --version 5.4
One little note: I started by reverting to 5.4 but then it was not booting at all... Seems like it was a known issue of 5.4. See https://github.com/home-assistant/operating-system/releases/tag/5.4
Dunno - I was just following here https://github.com/home-assistant/operating-system/issues/1119#issuecomment-777667147
5.5 and above changed the Kernel. Most people with the freeze problem run with no issues on any OS version below 5.5.
Same experience here too... I attempted directly running the ha os update command above from root and after seeing the successful message appear -- it will no longer boot up on my Pi 3 B+. I guess I will reflash the MicroSD and start all over again.
Yes, a lot happened in 5.5 release and since 5.5. It looks like sw releases should be more granulated eg 5.5.1 and categorised. Mixing kernel updates, bug fixes and new features in one build , will take more time the HA community to finger point the root cause
Working fine after a few updates
Working fine after a few updates
Can you give more information? What updates did you do? And how long it is running fine?
@frenck could you please just read community.home-assistant.io/t/better-logging/185694 ? When multiple people start saying the same thing, maybe it's simply time to start listening ... just saying :)
Everything is logged. Using Linux can be hard maybe, but we use standard logging and everything is logged.
@RavD666 Closed? What did you upgrade to? And from which version? As in, at which point did these issue disappear for you?
Yeah… Recreating the sqlite fixed the issue… for some time. It crashed again.
EDIT : I increased the recorder size to 30 days, maybe that's the root of the issue.
@frenck perhaps you can also ask the contributors to https://github.com/home-assistant/operating-system/issues/1119 to help with the logs as this appears to be exactly the same problem. The latest comment from on this thread says...
I have not heard back from any of the developers on this as of yet.
@muzzak123 I'm sorry to hear people have issues, but that issue you linked is polluted with all kinds of things. There is no way that issues will reach a resolvement, as it covers all kinds of things. An issue should contain a single issue. That issue should be closed IMHO.
home assistant crashing everyday# The problem
Hass.io keeps crashing everyday,I can see it on my router. Have disconnect the power to boot it back up. Its only started doing it since the last few updates.
How can I get the logs to find the issue
Environment
Problem-relevant
configuration.yaml
Traceback/Error logs
Additional information