home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Home Assistant becomes unavailable for 1 or 2 minutes, and comes back - several times a day. #105003

Closed TroLoos closed 8 months ago

TroLoos commented 9 months ago

The problem

Hi, recently I've realised that my Home Assistant has become quite unstable.

It's going offline several times a day, and is unavailable for a minute or two. Then it comes back.

I noticed this because during this "outage" window my HomeKit is also not working (devices do not show their current status), my NodeRed automations are not working (I assume NR is working fine, just HA doesn't respond), and the Home Assistant GUI is also not working.

I started to monitor this via Uptime Kuma integration and here's the result:

Screenshot 2023-12-04 at 14 13 04

This monitor checks HTTP availability of the Home Assistant web interface, and the red lines show when the check failed.
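
For anyone who wants a similar check without Uptime Kuma, here is a minimal Python sketch of the idea: probe the web interface over HTTP at a fixed interval and collapse the failed probes into outage windows. The function names, the 20-second interval and the timeout are assumptions made for illustration, not what Uptime Kuma does internally.

```python
import http.client
import time
import urllib.error
import urllib.request
from datetime import datetime


def probe(url, timeout=5.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, just with an error status (e.g. 401 or 404).
        return err.code < 500
    except (OSError, http.client.HTTPException):
        # Timeouts, refused connections, DNS failures, broken responses.
        return False


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (first_fail, last_fail) windows."""
    windows = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            windows.append((start, last_fail))
            start = None
    if start is not None:
        # The series ended while the target was still failing.
        windows.append((start, last_fail))
    return windows


def monitor(url, checks, interval=20.0):
    """Probe the URL `checks` times and return the outage windows seen."""
    samples = []
    for i in range(checks):
        samples.append((datetime.now(), probe(url)))
        if i < checks - 1:
            time.sleep(interval)
    return outage_windows(samples)
```

Something like `monitor("http://homeassistant.local:8123", checks=180)` would watch for about an hour. The hostname and port here are placeholders for your own instance.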

I don't have any logs because I'm not sure where to look or what to look for. If you could advise, I will come back with more information. I'm planning to do a fresh install and restore from backup to check whether the issue comes back on a fresh installation.

What version of Home Assistant Core has the issue?

core-2023.11.3

What was the last working version of Home Assistant Core?

not sure

What type of installation are you running?

Home Assistant OS

Integration causing the issue

not sure

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

joostlek commented 9 months ago

Can you maybe just post the logs?

TroLoos commented 9 months ago

Yes, sure - I've been waiting for this to happen again so I could capture as small a log as possible.

Home Assistant timed out at: 16:31, 16:32, 16:52 and 16:53.

I'm not sure if I've been looking in the right place (log type, log level, etc.). But maybe it could be a good starting point.

home-assistant_2023-12-04T16-37-09.403Z.log

mib1185 commented 9 months ago

tbh, this log looks terrible ... many network related issues with different integrations. you should check your whole environment first. What is your exact setup (installation method, hardware/virtualized, ...)?

TroLoos commented 9 months ago

> tbh, this log looks terrible ... many network related issues with different integrations. you should check your whole environment first. What is your exact setup (installation method, hardware/virtualized, ...)?

Yeah, I know... actually I kind of got used to timeouts that are network-related. I figured that at my scale it's quite normal to have devices dropping from time to time, but I will look into it. I have many integrations running (around 47 of them) that handle a lot of devices, and it's quite common that some of them time out, not only within my network but also against cloud servers (ViCare, for example, limits API access and that causes timeouts).

My setup is based on a Proxmox hypervisor, with HassOS installed as a VM. My network seems to be quite fine: a UniFi Dream Machine Pro as the router, a UniFi 24-port PoE as the core switch, and 7 UniFi APs across the house. I would think it should be rock solid.

I tried moving my HA VM to another Proxmox host, but the experience is the same. Today I will try to restore my HA from backup on a fresh HassOS VM, we'll see how it goes.

mib1185 commented 9 months ago

you might also want to disable all custom integrations (there are 16 of them) and check if the instability remains or not

TroLoos commented 9 months ago

> you might also want to disable all custom integrations (there are 16 of them) and check if the instability remains or not

Yes, I will probably do that... thanks a lot for this hint.

About the timeouts: what I observed is that only HA becomes unavailable in my network, and not ping-wise; somehow connectivity to HA stops. I don't lose a single ping to my HA VM, but for this minute or two HA itself cannot connect to any other device:

Screenshot 2023-12-05 at 09 51 51

Considering the above, the problem is not in my network, it's HA. Disabling custom integrations as a method of elimination is a very good idea, and I will try that.
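
The "ping works but HA doesn't answer" observation can be narrowed down a bit further by checking each layer separately. The sketch below is only an illustration of that idea: it tests whether the TCP port accepts connections and, separately, whether anything answers an HTTP request on it. The helper names and the wording of the verdicts are made up for this example, and port 8123 (the Home Assistant default) is an assumption about the setup.

```python
import http.client
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_answers(host, port, timeout=5.0):
    """Return True if any HTTP response comes back for GET / in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        conn.getresponse().read()
        return True
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


def diagnose(tcp_ok, http_ok):
    """Turn the two layer checks into a rough, non-authoritative verdict."""
    if not tcp_ok:
        return "port unreachable: look at the VM, network or port mapping"
    if not http_ok:
        return "port open but no HTTP answer: the HA process may be stalled"
    return "ok"
```

During an outage, `diagnose(tcp_reachable(host, 8123), http_answers(host, 8123))` returning the second verdict would point at the application rather than the network, which matches pings still getting through.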

One idea that comes to my mind - perhaps I am overloading HA's WebSocket with NodeRed? I have like 1259 points of connection to HA (# of Configuration Nodes), is it possible that this overloads HA in some way?

TroLoos commented 9 months ago

Hello again, I've been searching here and there, and disabled my integrations in groups overnight to check if the problem still exists, and it does indeed. Actually, I don't have 16 custom integrations; some of them are Lovelace components. As for integrations, I have around 9, and even after disabling them nothing improved. I also tried disabling my add-ons, and it doesn't change a thing.

I've even enabled the debug level in the logger. Overnight my log grew to 10GB, and looking through it confirms that there is a problem with communication during these outage windows. In this log I normally have at least tens of logged events every second, but when an outage comes, it looks like this:

https://pastebin.com/jjS89VQA

and then everything comes back to life again (the outage is from 01:08:40 till 01:09:08). There is almost no activity in my log file during that window (normal is tens or even hundreds of events per second). This leads me to assume that it is not a matter of integrations but rather of the internal network (the core Docker network perhaps, since some events still come through). As I mentioned before, pinging the HA machine from another network device works all the time.
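
Finding such quiet windows in a 10GB log by eye is painful, so here is a rough Python sketch that counts log lines per second and reports stretches where the rate drops below a threshold. It assumes each line starts with a `YYYY-MM-DD HH:MM:SS` timestamp (which is how the default Home Assistant log format looks, as far as I know). The threshold, the minimum window length and the function names are arbitrary choices for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed line prefix, e.g. "2023-12-06 01:08:40.123 DEBUG (MainThread) ..."
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
FMT = "%Y-%m-%d %H:%M:%S"


def events_per_second(lines):
    """Count log lines per second; lines without a timestamp are skipped."""
    counts = Counter()
    for line in lines:
        match = TS.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def quiet_windows(counts, threshold=5, min_seconds=10):
    """Return (start, end) spans where fewer than `threshold` lines per
    second were logged for at least `min_seconds` consecutive seconds."""
    if not counts:
        return []
    first = datetime.strptime(min(counts), FMT)
    last = datetime.strptime(max(counts), FMT)
    one = timedelta(seconds=1)
    windows = []
    start = None
    t = first
    while t <= last:
        quiet = counts.get(t.strftime(FMT), 0) < threshold
        if quiet and start is None:
            start = t
        elif not quiet and start is not None:
            if (t - start).total_seconds() >= min_seconds:
                windows.append((start, t - one))
            start = None
        t += one
    # A quiet run that lasts until the end of the log.
    if start is not None and (last - start).total_seconds() + 1 >= min_seconds:
        windows.append((start, last))
    return windows
```

Since a file object is iterated line by line, `events_per_second(open("home-assistant.log", errors="replace"))` streams the file instead of loading it all into memory. The file name is a placeholder.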

Are there any other logs I can search through to dig deeper? I would prefer not to publish the full log file: first, it is huge, and second, it contains some sensitive network information.

joostlek commented 8 months ago

I guess there aren't more logs to check. Your best bet would be to catch such a moment and check it out, could be an internet hiccup or something.

TroLoos commented 8 months ago

Thanks, I think I got it now...

I've set up a brand new instance of HA and started to move things over, comparing the stability of the old and the new one. I also moved all add-ons to a dedicated Docker VM. From now on I want to keep my HA setup as simple as possible, so no more 20+ add-ons here.

It turned out that a custom integration, SolarEdge Modbus, does indeed make it quite unstable. On a fresh HA install (to make it easy I installed it as a Docker container) with only this integration running, HA behaves like my old one, hanging from time to time:

Screenshot 2023-12-24 at 15 45 44

So I decided to leave this integration on that separate instance and send its results to InfluxDB.

Here's what my new instance looks like after migrating almost everything (except for the add-ons, which I moved to the Docker VM):

Screenshot 2023-12-24 at 15 47 36

It looks almost perfect; those spikes might be moments where I do some reconfiguration / backup / etc.

So we can close this issue now. It's very, very surprising to me that a simple Modbus integration could cause such behaviour, and it cost me quite a lot of time... but in the end I have a fresh HA installation with 100% consistency in entity_ids and almost no customization in this area, so it should be pretty simple to set up again if such a move is ever required.