home-assistant / core

ZHA becomes unavailable every time a HA Core Update is available #116970

Closed KaosApplication closed 1 month ago

KaosApplication commented 5 months ago

The problem

For a long time I have noticed that my Zigbee Home Automation (ZHA) regularly stops working. I have around 17 devices connected through the SkyConnect dongle, plus Matter with 1 device, which keeps working. Whenever the ZHA integration stopped working and devices went unavailable, I saw a new Core update available, so I assumed this Core update was essential for ZHA. For now I simply restart HA to get ZHA working normally again.

What version of Home Assistant Core has the issue?

core-2024.4.0

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

ZHA

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

config_entry-zha-27bf6429ccb3de64632943e4a3ae6d68(1).json

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

home-assistant[bot] commented 5 months ago

Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

KaosApplication commented 5 months ago

I have now activated debug logging for the ZHA integration. Currently the HA Core update 2024.5.2 is showing. As soon as 2024.5.3 gets pushed and crashes the ZHA integration, I will be able to stop debug logging and upload the log. I reckon I have to stop debug logging BEFORE the Home Assistant reboot so as not to lose the logs?

mediacutlet commented 4 months ago

If true, this is bizarre behavior, but it could be what is causing my recent ZHA crashes. Here are some relevant log entries (I think). I've seen a couple of other open threads that seem related to my watchdog errors; I'm not sure whether they all share the same root issue. I am running HA version 2024.4.4 in a Docker container.

2024-05-05 06:28:54.521 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: TimeoutError()
2024-05-05 06:28:57.729 ERROR (bellows.thread_0) [bellows.uart] Lost serial connection: ConnectionResetError('Failed to transmit ASH frame after 4 retries')
2024-05-05 06:28:57.732 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart
2024-05-05 06:30:20.625 ERROR (bellows.thread_0) [bellows.uart] Lost serial connection: ConnectionResetError('Remote server closed connection')
2024-05-05 06:30:20.629 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry tcp://192.168.0.51:2003 for zha
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 551, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 153, in async_setup_entry
    zha_gateway = await ZHAGateway.async_from_config(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 197, in async_from_config
    await instance.async_initialize()
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 215, in async_initialize
    await app.startup(auto_form=True)
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 233, in startup
    await self.connect()
  File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 148, in connect
    await ezsp.startup_reset()
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 125, in startup_reset
    await self.reset()
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 151, in reset
    await self._gw.reset()
asyncio.exceptions.CancelledError
2024-05-05 10:58:52.786 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: TimeoutError()
2024-05-05 10:58:55.993 ERROR (bellows.thread_0) [bellows.uart] Lost serial connection: ConnectionResetError('Failed to transmit ASH frame after 4 retries')
2024-05-05 10:58:55.995 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart

puddly commented 4 months ago

Watchdog failures are not a bug, they are intentional. ZHA constantly tries to communicate with your coordinator even when no devices are communicating on the network. If the coordinator stops responding, it is considered dead. The error you are seeing is saying that your coordinator has stopped responding.
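
The heartbeat idea described here can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the bellows implementation: the names `watchdog`, `ping`, and `CoordinatorDead`, as well as the interval and timeout values, are hypothetical.

```python
import asyncio


class CoordinatorDead(Exception):
    """Hypothetical error: the coordinator missed a heartbeat."""


async def watchdog(ping, interval=10.0, timeout=5.0, max_beats=None):
    """Call `ping` periodically and fail if it does not answer in time.

    `ping` is any zero-argument coroutine function that talks to the radio.
    `max_beats` exists only so the loop can be stopped in a demo or test.
    """
    beats = 0
    while max_beats is None or beats < max_beats:
        try:
            # Even on an idle network the coordinator must answer the ping.
            await asyncio.wait_for(ping(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            # No answer within `timeout`: treat the coordinator as dead.
            raise CoordinatorDead("heartbeat timeout") from exc
        beats += 1
        await asyncio.sleep(interval)
    return beats
```

In a sketch like this, a timeout does not say why the coordinator went silent (crashed add-on, network outage, unplugged stick); it only reports that it did.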

mediacutlet commented 4 months ago

Thanks for that info @puddly. I am pretty new to analyzing debug logs and am learning as I go. This is also my first foray into Z-Wave and Zigbee. Since this is a new install and has yet to be completely stable, I don't have enough information to know whether what I am experiencing is due to my setup or a deeper issue. The frequency with which my coordinator goes offline seems to parallel what @KaosApplication described above, but I can't be sure. I'll probably hang in the shadows for a bit longer and monitor my environment, logs, and coordinator uptime. FWIW I am using a Homeseer Z-Net, and I run HA and everything else in Docker containers.

wim-bart commented 4 months ago

Here I am able to read the Zigbee sensors and all my plugs, but I am not able to control them anymore. HA sets all devices to the desired state after booting, but once it is ready nothing works except reading data from the devices. I can see that reading works because correct values come in.

The only thing that stopped working is HA toggling a switch on a device. And this has been the case since the upgrade to HAOS 12.3 today.

Core 2024.5.2
Supervisor 2024.05.1
Operating System 12.3
Frontend 20240501.1

mediacutlet commented 4 months ago

Bookending my contribution to this thread; my coordinator went down again in the middle of the night but there has not been a new release of Home Assistant so my issue is unrelated. I power cycle my network every day around 4 am; the controller does not come back online (sometimes) after a network power cycle. I'll do some more logging before I open a specific thread for support. Thanks.

wim-bart commented 4 months ago

> Bookending my contribution to this thread; my coordinator went down again in the middle of the night but there has not been a new release of Home Assistant so my issue is unrelated. I power cycle my network every day around 4 am; the controller does not come back online (sometimes) after a network power cycle. I'll do some more logging before I open a specific thread for support. Thanks.

I had the same issue and other issues, but with the latest version of HA OS it got worse. I think there is an underlying issue with the Zigbee implementation, and most other issues are just symptoms of deeper problems in that implementation.

When I scan the open issues, I see several that could be related to each other because at their base they trace back to Zigbee, even when the integration experiencing problems is not ZHA. I think the problem is deeper in the system. And with every new version of HA OS and ha-core it goes from bad to worse, until it finally breaks down totally.

With the current state of issues, the WAF (Wife Acceptance Factor) gets lower and lower, and here at home it is becoming more like “get a stable solution”. For days now my gf has complained about not being able to control things: first the thermostat not working, now some power switches not working, and automations not working due to Zigbee issues.

HA should focus on a stable release and temporarily stop development of new features. New features suck when base features do not work or are unstable. Also concentrate on documentation, because that is the Achilles' heel of HA. If HA wants to be a product for the future, make it maintainable for users and make it simple for users to find out how things work. The reason to choose HA was its openness, but without a stable product, people might switch back to closed source because that is stable.

Priorities:

  1. Stable product
  2. Good documentation (with simple examples)
  3. Good connectivity with many products
  4. New features.

In my opinion a product is not ready and stable without complete documentation. If I search for “how to create a sensor”, I should get an example and be able to complete the task in minutes. Now you need to find multiple resources and spend hours to get things done. Not good.

mediacutlet commented 4 months ago

Interesting @wim-bart. For what it's worth, when my network comes back up and Zigbee does not, the integration looks like this (below). I don't even need to restart HA to bring it back online; I just reload the integration. I am using HA and zwavejs-zwave-js-ui in separate containers communicating with a Z-Net over LAN.

image image

I see the watchdog / timeout Zigbee error around the time my network comes back online.

Logger: bellows.zigbee.application
Source: runner.py:189
First occurred: 4:53:28 AM (1 occurrences)
Last logged: 4:53:28 AM

Watchdog heartbeat timeout: TimeoutError()

I considered coming up with a script that forces this integration to reload every day, but I would rather solve the underlying issue than apply a band-aid.
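
For reference, a stopgap reload script of that kind could look roughly like the sketch below. It assumes Home Assistant's REST API is reachable, that a long-lived access token has been created, and that the `homeassistant.reload_config_entry` service is available in the installed version. The base URL, token, and entry ID are placeholders, and the helper names `build_reload_request` and `reload_entry` are hypothetical.

```python
import json
import urllib.request


def build_reload_request(base_url, token, entry_id):
    """Build a POST request that asks HA to reload one config entry."""
    url = f"{base_url.rstrip('/')}/api/services/homeassistant/reload_config_entry"
    body = json.dumps({"entry_id": entry_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Long-lived access token created in the HA user profile.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def reload_entry(base_url, token, entry_id, timeout=10):
    """Send the request and return the HTTP status code."""
    request = build_reload_request(base_url, token, entry_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Run on a schedule, this only hides the symptom; it does nothing about a coordinator that stops responding.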

KaosApplication commented 4 months ago

Zigbee Integration crashed.

2024-05-10 14_43_42-Einstellungen – Home Assistant – Mozilla Firefox

Time: 10.05.2024 ~12:00
This time NO NEW HA UPDATE AVAILABLE!
Debug logging deactivated and the log downloaded, but 374 MB is too big for the 25 MB upload limit. Trimmed to just today's entries (around 74 MB) and zipped to 4 MB: home-assistant_zha_2024-05-10T12-42-47.344Z_MODIFIED_SINGLE_DAY.zip
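
Trimming a large debug log to a single day, as was done by hand here, can be sketched as below. This assumes every log entry starts with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp, as in the excerpts in this thread. The function names and file paths are hypothetical.

```python
import re

# Log entries in this thread start with "YYYY-MM-DD HH:MM:SS.mmm ".
_TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}\.\d{3} ")


def filter_log_by_date(lines, date):
    """Yield the lines that belong to entries logged on `date`.

    Lines without a timestamp (for example traceback frames) are treated
    as continuations of the most recent timestamped entry.
    """
    keep = False
    for line in lines:
        match = _TIMESTAMP.match(line)
        if match:
            keep = match.group(1) == date
        if keep:
            yield line


def trim_log_file(src_path, dst_path, date):
    """Stream `src_path` to `dst_path`, keeping only entries from `date`."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_log_by_date(src, date))
```

For example, `trim_log_file("home-assistant.log", "single_day.log", "2024-05-10")` would stream the file line by line, so even a log of several hundred megabytes does not need to fit in memory.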

The Matter device also became unreachable, but without an error. The Matter system configuration is here: config_entry-matter-c7e59a243abf60752b40b9a7c4b3d130.json

Also, a reload does NOT solve this issue for me, @mediacutlet; a complete reboot of HA is necessary. I made an automation that lets me restart HA from the notification bar on my phone.

Edit: I read out my automation (which triggers 5 min after a Zigbee device becomes unavailable) for the exact time. Executed: 10 May 2024 at 12:28:32, so minus 5 min.

edit2: here it is I guess. line 123164

2024-05-10 12:23:32.764 DEBUG (MainThread) [bellows.uart] Connection lost: ConnectionResetError('Remote server closed connection')
2024-05-10 12:23:32.766 ERROR (MainThread) [bellows.uart] Lost serial connection: ConnectionResetError('Remote server closed connection')
2024-05-10 12:23:32.766 DEBUG (MainThread) [bellows.ezsp] socket://core-silabs-multiprotocol:9999 connection lost unexpectedly: Remote server closed connection
2024-05-10 12:23:32.766 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart
2024-05-10 12:23:32.767 DEBUG (MainThread) [bellows.zigbee.application] Received _reset_controller_application frame with ("Serial connection loss: ConnectionResetError('Remote server closed connection')",)
2024-05-10 12:23:32.767 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: "Serial connection loss: ConnectionResetError('Remote server closed connection')"
2024-05-10 12:23:32.768 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: "Serial connection loss: ConnectionResetError('Remote server closed connection')"
2024-05-10 12:23:32.768 DEBUG (MainThread) [bellows.uart] Connection lost: None
2024-05-10 12:23:32.768 DEBUG (MainThread) [bellows.uart] Closed serial connection
2024-05-10 12:23:32.768 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Shutting down ZHA ControllerApplication

edit3: As the Matter device also went unavailable, I collected logs from the Silicon Labs add-on: Silicon Labs Multiprotocol log.txt. I saw that the add-on watchdog was not set to restart automatically. Reloading the Zigbee integration now works after restarting the add-on.

wim-bart commented 4 months ago

> Interesting @wim-bart. For what it's worth, when my network comes back up and Zigbee does not, the integration looks like this (below). I don't even need to restart HA to bring it back online; I just reload the integration. I am using HA and zwavejs-zwave-js-ui in separate containers communicating with a Z-Net over LAN.
>
> image image
>
> I see the watchdog / timeout Zigbee error around the time my network comes back online.
>
> Logger: bellows.zigbee.application
> Source: runner.py:189
> First occurred: 4:53:28 AM (1 occurrences)
> Last logged: 4:53:28 AM
>
> Watchdog heartbeat timeout: TimeoutError()
>
> I considered coming up with a script that forces this integration to reload every day, but I would rather solve the underlying issue than apply a band-aid.

Sometimes a reload works, but I still cannot turn devices on or off. The strange thing is that when HA OS reboots (not just HA), the devices get initialized as expected, with no issues at all. But controlling them from an automation, the frontend, or even the device views does not work.

The devices can be queried without issue. When I switch one off with the manual button on the device, the button in HA toggles, but toggling it from HA does not work.

And sometimes, a couple of seconds after 4, devices magically disappear: sometimes all of them, sometimes just one or two. It is random, not always the same device.

Sometimes (more often not than yes) restarting the ZHA integration helps to see the devices again, and sometimes reloading HA also works, but mostly I need to reboot HA OS completely to get it working again.

I already had issues after upgrading ha-core to the latest version, but since upgrading HA OS the Zigbee functionality is totally broken.

I wish I could go back one version without losing the history for all my sensors, but I don’t know how. Rebuilding and importing a backup did bring me stability, but I lost all history, and that is not acceptable for my gf.

puddly commented 4 months ago

@wim-bart @KaosApplication You're using the (experimental) multiprotocol addon:

2024-05-10 12:23:32.766 DEBUG (MainThread) [bellows.ezsp] socket://core-silabs-multiprotocol:9999 connection lost unexpectedly: Remote server closed connection

When the addon crashes, ZHA will reload. I suggest you disable it and migrate back to normal Zigbee firmware. This isn't a problem with ZHA, nor one that will be fixed for the foreseeable future, as development on the multiprotocol addon has been paused: https://skyconnect.home-assistant.io/procedures/disable-multiprotocol/
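
The check used here, reading the radio URL out of the `bellows.ezsp` "connection lost" line, can be sketched as follows. The regular expression is based only on the log line quoted above; the function names are hypothetical, and matching on the substring `silabs-multiprotocol` is an assumption about how the add-on's hostname appears in logs.

```python
import re

# Matches lines such as:
# "... [bellows.ezsp] socket://core-silabs-multiprotocol:9999 connection lost unexpectedly: ..."
_LOST = re.compile(r"\[bellows\.ezsp\] (\S+) connection lost unexpectedly")


def lost_transports(lines):
    """Return the distinct radio URLs reported as lost, in first-seen order."""
    seen = []
    for line in lines:
        match = _LOST.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def uses_multiprotocol(url):
    """Heuristic: does the URL point at the multiprotocol add-on?"""
    return "silabs-multiprotocol" in url
```

A log without such a line says nothing either way; the radio path is also shown in the ZHA diagnostics file.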

KaosApplication commented 4 months ago

@puddly Ah okay, thanks for the heads-up. But for clarification: if I disable multiprotocol I will have no connection to my Matter network/device, am I correct? That was one of the main reasons I went from the ConBee II to the SkyConnect stick.

wim-bart commented 4 months ago

> @wim-bart @KaosApplication You're using the (experimental) multiprotocol addon:
>
> 2024-05-10 12:23:32.766 DEBUG (MainThread) [bellows.ezsp] socket://core-silabs-multiprotocol:9999 connection lost unexpectedly: Remote server closed connection
>
> When the addon crashes, ZHA will reload. I suggest you disable it and migrate back to normal Zigbee firmware, this isn't a problem with ZHA nor one that will be fixed for the foreseeable future, as development on the multiprotocol addon has been paused: https://skyconnect.home-assistant.io/procedures/disable-multiprotocol/

What experimental thing am I using?

I use native ZHA with out-of-the-box firmware that I have been using for months now. image

And no add-ons whatsoever for ZHA: image

KaosApplication commented 4 months ago

Again @puddly, so this means I cannot use my SkyConnect USB stick for Zigbee and Matter simultaneously, even though I bought it and migrated from the ConBee II mainly for this reason?

puddly commented 4 months ago

Correct. Multi-PAN is not stable for everyone. If it isn't stable for you, you can't use one radio for both.

maceddy commented 4 months ago

I do not know if there is a common ZHA thread, but my Zigbee looks completely dead. The only thing working is one temperature sensor. I tried rebooting, changing the USB port, etc. My garden lights are on (LOL) and I cannot control my sunscreen, so it is now 26 degrees (C) in the house :-P

I have attached debug log. Could it be my Zigbee dongle is broken?? It's about a year and a half old.

home-assistant_zha_2024-05-15T15-05-49.198Z.log

I really really have no clue.. It began to degrade from the 2024.4.4 update and got worse with every release...

puddly commented 4 months ago

@maceddy If you can receive (sensors update) but cannot send (can't control bulbs), you probably have too much interference near the stick and the firmware isn't letting you send. Make sure your coordinator is on a USB extension cable and positioned away from 2.4GHz interference sources such as any USB 3.0 ports, SSDs, hard drives, power supplies, WiFi routers, and so on.

maceddy commented 4 months ago

I will try that too. However, I just found out that when I power cycle the Zigbee devices (smart plugs, lights; one light was enough to trigger them all), they work again. I am considering resetting the complete Zigbee integration and starting all over again...

issue-triage-workflows[bot] commented 1 month ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.