home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
73.38k stars 30.64k forks source link

ZHA reloads on timeouts #105705

Closed Cyberax closed 10 months ago

Cyberax commented 10 months ago

The problem

The recent change ( https://github.com/home-assistant/core/pull/105659 ) added ZHA reload on any error. Unfortunately, this now causes controller reinitialization after timeouts. This in turn can cause ZHA to enter a restart loop. After a couple hundred iterations of restarts, it can also completely confuse Zigbee devices, rendering the whole network inoperable.

What version of Home Assistant Core has the issue?

core-2023.12.2

What was the last working version of Home Assistant Core?

core-2023.12.1

What type of installation are you running?

Home Assistant OS

Integration causing the issue

ZHA

Link to integration documentation on our website

No response

Diagnostics information

2023-12-13 16:35:09.890 ERROR (MainThread) [bellows.uart] Lost serial connection: ConnectionResetError('Remote server closed connection')
2023-12-13 16:35:09.891 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart
2023-12-13 16:35:10.002 WARNING (MainThread) [homeassistant.helpers.dispatcher] Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_g
roup of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7ff24107e810>>
2023-12-13 16:35:10.098 ERROR (MainThread) [zigpy.zcl] [0x559E:1:0xff01] Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe
    res = await self.write_attributes(attributes, manufacturer=manufacturer)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 84, in wrapper
    return await RETRYABLE_REQUEST_DECORATOR(func)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/util.py", line 138, in retry
    return await func()
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 620, in write_attributes
    return await self.write_attributes_raw(attrs, manufacturer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 626, in write_attributes_raw
    result = await self._write_attributes(attrs, manufacturer=manufacturer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/quirks/__init__.py", line 321, in _write_attributes
    return await super()._write_attributes(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 377, in request
    return await self._endpoint.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request
    return await self.device.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request
    await self._application.request(
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 829, in request
    await self.send_packet(
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 762, in send_packet
    status, _ = await self._ezsp.sendUnicast(
                      ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'sendUnicast'

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

puddly commented 10 months ago

The linked pull request reloads the integration when the coordinator fails to start up or crashes during operation (e.g. you unplug it), you won't encounter it just because of network problems.

Can you include a complete ZHA debug log from startup, including the crash?

home-assistant[bot] commented 10 months ago

Hey there @dmulcahey, @adminiuga, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `zha` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign zha` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


zha documentation zha source (message by IssueLinks)

patdemko commented 10 months ago

Not sure if it's the exact same thing, but I'm also seeing something strange with ZHA after upgrading to 2023.12.2 last night. After upgrading I started noticing my zigbee devices bouncing between available and unavailable. Looking a bit closer it appears that ZHA was re-initializing over and over. I tried rebooting the OS and host (Home Assistant Yellow with a usb stick that has both a zigbee & z-wave radio attached) which didn't seem to work, but since it was late I left it go over night and would deal with in the morning. When I woke up the entire system was super sluggish and not working right. I'm in the process now of trying to restore a backup to get it back up and running and verify it was related to the upgrade. Unfortunately, I did not get any logs before starting the restore.

davidbb commented 10 months ago

Adding another data point (without log, unfortunately). I have a bunch of automations that alert me when ZHA devices go to unavailable state. Immediately after the update to 2023.12.2, these automations started firing repeatedly every few minutes. But when verifying, the devices were not unavailable... They must have gone into unavailable state for just an instant.

I rebooted the OS and then downgraded to 2023.12.1 and the notifications stopped.

maglat commented 10 months ago

After upgrading to 2023.12.2 mein ZHA is a completely mess! Will downgrade immediately.

Edit: Using Sky Connect. Devices (107) from all kind of brands. Hue, Ikea, Lidl, Aqara, others.. Edit2: My friend has issues with his Ikea motion sensors. Similar to the post I found on reddit. downgrade to 2023.12.1 fixed it for us.

J-o-h-n-M commented 10 months ago

same here, upgrading to 2023.12.2 or 2023.12.3, completely messed up my zha

sebastian3107 commented 10 months ago

same for me on SkyConnect Dongle.

puddly commented 10 months ago

Please upload debug logs of ZHA running and then crashing.

Ssyleric commented 10 months ago

Same here, in all my apartment the lights blink like Saturday nigh!

C0MTRAYA commented 10 months ago

hey guys... same here... just went to 2023.12.0 to 2023.12.3 and have the same issue: bounce between unavailable to the current state every few minutes, stays unavailable for 15sec or so. All ZigBee devices; using SONOFF_Zigbee_3.0_USB_Dongle_Plus, ZHA, rpi4

edit: rolled back to 2023.12.0 overnight, back to normal

Screenshot_20231216_013606_Home Assistant

rogerrem commented 10 months ago

I have the same problem after upgrading to 2023.12.2, using SkyConnect and ZHA. Works fine for a random amount of time after reboot (5-30 minutes), then all ZHA devices are unreachable and a new reboot is required over and over again.

WAF factor is suffering :p

Cyberax commented 10 months ago

Added logs. It's unclear what's going on, how can I add more debugging? home-assistant_zha_2023-12-18T03-45-04.527Z.log

Cyberax commented 10 months ago

One possible culprit: homeassistant/amd64-addon-silabs-multiprotocol:2.3.2 addon ("Silicon Labs Multiprotocol"). It looks like it has problems with framing:

[20:11:53:477628] Core : Endpoint #12: rxd data frame
[20:11:53:477641] Core : Endpoint #12: rxd data frame queued
[20:11:53:477644] Core : Endpoint #12 sent ACK: 2
[20:11:53:477653] Core : txd transmit completed
[20:11:53:477655] Core : Endpoint #12: frame transmit submitted
[20:11:53:477656] Core : Endpoint #12: txd ack
[20:11:53:477658] Core : rxd iframe with valid header checksum
[20:11:53:477681] Driver : 7 bytes left in the UART char driver
[20:11:53:485031] Driver : re-sync : Will test 58 header combination
[20:11:53:485042] Driver : re-sync : The start of the buffer is aligned with a good header
[20:11:53:487844] Core : Endpoint #12: rxd data frame
[20:11:53:487865] Core : Endpoint #12: rxd data frame queued
[20:11:53:487868] Core : Endpoint #12 sent ACK: 3
[20:11:53:487875] Core : txd transmit completed
[20:11:53:487877] Core : Endpoint #12: frame transmit submitted
[20:11:53:487879] Core : Endpoint #12: txd ack
[20:11:53:487881] Core : rxd iframe with valid header checksum
[20:11:53:487938] Driver : 7 bytes left in the UART char driver
[20:11:53:530147] Driver : re-sync : Will test 58 header combination
[20:11:53:530167] Driver : re-sync : The start of the buffer is aligned with a good header
[20:11:53:532601] Core : Endpoint #12: rxd data frame
[20:11:53:532636] Core : Endpoint #12: rxd data frame queued
[20:11:53:532641] Core : Endpoint #12 sent ACK: 4
[20:11:53:532657] Core : txd transmit completed
[20:11:53:532663] Core : Endpoint #12: frame transmit submitted
[20:11:53:532687] Core : Endpoint #12: txd ack
[20:11:53:532691] Core : rxd iframe with valid header checksum
[20:11:53:532716] Driver : 7 bytes left in the UART char driver
[20:11:55:086598] Driver : re-sync : Will test 58 header combination
[20:11:55:086619] Driver : re-sync : The start of the buffer is aligned with a good header
[20:11:55:089982] Core : Endpoint #12: rxd data frame
[20:11:55:090031] Core : Endpoint #12: rxd data frame queued
[20:11:55:090041] Core : Endpoint #12 sent ACK: 5
[20:11:55:090070] Core : txd transmit completed
[20:11:55:090077] Core : Endpoint #12: frame transmit submitted
[20:11:55:090082] Core : Endpoint #12: txd ack
[20:11:55:090090] Core : rxd iframe with valid header checksum
[20:11:55:090137] Driver : 7 bytes left in the UART char driver

For a symptomatic fix, I disabled the "OpenThread Border Router" in the https://homecontrol.<me>/hassio/addon/core_silabs_multiprotocol/config settings, and all the Matter-associated devices. It's been holding for the last 2 hours so far.

puddly commented 10 months ago

@Cyberax Your problem is because either the multiprotocol addon has crashed, or the firmware on your device is unresponsive. See https://github.com/home-assistant/addons/issues/3192#issuecomment-1856354160 for a potential fix for that problem.

ZHA will endlessly try to reconnect but if the addon is unresponsive (as in your log), there's nothing more that ZHA can do.

Cyberax commented 10 months ago

Yes, I'm pretty sure it's a timing issue in the SiLabs driver somewhere. Something in ZHA is just triggering it more often.

For other people on the thread, I can greatly alleviate the issue by turning on the option "Co-Processor Communication tracing" in https://home.my.domain/hassio/addon/core_silabs_multiprotocol/config and disabling all the Thread/Matter stuff (deactivating the integations).

It's obviously a workaround, not a fix. But it might get your network working again for now.

maglat commented 10 months ago

Yes, I'm pretty sure it's a timing issue in the SiLabs driver somewhere. Something in ZHA is just triggering it more often.

For other people on the thread, I can greatly alleviate the issue by turning on the option "Co-Processor Communication tracing" in https://home.my.domain/hassio/addon/core_silabs_multiprotocol/config and disabling all the Thread/Matter stuff (deactivating the integations).

It's obviously a workaround, not a fix. But it might get your network working again for now.

I will give it a try, but at the end, who gone need to fix what? The workaround is no long term solution and it cant be that we now need to disabled stuff for a new „feature“ introduced by puddly which breaks everything when it worked flawlessly bevor.

maglat commented 10 months ago

The silabs multiprotocol cant be the issue, after I don't not have it installed.

Edit: Even Multiprotocol support in Skyconnect settings are disabled at my side.

puddly commented 10 months ago

@maglat As mentioned above, please post debug logs. You don't have the same issue as @Cyberax, who originally opened the issue.

davidbb commented 10 months ago

I also don't have silabs multiprotol enabled. Here are my logs from this morning after upgrading once again from 2023.12.0 to 2023.12.3. ZHA restarts begin immediately, looks like every 5 mins approximately. The attached logs capture 3 or 4 restart instances.

Screenshot 2023-12-18 at 9 24 49 AM

home-assistant_zha_2023-12-18T14-12-56.453Z.log.gz

Ssyleric commented 10 months ago

hey guys... same here... just went to 2023.12.0 to 2023.12.3 and have the same issue: bounce between unavailable to the current state every few minutes, stays unavailable for 15sec or so. All ZigBee devices; using SONOFF_Zigbee_3.0_USB_Dongle_Plus, ZHA, rpi4

edit: rolled back to 2023.12.0 overnight, back to normal

Screenshot_20231216_013606_Home Assistant

I go back to the previous version, 2023.12.0 ! now, i deactivated the automated updates and waiting for the next bug fixe release.

davidbb commented 10 months ago

@puddly in your opinion is the symptom of "state change between available/unavailable since 2023.12.2" (as reported by myself, @Ssyleric, @C0MTRAYA, and @patdemko in this issue) the same/related to the issue described in the opening post?

If not maybe it makes sense to open a new issue. Please let us know. I'd be happy to provide fresh logs as this issue is easily reproducible by upgrading my install from 2023.12.1 to either 2023.12.2 or 2023.12.3. Thanks!

asayler commented 10 months ago

I commented on #105734 (which should maybe be marked as a dup of this), but seeing very similar behavior on my HA Yellow since at least the 2023.12.2 update (it's not clear if it was present in 2023.12.1). I'm using the Yellow's built-in Zigbee radio without the multiprotocol firmware (Zigbee-only).

Debug logs: home-assistant_2023-12-20T06-04-22.736Z.log

C0MTRAYA commented 10 months ago

hey guys... same here... just went to 2023.12.0 to 2023.12.3 and have the same issue: bounce between unavailable to the current state every few minutes, stays unavailable for 15sec or so. All ZigBee devices; using SONOFF_Zigbee_3.0_USB_Dongle_Plus, ZHA, rpi4 edit: rolled back to 2023.12.0 overnight, back to normal Screenshot_20231216_013606_Home Assistant

I go back to the previous version, 2023.12.0 ! now, i deactivated the automated updates and waiting for the next bug fixe release.

yup! good thing to disable auto updates, I only update when I actually have the time to troubleshoot. when I did this it was Friday night so I had time to try other thing and troubleshoot before ultimatly rolling back. (the number of times I updated just before going to work ending up worrying all day because I couldn't start troubleshooting untill after my shift lol) I'm sure it will be patched soon though, from experience I know the devs are doing a great job

patdemko commented 10 months ago

I just tried upgrading to 2023.12.4 since I saw it included #106147 which was mentioned by @puddly in the hopes it would fix the issue I was seeing. However, I'm still seeing the same issue as I reported before (zigbee devices constantly going from available to unavailable due to ZHA constantly resetting itself). Although this time I did enable debug mode on ZHA and get logs before I restored my older backup. To save space, I filtered my log only lines containing "zha", but I have the full log available if needed as well. Also @puddly since my issue may be different than the original one from this thread, let me know if creating a new issue would be better here. Since there were a few already open and a number of people have already commented in this one I didn't want to add another, but will do so if you want.

System Log zha.code-search.log

Logbook screenshot showing one of my zigbee devices going from available to unavailable which starting after upgrading from 2023.12.1 to 2023.12.4.

kithcen_motion_unavailable

puddly commented 10 months ago

@patdemko Unfortunately, filtering by ZHA will exclude everything useful. Try compressing the log. If that isn't enough, you can email it (or a link to it) to me at puddly3@gmail.com.

patdemko commented 10 months ago

@patdemko Unfortunately, filtering by ZHA will exclude everything useful. Try compressing the log. If that isn't enough, you can email it (or a link to it) to me at puddly3@gmail.com.

Sorry didn't realized. Compressed it isn't bad. Here you go.

home-assistant_2023-12-27T15-28-48.304Z_compressed.zip

davidbb commented 10 months ago

Updating to 2023.12.4 does not fix the issue for me. I don't think #106147 fixed the root cause. Reverting to 2023.12.1 makes everything work smoothly, so it's definitely something introduced in 2023.12.3 (or maybe 2023.12.2). I'm on a Home Assistant Yellow using the built-in zigbee module.

puddly commented 10 months ago

@patdemko Your log is really helpful, thank you. I think we're going in the right direction with it.

Could you enable Core debug logging in your configuration.yaml file and restart HA? It will be very verbose, HA will run a little slower, and the log will grow very quickly. Leave it running for however long it takes for you to reproduce the issue a few times, an hour would be ideal.

logger:
  default: debug

It will be too large and have too much sensitive info for you to upload to GitHub so send a link to it over email to puddly3@gmail.com and I will take a look.

Thanks!

patdemko commented 10 months ago

@patdemko Your log is really helpful, thank you. I think we're going in the right direction with it.

Could you enable Core debug logging in your configuration.yaml file and restart HA? It will be very verbose, HA will run a little slower, and the log will grow very quickly. Leave it running for however long it takes for you to reproduce the issue a few times, an hour would be ideal.

logger:
  default: debug

It will be too large and have too much sensitive info for you to upload to GitHub so send a link to it over email to puddly3@gmail.com and I will take a look.

Thanks!

@puddly sorry for delay, but yesterday before seeing this message I had already decided to migrate my z-wave and zigbee devices away from the usb stick with dual z-wave and zigbee radios plugged into my yellow to the included zigbee chip on the yellow and a 800LR Z-Wave GPIO module which I have been planning to do for a while. I completed that last night and just a bit ago tried what you were requesting to see if it would still happen on different radios.

So I do think I duplicated it again after a while, but I think the full debugging you had me turn on was too much for the yellow and before I could download the logs HA restarted itself on me (assuming from the debug mode being on). After it came back up, I downloaded the logs (152MB generated over 5 minutes) but it appears by the timestamps it was only from the time of the reboot and not before when I saw the issue. I'll send you a link to that log file in a minute, but I'm not sure the issue was reproduced in that 5 minute time frame. Is there anyway to be more specific with which debug logs you need if we need to try this again so to hopefully not overwhelm the yellow?

TekFan commented 10 months ago

I had to update from 12.1 to 12.4 two days ago, expecting to fix another ZHA issue. But since then, I experience a similar problem as the others in this thread, a random ZHA reload. In my case it's only about once a day though, so it is very difficult to catch that in a log since it would need to be permanently in ZHA debug mode which would explode the file. But at least in the regular log, I get this at the same moment the restart occurs:


Logger: zigpy_deconz.uart
Source: /usr/local/lib/python3.11/site-packages/zigpy_deconz/uart.py:32
First occurred: December 27, 2023 at 16:13:37 (3 occurrences)
Last logged: 00:03:50

Lost connection: SerialException('device reports readiness to read but returned no data (device disconnected or multiple access on port?)')
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/serial_asyncio_fast/__init__.py", line 132, in _read_ready
    data = self._serial.read(self._max_read_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/serial/serialposix.py", line 595, in read
    raise SerialException(
serial.serialutil.SerialException: device reports readiness to read but returned no data (device disconnected or multiple access on port?)

But I couldn't identify if this is a cause or a consequence of the reload of ZHA. Going back to 12.1 obviously fixes the problem as well as eliminates a bunch of other new ZHA warnings I never had before, flooding the log. As far as I remember, I never experienced a ZHA spontaneous reload until 2023.12.x

One thing I do know is that this bug is very problematic in my case as it does trigger many automations (events are fired at ZHA reload) including some TTS on my HomePods in the middle of the night. Very low WAF...

harvindhillon commented 10 months ago

Same issue here, doesnt happen very often maybe 2 or 3 times a day but enough to cause devices to drop off the network. Network was rock solid for the past year with no drop off devices but big issues in december.

Can clearly see in the devices tab that all devices went offline x hours ago but there was no restart.

teranex commented 10 months ago

I had a similar experience. I think it started when I upgraded from 2023.12.1 to 2023.12.3. All devices lost their connection a good number of times a day for 15 or so seconds, causing the network to sometimes not fully recover. I downgraded to 2023.12.1 again but still saw the same behaviour. Then I downgraded to 2023.12.0 and then only saw it 2 or 3 times a day for 15 seconds. 4 days ago I moved my HA install from a RPi3 to a RPi4 (by restoring a backup and reusing the exact same coordinator (Sonoff Dongle-E)). Since then I didn't notice this behaviour anymore. Yesterday I upgraded to 2023.12.4 and everything still runs smoothly.

sebastian3107 commented 10 months ago

I can confirm that it still happens on 2023.12.4. Unfortunately, for me this is the last nail in the coffin for ZHA, at least for now. I had some stability problems before and will be migrating my 60 devices over to Z2M and probably also moving away from the Skyconnect dongle...

dmulcahey commented 10 months ago

Please try the current beta if you are having issues

davidbb commented 10 months ago

I just installed 2024.1.0b3 and I think things are stable! Been running for 15 mins with no devices going unavailable. Things seem responsive. Thanks @dmulcahey @puddly and everyone else who investigated this.

dmulcahey commented 10 months ago

Great to hear it! We did find an issue with Zigbee groups that will be fixed in the next release.

rogerrem commented 10 months ago

Sorry, I’ve never used beta ZHA, How do I install the beta 🙈

TheJulianJES commented 10 months ago

It's the beta Home Assistant Core version you need to install: https://www.home-assistant.io/common-tasks/os/#run-a-beta-version

rogerrem commented 10 months ago

It's the beta Home Assistant Core version you need to install: https://www.home-assistant.io/common-tasks/os/#run-a-beta-version

Thanks! That makes sense, I was looking for a separate ZHA beta 😅😂 Things are stable after 2024.01 for me 👍

J-o-h-n-M commented 10 months ago

Same for me, I upgraded to 2024.1 (from 2023.12.1) and all seems to be working fine.

dmulcahey commented 10 months ago

The 2024.1 release had a bunch of fixes in it that should address stability. I am going to close this. If folks still have issues please open a new issue and provide full debug logs and completely fill out the issue template. Also, please don't piggy back other issues that are open unless the issue, error messages if there are any, and the hardware configuration for the coordinator are EXACTLY the same. Thanks!!

C0MTRAYA commented 10 months ago

running 2024.1.0 for a few days, I can comfirm everything is working now! thanks to the devs for the great work!