Closed Cyberax closed 10 months ago
The linked pull request reloads the integration when the coordinator fails to start up or crashes during operation (e.g. you unplug it), you won't encounter it just because of network problems.
Can you include a complete ZHA debug log from startup, including the crash?
Hey there @dmulcahey, @adminiuga, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha
) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
zha documentation zha source (message by IssueLinks)
Not sure if it's the exact same thing, but I'm also seeing something strange with ZHA after upgrading to 2023.12.2 last night. After upgrading I started noticing my zigbee devices bouncing between available and unavailable. Looking a bit closer it appears that ZHA was re-initializing over and over. I tried rebooting the OS and host (Home Assistant Yellow with a usb stick that has both a zigbee & z-wave radio attached) which didn't seem to work, but since it was late I left it go over night and would deal with in the morning. When I woke up the entire system was super sluggish and not working right. I'm in the process now of trying to restore a backup to get it back up and running and verify it was related to the upgrade. Unfortunately, I did not get any logs before starting the restore.
Adding another data point (without log, unfortunately). I have a bunch of automations that alert me when ZHA devices go to unavailable state. Immediately after the update to 2023.12.2, these automations started firing repeatedly every few minutes. But when verifying, the devices were not unavailable... They must have gone into unavailable state for just an instant.
I rebooted the OS and then downgraded to 2023.12.1 and the notifications stopped.
After upgrading to 2023.12.2 mein ZHA is a completely mess! Will downgrade immediately.
Edit: Using Sky Connect. Devices (107) from all kind of brands. Hue, Ikea, Lidl, Aqara, others.. Edit2: My friend has issues with his Ikea motion sensors. Similar to the post I found on reddit. downgrade to 2023.12.1 fixed it for us.
same here, upgrading to 2023.12.2 or 2023.12.3, completely messed up my zha
same for me on SkyConnect Dongle.
Please upload debug logs of ZHA running and then crashing.
Same here, in all my apartment the lights blink like Saturday nigh!
hey guys... same here... just went to 2023.12.0 to 2023.12.3 and have the same issue: bounce between unavailable to the current state every few minutes, stays unavailable for 15sec or so. All ZigBee devices; using SONOFF_Zigbee_3.0_USB_Dongle_Plus, ZHA, rpi4
edit: rolled back to 2023.12.0 overnight, back to normal
I have the same problem after upgrading to 2023.12.2, using SkyConnect and ZHA. Works fine for a random amount of time after reboot (5-30 minutes), then all ZHA devices are unreachable and a new reboot is required over and over again.
WAF factor is suffering :p
Added logs. It's unclear what's going on, how can I add more debugging? home-assistant_zha_2023-12-18T03-45-04.527Z.log
One possible culprit: homeassistant/amd64-addon-silabs-multiprotocol:2.3.2 addon ("Silicon Labs Multiprotocol"). It looks like it has problems with framing:
[20:11:53:477628] Core : Endpoint #12: rxd data frame
[20:11:53:477641] Core : Endpoint #12: rxd data frame queued
[20:11:53:477644] Core : Endpoint #12 sent ACK: 2
[20:11:53:477653] Core : txd transmit completed
[20:11:53:477655] Core : Endpoint #12: frame transmit submitted
[20:11:53:477656] Core : Endpoint #12: txd ack
[20:11:53:477658] Core : rxd iframe with valid header checksum
[20:11:53:477681] Driver : 7 bytes left in the UART char driver
[20:11:53:485031] Driver : re-sync : Will test 58 header combination
[20:11:53:485042] Driver : re-sync : The start of the buffer is aligned with a good header
[20:11:53:487844] Core : Endpoint #12: rxd data frame
[20:11:53:487865] Core : Endpoint #12: rxd data frame queued
[20:11:53:487868] Core : Endpoint #12 sent ACK: 3
[20:11:53:487875] Core : txd transmit completed
[20:11:53:487877] Core : Endpoint #12: frame transmit submitted
[20:11:53:487879] Core : Endpoint #12: txd ack
[20:11:53:487881] Core : rxd iframe with valid header checksum
[20:11:53:487938] Driver : 7 bytes left in the UART char driver
[20:11:53:530147] Driver : re-sync : Will test 58 header combination
[20:11:53:530167] Driver : re-sync : The start of the buffer is aligned with a good header
[20:11:53:532601] Core : Endpoint #12: rxd data frame
[20:11:53:532636] Core : Endpoint #12: rxd data frame queued
[20:11:53:532641] Core : Endpoint #12 sent ACK: 4
[20:11:53:532657] Core : txd transmit completed
[20:11:53:532663] Core : Endpoint #12: frame transmit submitted
[20:11:53:532687] Core : Endpoint #12: txd ack
[20:11:53:532691] Core : rxd iframe with valid header checksum
[20:11:53:532716] Driver : 7 bytes left in the UART char driver
[20:11:55:086598] Driver : re-sync : Will test 58 header combination
[20:11:55:086619] Driver : re-sync : The start of the buffer is aligned with a good header
[20:11:55:089982] Core : Endpoint #12: rxd data frame
[20:11:55:090031] Core : Endpoint #12: rxd data frame queued
[20:11:55:090041] Core : Endpoint #12 sent ACK: 5
[20:11:55:090070] Core : txd transmit completed
[20:11:55:090077] Core : Endpoint #12: frame transmit submitted
[20:11:55:090082] Core : Endpoint #12: txd ack
[20:11:55:090090] Core : rxd iframe with valid header checksum
[20:11:55:090137] Driver : 7 bytes left in the UART char driver
For a symptomatic fix, I disabled the "OpenThread Border Router" in the https://homecontrol.<me>/hassio/addon/core_silabs_multiprotocol/config
settings, and all the Matter-associated devices. It's been holding for the last 2 hours so far.
@Cyberax Your problem is because either the multiprotocol addon has crashed, or the firmware on your device is unresponsive. See https://github.com/home-assistant/addons/issues/3192#issuecomment-1856354160 for a potential fix for that problem.
ZHA will endlessly try to reconnect but if the addon is unresponsive (as in your log), there's nothing more that ZHA can do.
Yes, I'm pretty sure it's a timing issue in the SiLabs driver somewhere. Something in ZHA is just triggering it more often.
For other people on the thread, I can greatly alleviate the issue by turning on the option "Co-Processor Communication tracing" in https://home.my.domain/hassio/addon/core_silabs_multiprotocol/config and disabling all the Thread/Matter stuff (deactivating the integations).
It's obviously a workaround, not a fix. But it might get your network working again for now.
Yes, I'm pretty sure it's a timing issue in the SiLabs driver somewhere. Something in ZHA is just triggering it more often.
For other people on the thread, I can greatly alleviate the issue by turning on the option "Co-Processor Communication tracing" in https://home.my.domain/hassio/addon/core_silabs_multiprotocol/config and disabling all the Thread/Matter stuff (deactivating the integations).
It's obviously a workaround, not a fix. But it might get your network working again for now.
I will give it a try, but at the end, who gone need to fix what? The workaround is no long term solution and it cant be that we now need to disabled stuff for a new „feature“ introduced by puddly which breaks everything when it worked flawlessly bevor.
The silabs multiprotocol cant be the issue, after I don't not have it installed.
Edit: Even Multiprotocol support in Skyconnect settings are disabled at my side.
@maglat As mentioned above, please post debug logs. You don't have the same issue as @Cyberax, who originally opened the issue.
I also don't have silabs multiprotol enabled. Here are my logs from this morning after upgrading once again from 2023.12.0 to 2023.12.3. ZHA restarts begin immediately, looks like every 5 mins approximately. The attached logs capture 3 or 4 restart instances.
home-assistant_zha_2023-12-18T14-12-56.453Z.log.gz
hey guys... same here... just went to 2023.12.0 to 2023.12.3 and have the same issue: bounce between unavailable to the current state every few minutes, stays unavailable for 15sec or so. All ZigBee devices; using SONOFF_Zigbee_3.0_USB_Dongle_Plus, ZHA, rpi4
edit: rolled back to 2023.12.0 overnight, back to normal
I go back to the previous version, 2023.12.0 ! now, i deactivated the automated updates and waiting for the next bug fixe release.
@puddly in your opinion is the symptom of "state change between available/unavailable since 2023.12.2" (as reported by myself, @Ssyleric, @C0MTRAYA, and @patdemko in this issue) the same/related to the issue described in the opening post?
If not maybe it makes sense to open a new issue. Please let us know. I'd be happy to provide fresh logs as this issue is easily reproducible by upgrading my install from 2023.12.1 to either 2023.12.2 or 2023.12.3. Thanks!
I commented on #105734 (which should maybe be marked as a dup of this), but seeing very similar behavior on my HA Yellow since at least the 2023.12.2 update (it's not clear if it was present in 2023.12.1). I'm using the Yellow's built-in Zigbee radio without the multiprotocol firmware (Zigbee-only).
Debug logs: home-assistant_2023-12-20T06-04-22.736Z.log
hey guys... same here... just went to 2023.12.0 to 2023.12.3 and have the same issue: bounce between unavailable to the current state every few minutes, stays unavailable for 15sec or so. All ZigBee devices; using SONOFF_Zigbee_3.0_USB_Dongle_Plus, ZHA, rpi4 edit: rolled back to 2023.12.0 overnight, back to normal
I go back to the previous version, 2023.12.0 ! now, i deactivated the automated updates and waiting for the next bug fixe release.
yup! good thing to disable auto updates, I only update when I actually have the time to troubleshoot. when I did this it was Friday night so I had time to try other thing and troubleshoot before ultimatly rolling back. (the number of times I updated just before going to work ending up worrying all day because I couldn't start troubleshooting untill after my shift lol) I'm sure it will be patched soon though, from experience I know the devs are doing a great job
I just tried upgrading to 2023.12.4 since I saw it included #106147 which was mentioned by @puddly in the hopes it would fix the issue I was seeing. However, I'm still seeing the same issue as I reported before (zigbee devices constantly going from available to unavailable due to ZHA constantly resetting itself). Although this time I did enable debug mode on ZHA and get logs before I restored my older backup. To save space, I filtered my log only lines containing "zha", but I have the full log available if needed as well. Also @puddly since my issue may be different than the original one from this thread, let me know if creating a new issue would be better here. Since there were a few already open and a number of people have already commented in this one I didn't want to add another, but will do so if you want.
System Log zha.code-search.log
Logbook screenshot showing one of my zigbee devices going from available to unavailable which starting after upgrading from 2023.12.1 to 2023.12.4.
@patdemko Unfortunately, filtering by ZHA
will exclude everything useful. Try compressing the log. If that isn't enough, you can email it (or a link to it) to me at puddly3@gmail.com
.
@patdemko Unfortunately, filtering by
ZHA
will exclude everything useful. Try compressing the log. If that isn't enough, you can email it (or a link to it) to me atpuddly3@gmail.com
.
Sorry didn't realized. Compressed it isn't bad. Here you go.
Updating to 2023.12.4 does not fix the issue for me. I don't think #106147 fixed the root cause. Reverting to 2023.12.1 makes everything work smoothly, so it's definitely something introduced in 2023.12.3 (or maybe 2023.12.2). I'm on a Home Assistant Yellow using the built-in zigbee module.
@patdemko Your log is really helpful, thank you. I think we're going in the right direction with it.
Could you enable Core debug logging in your configuration.yaml
file and restart HA? It will be very verbose, HA will run a little slower, and the log will grow very quickly. Leave it running for however long it takes for you to reproduce the issue a few times, an hour would be ideal.
logger:
default: debug
It will be too large and have too much sensitive info for you to upload to GitHub so send a link to it over email to puddly3@gmail.com
and I will take a look.
Thanks!
@patdemko Your log is really helpful, thank you. I think we're going in the right direction with it.
Could you enable Core debug logging in your
configuration.yaml
file and restart HA? It will be very verbose, HA will run a little slower, and the log will grow very quickly. Leave it running for however long it takes for you to reproduce the issue a few times, an hour would be ideal.logger: default: debug
It will be too large and have too much sensitive info for you to upload to GitHub so send a link to it over email to
puddly3@gmail.com
and I will take a look.Thanks!
@puddly sorry for delay, but yesterday before seeing this message I had already decided to migrate my z-wave and zigbee devices away from the usb stick with dual z-wave and zigbee radios plugged into my yellow to the included zigbee chip on the yellow and a 800LR Z-Wave GPIO module which I have been planning to do for a while. I completed that last night and just a bit ago tried what you were requesting to see if it would still happen on different radios.
So I do think I duplicated it again after a while, but I think the full debugging you had me turn on was too much for the yellow and before I could download the logs HA restarted itself on me (assuming from the debug mode being on). After it came back up, I downloaded the logs (152MB generated over 5 minutes) but it appears by the timestamps it was only from the time of the reboot and not before when I saw the issue. I'll send you a link to that log file in a minute, but I'm not sure the issue was reproduced in that 5 minute time frame. Is there anyway to be more specific with which debug logs you need if we need to try this again so to hopefully not overwhelm the yellow?
I had to update from 12.1 to 12.4 two days ago, expecting to fix another ZHA issue. But since then, I experience a similar problem as the others in this thread, a random ZHA reload. In my case it's only about once a day though, so it is very difficult to catch that in a log since it would need to be permanently in ZHA debug mode which would explode the file. But at least in the regular log, I get this at the same moment the restart occurs:
Logger: zigpy_deconz.uart
Source: /usr/local/lib/python3.11/site-packages/zigpy_deconz/uart.py:32
First occurred: December 27, 2023 at 16:13:37 (3 occurrences)
Last logged: 00:03:50
Lost connection: SerialException('device reports readiness to read but returned no data (device disconnected or multiple access on port?)')
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/serial_asyncio_fast/__init__.py", line 132, in _read_ready
data = self._serial.read(self._max_read_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/serial/serialposix.py", line 595, in read
raise SerialException(
serial.serialutil.SerialException: device reports readiness to read but returned no data (device disconnected or multiple access on port?)
But I couldn't identify if this is a cause or a consequence of the reload of ZHA. Going back to 12.1 obviously fixes the problem as well as eliminates a bunch of other new ZHA warnings I never had before, flooding the log. As far as I remember, I never experienced a ZHA spontaneous reload until 2023.12.x
One thing I do know is that this bug is very problematic in my case as it does trigger many automations (events are fired at ZHA reload) including some TTS on my HomePods in the middle of the night. Very low WAF...
Same issue here, doesnt happen very often maybe 2 or 3 times a day but enough to cause devices to drop off the network. Network was rock solid for the past year with no drop off devices but big issues in december.
Can clearly see in the devices tab that all devices went offline x hours ago but there was no restart.
I had a similar experience. I think it started when I upgraded from 2023.12.1 to 2023.12.3. All devices lost their connection a good number of times a day for 15 or so seconds, causing the network to sometimes not fully recover. I downgraded to 2023.12.1 again but still saw the same behaviour. Then I downgraded to 2023.12.0 and then only saw it 2 or 3 times a day for 15 seconds. 4 days ago I moved my HA install from a RPi3 to a RPi4 (by restoring a backup and reusing the exact same coordinator (Sonoff Dongle-E)). Since then I didn't notice this behaviour anymore. Yesterday I upgraded to 2023.12.4 and everything still runs smoothly.
I can confirm that it still happens on 2023.12.4. Unfortunately, for me this is the last nail in the coffin for ZHA, at least for now. I had some stability problems before and will be migrating my 60 devices over to Z2M and probably also moving away from the Skyconnect dongle...
Please try the current beta if you are having issues
I just installed 2024.1.0b3 and I think things are stable! Been running for 15 mins with no devices going unavailable. Things seem responsive. Thanks @dmulcahey @puddly and everyone else who investigated this.
Great to hear it! We did find an issue with Zigbee groups that will be fixed in the next release.
Sorry, I’ve never used beta ZHA, How do I install the beta 🙈
It's the beta Home Assistant Core version you need to install: https://www.home-assistant.io/common-tasks/os/#run-a-beta-version
It's the beta Home Assistant Core version you need to install: https://www.home-assistant.io/common-tasks/os/#run-a-beta-version
Thanks! That makes sense, I was looking for a separate ZHA beta 😅😂 Things are stable after 2024.01 for me 👍
Same for me, I upgraded to 2024.1 (from 2023.12.1) and all seems to be working fine.
The 2024.1 release had a bunch of fixes in it that should address stability. I am going to close this. If folks still have issues please open a new issue and provide full debug logs and completely fill out the issue template. Also, please don't piggy back other issues that are open unless the issue, error messages if there are any, and the hardware configuration for the coordinator are EXACTLY the same. Thanks!!
running 2024.1.0 for a few days, I can comfirm everything is working now! thanks to the devs for the great work!
The problem
The recent change ( https://github.com/home-assistant/core/pull/105659 ) added ZHA reload on any error. Unfortunately, this now causes controller reinitialization after timeouts. This in turn can cause ZHA to enter a restart loop. After a couple hundred iterations of restarts, it can also completely confuse Zigbee devices, rendering the whole network inoperable.
What version of Home Assistant Core has the issue?
core-2023.12.2
What was the last working version of Home Assistant Core?
core-2023.12.1
What type of installation are you running?
Home Assistant OS
Integration causing the issue
ZHA
Link to integration documentation on our website
No response
Diagnostics information
Example YAML snippet
No response
Anything in the logs that might be useful for us?
No response
Additional information
No response