home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

ZHA - None of my zigbee devices is working anymore #42997

Closed Jens-Wymeersch closed 3 years ago

Jens-Wymeersch commented 3 years ago

The problem

After a few days of stability, I added more Zigbee bulbs (HUE, IKEA) yesterday. After a while I noticed that some of the devices didn't work anymore, so I restarted HA. After that I wasn't able to get any of the devices online. I'm using a Sonoff coordinator which is located next to an AP (signal 99%, -37 dBm).

Environment

Problem-relevant configuration.yaml

Traceback/Error logs


Logger: zigpy.zcl
Source: /usr/local/lib/python3.8/site-packages/zigpy/zcl/__init__.py:110
First occurred: 00:39:58 (1138 occurrences)
Last logged: 08:41:20

    [0x0974:1:0x0000] Unknown cluster-specific command 10
    [0xe6c7:1:0x0000] Unknown cluster-specific command 10
    [0x8789:1:0x0000] Unknown cluster-specific command 10
    [0x1ff3:1:0x0000] Unknown cluster-specific command 10
    [0x7637:1:0x0000] Unknown cluster-specific command 10

Logger: bellows.zigbee.application
Source: /usr/local/lib/python3.8/site-packages/bellows/zigbee/application.py:642
First occurred: 00:41:46 (234 occurrences)
Last logged: 08:40:33
Watchdog heartbeat timeout: 

Logger: homeassistant
Source: /usr/src/homeassistant/homeassistant/runner.py:115
First occurred: 04:48:42 (3 occurrences)
Last logged: 06:30:25
Error doing job: Exception in callback ThreadsafeProxy.__getattr__.<locals>.func_wrapper.<locals>.check_result_wrapper() at /usr/local/lib/python3.8/site-packages/bellows/thread.py:97

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.8/site-packages/bellows/thread.py", line 98, in check_result_wrapper
    result = call()
  File "/usr/local/lib/python3.8/site-packages/bellows/ezsp/__init__.py", line 230, in frame_received
    self._protocol(data)
  File "/usr/local/lib/python3.8/site-packages/bellows/ezsp/protocol.py", line 101, in __call__
    frame_name = self.COMMANDS_BY_ID[frame_id][0]
KeyError: 255

Logger: homeassistant.components.websocket_api.http.connection.139707610881760
Source: components/websocket_api/connection.py:126
Integration: Home Assistant WebSocket API (documentation, issues)
First occurred: 02:20:27 (1 occurrences)
Last logged: 02:20:27
Error handling message: Timeout 

Logger: homeassistant.components.websocket_api.http.connection.139707569744384
Source: components/zha/api.py:230
Integration: Home Assistant WebSocket API (documentation, issues)
First occurred: 00:58:32 (1 occurrences)
Last logged: 00:58:32
Error handling message: Unknown error

AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/websocket_api/decorators.py", line 18, in _handle_async_response
    await func(hass, connection, msg)
  File "/usr/src/homeassistant/homeassistant/components/zha/api.py", line 230, in websocket_permit_devices
    await zha_gateway.application_controller.permit(time_s=duration, node=ieee)
  File "/usr/local/lib/python3.8/site-packages/bellows/zigbee/application.py", line 523, in permit
    await super().permit(time_s, node)
  File "/usr/local/lib/python3.8/site-packages/zigpy/application.py", line 324, in permit
    await zigpy.zdo.broadcast(
  File "/usr/local/lib/python3.8/site-packages/zigpy/device.py", line 373, in broadcast
    result = await app.broadcast(
  File "/usr/local/lib/python3.8/site-packages/bellows/zigbee/application.py", line 606, in broadcast
    with self._pending.new(message_tag) as req:
  File "/usr/local/lib/python3.8/site-packages/zigpy/util.py", line 262, in new
    raise ControllerException(f"duplicate {sequence} TSN") from AssertionError
zigpy.exceptions.ControllerException: duplicate 215 TSN

Logger: homeassistant.components.websocket_api.http.connection.139707612834736
Source: components/websocket_api/connection.py:84
Integration: Home Assistant WebSocket API (documentation, issues)
First occurred: 00:39:59 (1 occurrences)
Last logged: 00:39:59
Received invalid command: zha/devices/permit 

Additional information

I've been seeing a lot of instability on my ZHA network. When I look at my ZigZag card, I can see my Zigbee network with all its connections. That said, all my devices are greyed out. Since the problem started, I've enabled debug logging. The full logs can be found here: https://paste.ubuntu.com/p/2SQRDDjc9x/plain/

probot-home-assistant[bot] commented 3 years ago

zha documentation zha source (message by IssueLinks)

probot-home-assistant[bot] commented 3 years ago

Hey there @dmulcahey, @adminiuga, mind taking a look at this issue as its been labeled with an integration (zha) you are listed as a codeowner for? Thanks! (message by CodeOwnersMention)

basnagel commented 3 years ago

Just a little troubleshooting: if you want a stable Zigbee network (2.4 GHz), putting the coordinator right next to a -37 dBm Wi-Fi (2.4 GHz) radio jammer might not be a good idea.
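To make the overlap concrete, here is a rough sketch that lists which Zigbee channels sit outside a given set of 2.4 GHz Wi-Fi channels. The centre-frequency formulas are the standard ones. The channel widths (22 MHz for Wi-Fi, 2 MHz for Zigbee) and the helper names are simplifying assumptions, and the thread does not say which channels are actually in use here.

```python
# Crude model: a Wi-Fi channel is treated as a 22 MHz block and a Zigbee
# channel as a 2 MHz block (assumptions; 40 MHz Wi-Fi would be worse).
WIFI_HALF_WIDTH_MHZ = 11
ZIGBEE_HALF_WIDTH_MHZ = 1


def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Zigbee channel (11..26)."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels are 11..26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Wi-Fi channel (1..13)."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers Wi-Fi channels 1..13")
    return 2407 + 5 * channel


def overlaps(wifi_channel: int, zigbee_channel: int) -> bool:
    """True if the two modelled frequency blocks overlap."""
    distance = abs(wifi_center_mhz(wifi_channel) - zigbee_center_mhz(zigbee_channel))
    return distance < WIFI_HALF_WIDTH_MHZ + ZIGBEE_HALF_WIDTH_MHZ


def clear_zigbee_channels(wifi_channels):
    """Zigbee channels that overlap none of the given Wi-Fi channels."""
    return [
        z for z in range(11, 27)
        if not any(overlaps(w, z) for w in wifi_channels)
    ]


if __name__ == "__main__":
    # Hypothetical example: neighbours on Wi-Fi channels 1, 6 and 11.
    print(clear_zigbee_channels([1, 6, 11]))
```

Even on a "clear" channel, a strong transmitter a few centimetres away can still desensitise the Zigbee radio, so physical distance matters as well.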

Jens-Wymeersch commented 3 years ago

@basnagel - I didn't think about this one. I'll move it to another location immediately. By the way, here are my logs: https://paste.ubuntu.com/p/GxWpQQbGPN/

Adminiuga commented 3 years ago

The last log looks good. The "Watchdog heartbeat timeout:" entries are bad: they mean there's no communication with the stick.
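For a sense of scale, the bellows entry in the original report shows 234 occurrences between 00:41:46 and 08:40:33. A small sketch of that arithmetic follows. The function name is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def mean_interval_seconds(first: str, last: str, occurrences: int) -> float:
    """Average spacing between log entries, given HH:MM:SS timestamps.

    Assumes both timestamps are on the same day.
    """
    fmt = "%H:%M:%S"
    span = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    return span.total_seconds() / occurrences


if __name__ == "__main__":
    # Values copied from the "Watchdog heartbeat timeout" log entry above.
    interval = mean_interval_seconds("00:41:46", "08:40:33", 234)
    print(f"about one watchdog timeout every {interval:.0f} s")
```

That works out to roughly one lost heartbeat every two minutes over the whole night.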

Jens-Wymeersch commented 3 years ago

I've been struggling with this for a few weeks now, trying to sort it out. My Home Assistant is running in an Ubuntu/Windows VirtualBox VM wired to the router. The Sonoff gateway is still connected wirelessly to the AP. What do you suggest I do? Frankly, I'm out of ideas, and the Zigbee network has been down for almost 24 hours now.

Adminiuga commented 3 years ago

The EZSP protocol was not designed to be run over unstable networks like Wi-Fi. For the ZHA integration to work, the serial TCP gateway has to be available when ZHA starts. After that it may be able to recover from occasional disconnects, but in your case 234 watchdog timeouts in 8 hours means that you should not run it like this. Get a physical USB stick.
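If you want to check whether the serial TCP gateway is reachable before starting ZHA, a minimal sketch could look like the following. The host and port in the example are placeholders (assumptions), so use whatever address your bridge actually exposes. This only tests that the TCP port accepts a connection. It says nothing about whether EZSP itself is healthy.

```python
import socket


def gateway_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical address of the Wi-Fi bridge; replace with your own.
    host, port = "192.168.1.50", 8888
    state = "reachable" if gateway_reachable(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

Note that many serial-over-TCP bridges accept only one client at a time, so running a probe like this while ZHA is connected may itself disturb the link.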

Jens-Wymeersch commented 3 years ago

In your documentation (https://www.home-assistant.io/integrations/zha#known-working-zigbee-radio-modules), you recommend the sonoff bridge. Can you please recommend one of them as I don't want to have the same problems again ?

thanks

Adminiuga commented 3 years ago

https://elelabs.com/products/elelabs-usb-adapter.html

Jens-Wymeersch commented 3 years ago

Eventually I expect to go over 80 Zigbee devices (currently 70 in my house), of which at least 40 will be routers. Should I expect any problems? Should I do something specific to prevent them?

Adminiuga commented 3 years ago

🤷 Running 101 devices with slightly less than half of them routers. The only problems are caused by xiaomi devices. And I don't run any HUE lights, so can't comment on those.

Jens-Wymeersch commented 3 years ago

I'll order one tomorrow. This instability drives me nuts. Thank you so much. Here is my list of devices:

Anything I should know about these devices ?

RaraAvis8 commented 3 years ago

I have a similar problem. I updated yesterday from 1.114.2 to 1.118.0 and all Zigbee devices went offline. I never experienced any instability before. I run HA in Docker on a Synology NAS. As Zigbee coordinator I use a USB stick with a CC2531.

I downgraded to 1.114.2 and it helped. Now all devices are back online. But I'm unable to update HA until this issue is fixed :cry:

Adminiuga commented 3 years ago

Different radio, different connection type, different issue. Zigpy-cc made a breaking change around 0.115 iirc. Upgrade, then carefully edit .storage/core.config_entries and change 'radio_type' from ti_cc to znp
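For anyone who would rather script that edit than do it by hand, here is a minimal sketch. It assumes the usual layout of `.storage/core.config_entries` (a JSON object with the entries under `data.entries`, each entry carrying a `domain` and its own `data` dict). The function names are made up for illustration. Stop Home Assistant first, since it may rewrite the file on shutdown, and keep the backup the script writes.

```python
import json
from pathlib import Path


def switch_zha_radio_type(config: dict, old: str = "ti_cc", new: str = "znp") -> int:
    """Change radio_type on ZHA config entries in place; return how many changed."""
    changed = 0
    for entry in config.get("data", {}).get("entries", []):
        if entry.get("domain") != "zha":
            continue
        entry_data = entry.get("data", {})
        if entry_data.get("radio_type") == old:
            entry_data["radio_type"] = new
            changed += 1
    return changed


def patch_file(path: Path) -> int:
    """Patch the file on disk, writing a .bak copy of the original first."""
    original = path.read_text(encoding="utf-8")
    config = json.loads(original)
    changed = switch_zha_radio_type(config)
    if changed:
        path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
        path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return changed


if __name__ == "__main__":
    # Path is relative to the Home Assistant config directory.
    print(patch_file(Path(".storage/core.config_entries")), "entry(ies) updated")
```

If the script reports 0 entries updated, nothing was written and the file is left untouched.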

RaraAvis8 commented 3 years ago

@Adminiuga You were right. The black magic you suggested has helped :smile: Thanks!

erix22 commented 3 years ago

Hi, not sure if it's related to the subject, but just in case: I have HA as a VM (VirtualBox) with the MQTT add-on. All the Zigbee devices are managed by a Tasmotized Sonoff ZBBridge (8.5.1) without an HA add-on.

Yesterday evening I decided to upgrade from Core 0.117.5 to the proposed 0.118.2. Everything went well; it was late, so I went to bed.

This morning none of the Zigbee devices were reporting: none in HA, and none in the Tasmotized Sonoff Zigbee bridge either.

When I checked the GUI of the Tasmotized Sonoff Zigbee bridge, the device list was empty. I had to reboot the bridge to see the list of Zigbee devices in its GUI again. I don't know what happened between the Tasmotized Sonoff Zigbee bridge and HA.

Now I can see the Zigbee device values updating normally in HA.

Cheers

Adminiuga commented 3 years ago

What was the error and debug log saying?

MattWestb commented 3 years ago

I rebooted and changed the debug settings multiple times today while testing my Tuya TRVs. In the afternoon 2 TRVs were online (no problem, they are not in active production), but when I looked, many end devices were offline. In the ZHA network card, half of the devices had no LQI. One IKEA on/off switch could control one plug, but it was not reporting battery or answering attribute requests. I power-cycled the outlet that is its parent (per the ZigZag map) and it started working, along with 2 Xiaomi weather sensors that also had the plug as parent. Next, in the bathroom, the on/off switch and the Xiaomi weather sensor were not working, so I power-cycled that plug. Last was the Xiaomi water sensor on the balcony, which has the old LL IKEA E27 RGBW as parent: I power-cycled the bulb and it started working. Twice ZHA was not able to start normally and restarted very late, and ZHA-MAP was not loaded because of that, as I saw in the log. I don't have any saved logs, as I was busy fighting with the TRV testing.

I don't know if it's the same problem, or if ZHA / Zigpy / Bellows had trouble starting properly and messed up the security frame counter on the NCP / TC for some devices. (When the FC gets out of sync between 2 devices, they throw away all of each other's frames, but if commands come from other devices (another neighbour / a different route) with a valid FC, they work, like my on/off switch and plug.)

I'm keeping an eye on the LQI and will try to get logs if it happens again.

MattWestb commented 3 years ago

Back to the IKEA E27 RGBW. It reacted to the first group "on" command, but after that it did not react to broadcast or unicast at all.

I remember the problem where the Tasmota Zigbee Bridge reset the frame counter on reboot, so it was out of sync, and this looked exactly the same. The first group command worked (it's a broadcast and arrives from several devices over different routes). Then the device reports its status back to the coordinator by unicast, and after that everything is blocked, because the coordinator sends everything back over the same route the attribute report came from, and the frame counter blocks those frames. If the device has children of its own, they keep working, but not to or from the mesh. Power-cycle it and it works.
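A very simplified model of that behaviour, just to show why a reset counter looks like a dead link. This is not how zigpy, bellows or any real stack implements it. The class and method names are invented, and "power-cycling forgets the stored counter" is an assumption based on the observation that repowering helped.

```python
class FrameCounterFilter:
    """Toy replay protection: one 'last seen' counter per sender."""

    def __init__(self):
        self._last_seen = {}

    def accept(self, sender: str, counter: int) -> bool:
        """Accept a frame only if its counter is newer than the last one seen."""
        last = self._last_seen.get(sender)
        if last is not None and counter <= last:
            return False  # looks like a replayed frame, so it is dropped
        self._last_seen[sender] = counter
        return True

    def forget(self, sender: str) -> None:
        """Model a power cycle: the stored counter for this sender is lost."""
        self._last_seen.pop(sender, None)


if __name__ == "__main__":
    bulb = FrameCounterFilter()
    print(bulb.accept("coordinator", 1000))  # normal traffic is accepted
    print(bulb.accept("coordinator", 5))     # coordinator restarted from a low counter
    print(bulb.accept("plug", 420))          # same command relayed by another neighbour
    bulb.forget("coordinator")               # bulb is power-cycled
    print(bulb.accept("coordinator", 6))     # accepted again
```

In this model the second frame is dropped while the relayed one still gets through, which matches the pattern of a group command working and direct unicast failing.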

So it's not the same problem as this issue; very likely the problem is that restarting the NCP lost or corrupted the security frame counter.