home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
71.53k stars 29.91k forks source link

Zigbee is failing constantly, requires reloading/reinitialization #105506

Closed mmccool closed 7 months ago

mmccool commented 9 months ago

The problem

The Zigbee (ZHA Zigbee Multi-PAN) integration on my HA instance is failing, falling back into a "Failure to setup" state. Reloading it fails again, but then retries, entering an "Initialization" stage, after which it works... for a while, from a few minutes to an hour, before failing again. Occasionally the above does not work and a power cycle is needed to get it to reload.

I am running on Home Assistant Yellow, and in particular am running "Yellow Multi-PAN".

NOTE: originally I posted this issue as a ZHA problem but the problem resolved after disabling Multi-PAN and using plain ZHA, so I think it is a problem specifically with Multi-PAN, not the "baseline" ZHA per se.

What version of Home Assistant Core has the issue?

core-2023.12.1

What was the last working version of Home Assistant Core?

core-2023.10

What type of installation are you running?

Home Assistant OS

Integration causing the issue

ZHA (Multi-PAN)

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/ but with Multi-PAN enabled as per https://yellow.home-assistant.io/guides/enable-multiprotocol/

Diagnostics information

config_entry-zha-d179b6f823a03243c5e4b54db431cebc.json.txt

Example YAML snippet

No response

Anything in the logs that might be useful for us?

[0xDFC0:1:0x0001]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')]
1:32:45 PM – (WARNING) Zigbee Home Automation - message first occurred at December 9, 2023 at 3:35:49 PM and shows up 126 times
[0x0898:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
1:29:04 PM – (ERROR) runner.py - message first occurred at 1:27:34 PM and shows up 31 times
Discarding _save_attribute event
1:29:01 PM – (WARNING) components/zha/binary_sensor.py - message first occurred at 1:29:01 PM and shows up 19 times
[0x4C2A:1:0x0702]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
1:28:47 PM – (WARNING) Zigbee Home Automation - message first occurred at 1:28:47 PM and shows up 6 times
Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_group of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7f62493690>>
1:28:46 PM – (WARNING) helpers/dispatcher.py - message first occurred at December 9, 2023 at 4:16:54 PM and shows up 12 times
Error setting up entry Yellow Multi-PAN for zha
December 9, 2023 at 4:36:40 PM – (ERROR) config_entries.py

Details from [0xDFC0:1:0x0001]:
Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:537
Integration: Zigbee Home Automation (documentation, issues)
First occurred: December 9, 2023 at 3:35:49 PM (126 occurrences)
Last logged: 1:32:45 PM

[0xDFC0:1:0x0001]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')]
[0xDFC0:1:0x0500]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')]
[0x9278:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0x9278:1:0x0702]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0x9278:1:0x0b04]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]

Details from [0x0898:1:0xff01]:
Logger: zigpy.zcl
Source: runner.py:188
First occurred: 1:27:34 PM (31 occurrences)
Last logged: 1:29:04 PM

[0x0898:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
[0xB790:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
[0x7426:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
[0xC67C:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
[0xBC11:1:0x000a] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 411, in reply return await self._endpoint.reply( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 278, in reply return await self.device.reply( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 474, in reply return await self.request( ^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 767, in send_packet status, _ = await self._ezsp.sendUnicast( ^^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'NoneType' object has no attribute 'sendUnicast'

Details from 
Logger: zigpy.appdb
Source: components/zha/binary_sensor.py:227
First occurred: 1:29:01 PM (19 occurrences)
Last logged: 1:29:01 PM

Discarding _save_attribute event

Details from 
Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:389
Integration: Zigbee Home Automation (documentation, issues)
First occurred: 1:28:47 PM (6 occurrences)
Last logged: 1:28:47 PM

[0x4C2A:1:0x0702]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
[0x4C2A:1:0x0b04]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
[0xE85E:1:0x0006]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
[0xE85E:1:0x0b04]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
[0xE85E:1:0x0702]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/core/helpers.py", line 338, in wrapper
    return await func(cluster_handler, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 389, in async_initialize
    await self._get_attributes(
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 491, in _get_attributes
    read, _ = await self.cluster.read_attributes(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 524, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 377, in request
    return await self._endpoint.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request
    return await self.device.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request
    await self._application.request(
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request
    await self.send_packet(
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 767, in send_packet
    status, _ = await self._ezsp.sendUnicast(
                      ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'sendUnicast'

Detail from 
Logger: homeassistant.helpers.dispatcher
Source: helpers/dispatcher.py:59
First occurred: December 9, 2023 at 4:16:54 PM (12 occurrences)
Last logged: 1:28:46 PM

Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_group of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7f62493690>>

Details from [0x4C2A:1:0x0702]
Logger: homeassistant.config_entries
Source: config_entries.py:402
First occurred: December 9, 2023 at 4:36:40 PM (1 occurrences)
Last logged: December 9, 2023 at 4:36:40 PM

Error setting up entry Yellow Multi-PAN for zha
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 402, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 167, in async_setup_entry
    zha_gateway = await ZHAGateway.async_from_config(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 192, in async_from_config
    await instance.async_initialize()
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 210, in async_initialize
    await app.startup(auto_form=True)
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 223, in startup
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 144, in connect
    await ezsp.startup_reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 111, in startup_reset
    await self._gw.wait_for_startup_reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 170, in wait_for_startup_reset
    await self._startup_reset_future
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 1003, in _read_ready__data_received
    self._protocol.data_received(data)
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 76, in data_received
    self.frame_received(frame)
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 103, in frame_received
    self.data_frame_received(data)
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 124, in data_frame_received
    self._application.frame_received(self._randomize(data[1:-3]))
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 305, in frame_received
    self._protocol(data)
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 83, in __call__
    sequence, frame_id, data = self._ezsp_frame_rx(data)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/v4/__init__.py", line 33, in _ezsp_frame_rx
    return data[0], data[2], data[3:]
           ~~~~^^^
IndexError: index out of range

Additional information

I think the problems started with 2023.12.0 but I am not completely sure - it did have a tendency to crash before, but not nearly as often. Before 2023.12 it maybe happened once a week, which was annoying but semi-tolerable. Now it is basically unusable, and essentially breaks any automation that depends on a zigbee device.

I should also mention I have set up and used Matter devices but don't have any active right now. The symptoms do seem similar to #105344 and #105345 - but I have no Xbee devices.

home-assistant[bot] commented 9 months ago

Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `zha` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign zha` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


zha documentation zha source (message by IssueLinks)

puddly commented 9 months ago

Would you be able to enable debug logging for the ZHA integration and upload a complete log of it crashing and failing to reload?

mmccool commented 9 months ago

I did enable debug logging, and just reloaded it. Let me check if it's crashed yet... nope, still working. Will do so after it crashes. I had an older log file but I let it run too long and it became too big to upload, so just restarted debug logging again.

mmccool commented 9 months ago

So as an additional observation I have seen errors about the NCP failing to respond, watchdog failure, etc. and then I see it enter an "Initialization" state on its own, but succeeds. I have seen this twice now. I will wait for an unrecoverable error before uploading the logs but the recoverable ones seem to be happening every 5-10m or so (I assuming this is probably the Watchdog restarting it).

tangodelta73 commented 9 months ago

Having the exact same problem! This was changed in 12.1 version: "Fix ZHA quirk ID custom entities matching all devices", could this be the problem?

mmccool commented 9 months ago

Ugh, it finally crashed but the debug log was still too big to upload (over 400MB!). It seems it appended to my old log rather than start a new one, also. I am going to manually edit it just to show the last bit. I will keep the full log around, let me know if you want me to extract any other sections.

mmccool commented 9 months ago

home-assistant_zha_2023-12-11T18-57-42.214Z-trimmed.log

puddly commented 9 months ago

@mmccool In your log, did you manually reload the ZHA integration about five times? If so, can you describe why? Up until the point you reload it, the log looks normal.

Letalis commented 9 months ago

Screenshot_20231211-220948 error_log-2.txt Encounters this from time to time. Especially after a reboot. Needs to disable and activate it . Then it comes to a "normal" state. But accepting ligthcontroll. Gives failed to send signal or something like that. Will gather logs aswell

pimeys commented 9 months ago

I've also witnessed issues with ZHA integration after upgrading to 2023.12. I will upload logs next time when it happens. Usually it breaks after restarting HASS.

puddly commented 9 months ago

@Letalis I think you may have submitted your comment before your screenshot and log files actually finished uploading! Edit them into your original comment, the Uploading ... text will be replaced once they finish uploading.

mmccool commented 9 months ago

For the log, I did NOT manually reload it. For each of those five instances it entered the "Initialization" state on its own and restarted on its own. It succeeded restarting/reloading several times on its own until finally this self-initiated loop failed. I'm guessing these restarts were initiated by a watchdog timer.

Also, I truncated the log. It did this a lot more than five times.

marcroca81 commented 9 months ago

Here the same problem: home-assistant_zha_2023-12-12T09-17-49.702Z.log

Logger: homeassistant.components.zha.core.cluster_handlers Source: components/zha/core/cluster_handlers/init.py:537 Integration: Zigbee Home Automation (documentation, issues) First occurred: 10:13:23 (2 occurrences) Last logged: 10:13:23

[0x4D68:1:0x0500]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')] [0x4D68:1:0x0406]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]

ckadluba commented 9 months ago

Is this the same problem as reported in issue #105445?

I also have the problem that all Zigbee and Matter devices are unavailable after upgrading HA Core to >= 2023.12.0. Zigbee, Matter and Thread integrations are permanently in a failure state.

Citizen-2CB8A24A commented 9 months ago

Same here. 66 devices / 443 entities via RaspBee II on HAOS 2023.12.1. After a few hours ZHA fails and does not recover by itself. Reload of ZHA helps instantly, but just for another few hours. Disabling ZHA's entity update polling doesn't help. CPU and memory is at bay. No WiFi on Zigbee channel. It was running flawlessly for months.

 Logger: zigpy.application
Source: components/zha/core/gateway.py:210
First occurred: 10. Dezember 2023 um 19:05:15 (10 occurrences)
Last logged: 03:46:53

    Zigbee channel 15 utilization is 81.18%!
    If you are having problems joining new devices, are missing sensor updates, or have issues keeping devices joined, ensure your coordinator is away from interference sources such as USB 3.0 devices, SSDs, WiFi routers, etc.
    Zigbee channel 15 utilization is 82.35%!
    Zigbee channel 15 utilization is 76.47%!
    Zigbee channel 15 utilization is 77.65%!
Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:537
Integration: Zigbee Home Automation (documentation, issues)
First occurred: 10. Dezember 2023 um 19:05:27 (528 occurrences)
Last logged: 13:13:59

[0xC543:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0xB202:3:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0x1C75:1:0x0b04]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.MAC_NO_ACK: 233>'), TimeoutError(), TimeoutError(), TimeoutError()]
[0x5E7F:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), TimeoutError(), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0x3E5F:1:0x0008]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), TimeoutError()]

In the desperate I've enabled Source Routing, which takes off some load from the mesh/coordinator.

zha:
  zigpy_config:
    source_routing: true

(configuration.yaml)

For now it seems to work.

I'll report, if that changes.

What about the 38400 baud limit (which is just 4,8 kbytes/s), we have on the serial side (RaspBee II)? This sounds like a bottleneck to me, if the whole Zigbee protocol is processed by ZHA and might became more complex now. By using ZHA the bandwidth demand on the serial interface is completely different, in comparison of using the whole Zigbee stack on the RaspBee II chip itself (Deconz approach).

I've ordered a SMLIGHT SLZB-06M coordinator, maybe things are different with that type of hardware...

psuurbach commented 9 months ago

+1 keeps getting offline and is not selfhealing, always worked until last update. Same logs as above. Now trying the option of [Citizen-2CB8A24A] Update : suggestion option did not work, only 24 devices 169 entities.

mmccool commented 9 months ago

I should also say I have a fair number of Zigbee devices: 49 devices, 376 entities. However, not so sure it's a bottleneck issue as it worked ok until recently - unless the last update made it significantly less efficient. I'm also running on a "large" CM4 (8GB) so I don't think memory is a problem. Unlike several commentors on https://github.com/home-assistant/core/issues/105445, who seem to have a very similar problem, I'm using "mainstream/supported" HW: the HA Yellow platform. Also, this issue it does try to restart (so does not stay in "Initializing" state) but tends to fail and get stuck in a "Failed to set up" state (typically after passing through "Failed to set up, retrying"...). Reloading tends to fail once, then succeeds after one retry - it does not succeed immediately.

derekgermano commented 9 months ago

Just wanted to chime in... I was having problems with my SONOFF Zigbee 3.0 USB Dongle Plus on my X86 install. ZHA was failing every few hours after installing 2023.12.0. Restored my 2023.11.3 backup. Installed 2023.12.0 again... ZHA did the same thing again... Failing every few hours. Went to my 2023.11.3 backup again and ZHA has been solid on my system for days. I didn't download any logs because I thought it was something that I had done. Going to stay on 2023.11.3 for the time being.

dmulcahey commented 9 months ago

Just wanted to chime in... I was having problems with my SONOFF Zigbee 3.0 USB Dongle Plus on my X86 install. ZHA was failing every few hours after installing 2023.12.0. Restored my 2023.11.3 backup. Installed 2023.12.0 again... ZHA did the same thing again... Failing every few hours. Went to my 2023.11.3 backup again and ZHA has been solid on my system for days. I didn't download any logs because I thought it was something that I had done. Going to stay on 2023.11.3 for the time being.

If you can flip back for a short while and capture logs it would be greatly appreciated

Citizen-2CB8A24A commented 9 months ago

For now it seems to work.

I'll report, if that changes.

Well, ZHA failed again this morning. Source Routing doesn't help. After ZHA integration reload things went back to normal again instantly.

Logger: zigpy.application
Source: components/zha/core/gateway.py:210
First occurred: 07:05:51 (2 occurrences)
Last logged: 07:05:51

Zigbee channel 15 utilization is 77.65%!
If you are having problems joining new devices, are missing sensor updates, or have issues keeping devices joined, ensure your coordinator is away from interference sources such as USB 3.0 devices, SSDs, WiFi routers, etc.
Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:537
Integration: Zigbee Home Automation (documentation, issues)
First occurred: 12. Dezember 2023 um 23:31:06 (51 occurrences)
Last logged: 07:06:13

[0xA305:1:0x0702]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0x67CE:2:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0xA305:1:0x0b04]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0xC543:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), TimeoutError()]
[0x2E0A:3:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), TimeoutError()]
Citizen-2CB8A24A commented 9 months ago

A preventive reload would be an option (as a workaround). Is there a way to automate a reload of the integration every hour?

markaggar commented 9 months ago

Same thing has happened a few times for me, and just now. It failed and reinitialized a few times in quick succession before I manually intervened. I manually reloaded the integration and it coincidentally stopped failing.

Letalis commented 9 months ago

A preventive reload would be an option (as a workaround). Is there a way to automate a reload of the integration every hour?

If I do a preventive reload or reboot of the HA Blue I have a living hell for a couple of minutes to get it stable again.

Citizen-2CB8A24A commented 9 months ago

A preventive reload would be an option (as a workaround). Is there a way to automate a reload of the integration every hour?

If I do a preventive reload or reboot of the HA Blue I have a living hell for a couple of minutes to get it stable again.

It's not about reloading the box ("reboot"), it's just about reloading the integration. That happens very fast and might work for another hour without failure.

If the integration fails, things are broken anyways. It cannot get worse. Recurring manual intervention cannot be the solution for the meantime.

puddly commented 9 months ago

Please include debug logs in your comments that show ZHA running from startup, crashing, and not reloading. It'll really speed up figuring out the root cause of this problem.

Citizen-2CB8A24A commented 9 months ago

@puddly Thank you for the workaround! I'm enabling debug log now...

markaggar commented 9 months ago

Here's the log failure from last night if it's helpful. Note I'm running the Nortek zigbee/zwave stick via a VM on a Synology NAS. Never had these type of issues until 2023.12.1 (didn't install 2023.12.0)..

Logger: homeassistant.components.zha.core.cluster_handlers Source: components/zha/core/cluster_handlers/init.py:380 Integration: Zigbee Home Automation (documentation, issues) First occurred: 12:04:14 AM (8 occurrences) Last logged: 12:07:11 AM

[0x0377:1:0x0b04]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast' Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/helpers.py", line 338, in wrapper return await func(cluster_handler, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/init.py", line 380, in async_initialize await self._get_attributes( File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/init.py", line 491, in _getattributes read, = await self.cluster.read_attributes( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/init.py", line 524, in read_attributes result = await self.read_attributes_raw(to_read, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/init.py", line 377, in request return await self._endpoint.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 767, in sendpacket status, = await self._ezsp.sendUnicast( ^^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'NoneType' object has no attribute 'sendUnicast'

xyboox commented 9 months ago

My setup is experiencing the same behaviour for the last couple of days ( I suspect since the last update, to 2023.12.1 ).

My network settings: Screenshot 2023-12-13 at 18 46 07

Below are my ZHA debug related logs. Crash happened at what is visible in logs at ~19:46 - 19:49

has-zha.log.zip

@puddly - logs updated here. Thanks!

puddly commented 9 months ago

@markaggar @xyboox When I refer to debug logs, I specifically mean the debug logging you either enable through YAML (https://www.home-assistant.io/integrations/zha/#debug-logging) or with the "enable debug logging" button in ZHA's config. Warning logs or errors aren't of much use here because I need to know everything that happened before the warnings are logged.

If you can, reproduce the issue again and edit the debug logs into your comments.

Letalis commented 9 months ago

@puddly Here comes a debug log from my system. What I did was enabling the debug. Then did a reload which caused somewhat of disturbance. Then I did HA reboot and then it took a while for it to be 90% stable, home-assistant_zha_2023-12-13T17-08-38.741Z.zip

xyboox commented 9 months ago

Here's another series of logs, from next failure ( ~1h after the prev. ) zha-debug.log.zip

LE: 2023.12.2 didn't fix it ( not sure if it addressed it though ).

ckadluba commented 9 months ago

In my environment the problem went away yesterday. All Zigbee and Matter/Thread devices are back. Everything works fine again.

It started working again before I updated to 2023.12.2 but after the fix PRs were merged. It seems strange to me. Is there some kind of over the air update involved?

Later I updated to 2023.12.2. Everything is still working as it should.

Citizen-2CB8A24A commented 9 months ago

In my environment the problem went away yesterday. All Zigbee and Matter/Thread devices are back. Everything works fine again.

It became quite indeed. I'm still on 12.1 and will stay there as long as needed for delivering a qualified debug log.

This represents the pattern experienced:

Bildschirmfoto vom 2023-12-14 12-06-41

derekgermano commented 9 months ago

Just wanted to chime in... I was having problems with my SONOFF Zigbee 3.0 USB Dongle Plus on my X86 install. ZHA was failing every few hours after installing 2023.12.0. Restored my 2023.11.3 backup. Installed 2023.12.0 again... ZHA did the same thing again... Failing every few hours. Went to my 2023.11.3 backup again and ZHA has been solid on my system for days. I didn't download any logs because I thought it was something that I had done. Going to stay on 2023.11.3 for the time being.

If you can flip back for a short while and capture logs it would be greatly appreciated

Just wanted to give an update. Last evening, I enabled debug logging then upgraded from 2023.11.3 to 2023.12.2. So far, there hasn't been any ZHA failures. I will continue to monitor.

mmccool commented 9 months ago

I upgraded to 2023.12.2 as well earlier today. It was stable for about an hour, then failed - but managed to recover on its own. However, as it's not 100% stable I would not consider this resolved (especially if the problem has not been root caused or addressed yet with targeted fixes). I'll let it run at least a few more hours and then report back on its behavior.

mmccool commented 9 months ago

Unfortunately 2023.12.2 has not resolved the problem. It did last for three hours but it seems to have automatically restarted several times before finally crashing again. I will try to capture some debug logs tomorrow.

BTW when it is down it also makes the whole interface super slow and basically unusable on a mobile device, even for non-zigbee devices. Too many "unavailable" entities on the dashboard...

image

Citizen-2CB8A24A commented 9 months ago

After about 24 hours ZHA not failing I decided to disable HW flow control as suggested in the mentioned thread. The flow control was one of the first measures I did, to try to mitigate the instabilities. After the first failures I took even the chance to try zigbee2mqtt, which I stopped immediately, after the very first thing (a plug from the TS011F family) was not recognized. When coming back from that and while reactivating ZHA I enabled HW flow control the very first time. So the ZHA instabilities came before activating HW flow control. I'm out of clues whats happening here. Afaik we have a high sunspot count at present. ;-)

This weekend I'll try to use my VNA for radio spectrum analysis.

Debug logging is still on and I'm holding at 12.1.

Citizen-2CB8A24A commented 9 months ago

I'm out of clues whats happening here.

I'm still curious about the bandwidth demand at the serial side, if the speed is limited to 38,400 baud (RaspBee II). If the ZigBee network has interference problems at times (the 2.4GHz ISM band is dirty, no doubts about that) my assumption is, that ZHA needs to do many repetitions of ZigBee commands through that bottleneck. If ZHA is forced to repeat itself that massively, there will be a saturation of the serial interface at some point. Once there are occasional ZigBee network problems, having many devices do add to the problem.

It would be different, if the RaspBee II unit does those repetitions itself, since they wont need to go through the (limited) serial interface (Deconz-mode). So people who use Deconz shouldn't have those problems.

Outsourcing the ZigBee stack might be a real culprit here. If that is the point, the ZHA integration should be modified to handle such a saturation by some sort of throttling further commands, instead of crashing and/or whining about impossible interface initializations.

I have no real clue here, just some thoughts. ;-P

simonwestenholz commented 9 months ago

https://ground.news/article/biggest-solar-flare-in-years-temporarily-disrupts-radio-signals-on-earth_589615 Curious coincidence.

evelant commented 9 months ago

I'm seeing problems after upgrading to 2023.12.3 using zigbee2mqtt. I'm not sure why they would be related, but zigbee devices become unavilable in HA after about 1hr of uptime. Restarting z2m fixes it for another hour then it reoccurs. Running HA os on an rpi4, latest update, maybe something on the OS side is breaking consistent communication with zigbee sticks?

Citizen-2CB8A24A commented 9 months ago

ZHA died again this night without recovering. The ZigBee mesh was offline for about 11 hours. @puddly How do you want me to provide the debug log? It's zip'ed size is 50 MB.

puddly commented 9 months ago

@Citizen-2CB8A24A email is fine: puddly3@gmail.com.

pimeys commented 9 months ago

This is crazy, we just came home from a restaurant, and none of the lights worked. Same error as in the logs above, restarted HAOS multiple times and finally downgraded to the latest november version. Now the lights work. What was the update that caused this, do we know yet?

puddly commented 9 months ago

@pimeys As mentioned above, please upload a complete debug log and mention what version of HA you're running. I've had only a few people upload info that I can use. Without these logs, I can't do much but wait until someone having issues uploads them.

xyboox commented 9 months ago

@pimeys well, at least during the day, you have no problems. On my side, I have the heating system depending on the ZHA. Keeps getting cold... Just kidding ( well... almost )! I can't decipher the debug logs... can't sot the culprit. And is very frustrating.

xyboox commented 9 months ago

@puddly please let me know if you need more logs from me ( uploaded some, above ).

benswinney commented 9 months ago

Same issue. Zigbee devices go offline several times during the day and fail to recover. Usually requires a restart of HA to fix. It's never a set time it happens, it's very random

Issues persist since the 2023.12.x upgrade. Have been hoping each point release fixes it, but not to date.

I have approx 42 devices, the majority of those are routers rather than edge devices.

I'll enable debug and will email the logs to help narrow down the issue. Last time I did that the debug logs were quite large, so will try and do a smaller time frame

Citizen-2CB8A24A commented 9 months ago

@puddly You've found the underlying problem, congratulations! 👍

https://github.com/zigpy/zigpy-deconz/pull/241

Letalis commented 9 months ago

@puddly a new set of logs are sent to your email.

markaggar commented 9 months ago

@puddly - another log if you need it. https://1drv.ms/u/s!AnaRLop754fh0_0rfCwArsa35_GLfw?e=wAHSVS

This was captured while I was trying to figure out why my Centralite contact sensors (Lowes Iris) have not been working reliably for the past few months (state reporting is slow, and sometimes comes out of order or misses the open or close state change). As I was opening and closing the doors many times, the sensor state went unavailable (along with all the other sensors), and I downloaded diagnostics within a minute or so of this happening.