home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
69.94k stars 29.03k forks source link

HA unresponsive after `[bellows.ezsp] NCP entered failed state. Requesting APP controller restart` #119872

Open Antriel opened 3 weeks ago

Antriel commented 3 weeks ago

The problem

Since 2024.6.3 I sometimes (around once a day) find my HA unresponsive. I can connect via app/web, but it gets stuck on loading. I had this error before: https://github.com/home-assistant/core/issues/116862, and the last update fixed that issue, but unfortunately I now get another error. When it happens, no zigbee device/automations work, although non-zigbee devices keep being able to talk to HA (I can see data history for them afterwards). The only thing that helps is either disconnecting and reconnecting the zigbee USB (SLZB-06m), or rebooting the server (RPI5) by replugging the power.

What version of Home Assistant Core has the issue?

core-2024.6.3

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

zha

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

config_entry-zha-3887042cefb3201b61d2be45b6b9e134.json

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2024-06-18 00:14:27.348 ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart
2024-06-18 00:14:27.350 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/sensor.py", line 272, in _refresh
    await self.async_update()
  File "/usr/src/homeassistant/homeassistant/components/zha/entity.py", line 256, in async_update
    await asyncio.gather(*tasks)
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/homeautomation.py", line 128, in async_update
    result = await self.get_attributes(attrs, from_cache=False, only_cache=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 512, in _get_attributes
    read, _ = await self.cluster.read_attributes(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/zcl/__init__.py", line 528, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/quirks/__init__.py", line 289, in read_attributes_raw
    results = await super().read_attributes_raw(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/quirks/__init__.py", line 334, in _read_attributes
    return await super()._read_attributes(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/zcl/__init__.py", line 377, in request
    return await self._endpoint.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/endpoint.py", line 265, in request
    return await self.device.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/device.py", line 339, in request
    await send_request()
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 841, in request
    await self.send_packet(
  File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 869, in send_packet
    status, _ = await self._ezsp.sendUnicast(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 215, in _command
    return await self._protocol.command(name, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/protocol.py", line 77, in command
    await self._gw.send_data(data)
  File "/usr/local/lib/python3.12/site-packages/bellows/uart.py", line 57, in send_data
    await self._transport.send_data(data)
  File "/usr/local/lib/python3.12/site-packages/bellows/ash.py", line 680, in send_data
    await asyncio.shield(
  File "/usr/local/lib/python3.12/site-packages/bellows/ash.py", line 631, in _send_data_frame
    await ack_future
bellows.ash.NcpFailure
2024-06-18 00:14:27.353 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/sensor.py", line 272, in _refresh
    await self.async_update()
  File "/usr/src/homeassistant/homeassistant/components/zha/entity.py", line 256, in async_update
    await asyncio.gather(*tasks)
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/smartenergy.py", line 325, in async_update
    result = await self.get_attributes(attrs, from_cache=False, only_cache=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 512, in _get_attributes
    read, _ = await self.cluster.read_attributes(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/zcl/__init__.py", line 528, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/quirks/__init__.py", line 289, in read_attributes_raw
    results = await super().read_attributes_raw(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/quirks/__init__.py", line 334, in _read_attributes
    return await super()._read_attributes(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/zcl/__init__.py", line 377, in request
    return await self._endpoint.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/endpoint.py", line 265, in request
    return await self.device.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy/device.py", line 339, in request
    await send_request()
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 841, in request
    await self.send_packet(
  File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 869, in send_packet
    status, _ = await self._ezsp.sendUnicast(
                      ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'sendUnicast'

Additional information

After this issue happens, I get: WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.rpi_power_status is taking over 10 seconds, and then every 30 seconds this line repeats: WARNING (MainThread) [homeassistant.components.binary_sensor] Updating rpi_power binary_sensor took longer than the scheduled update interval 0:00:30.

home-assistant[bot] commented 3 weeks ago

Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `zha` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign zha` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


zha documentation zha source (message by IssueLinks)

mhentschke commented 3 weeks ago

I am getting a very similar issue about every 10 minutes, where after upgrade I get the NCP Entered Failed State Error and another one for Task exception was never retrieved. This last one has different stacktrace inside:

`Logger: homeassistant Source: components/zha/core/cluster_handlers/init.py:82 First occurred: 1:51:15 PM (1 occurrences) Last logged: 1:51:15 PM

Error doing job: Task exception was never retrieved (None) Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 682, in async_device_initialized await self._async_device_rejoined(zha_device) File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 726, in _async_device_rejoined await zha_device.async_configure() File "/usr/src/homeassistant/homeassistant/components/zha/core/device.py", line 604, in async_configure await self.identify_ch.trigger_effect( File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/init.py", line 82, in wrapper return await RETRYABLE_REQUEST_DECORATOR(func)(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/zigpy/util.py", line 131, in retry return await func() ^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/zigpy/zcl/init.py", line 377, in request return await self._endpoint.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/zigpy/endpoint.py", line 265, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/zigpy/device.py", line 339, in request await send_request() File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 841, in request await self.send_packet( File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 857, in send_packet await self._ezsp.setExtendedTimeout(device.ieee, True) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'NoneType' object has no attribute 'setExtendedTimeout'`

By the way, I'm using ezsp radio on a Tasmota flashed Sonoff Zigbee Bridge

cdvreede commented 1 week ago

having same problem - any progress on this one??

puddly commented 1 week ago

After this issue happens, I get: WARNING (MainThread) [homeassistant.helpers.entity] Update of binary_sensor.rpi_power_status is taking over 10 seconds, and then every 30 seconds this line repeats: WARNING (MainThread) [homeassistant.components.binary_sensor] Updating rpi_power binary_sensor took longer than the scheduled update interval 0:00:30.

Both this and #116862 are not issues that can be fixed.

From the log above, it looks like something is stuttering Home Assistant's event loop by 30+ seconds: any delay of more than 6 seconds will cause the Zigbee radio to assume the connection between it and the computer has been broken and the radio will reset.

You likely have some rogue custom integration, addon, or filesystem problem that causes huge delays.

Antriel commented 1 week ago

I assume the rpi_power_status is just a consequence of staying in invalid state, I only included it for completeness, as I don't know what could be important.

The assumption that the zigbee radio connection has been broken, and HA wanting to reset it... I suppose that's fine (if not, that's secondary issue), but it shouldn't be left in invalid state without it being able to properly restart itself. I'm not sure if it's just some unhandled exception, or the actual driver bugs out, or w/e.

Parts of my zigbee network are pretty unstable at times, and maybe that's causing the issue. Some end devices need around 4 dedicated routers to hop through, and they often end up unresponsive after a few days – I'm not sure if that might be the source of this issue. Regardless, it shouldn't break the entire network.

Is there any way to diagnose which integration/addon/filesystem problem could be the cause? I only have 2 unofficial addons that use bluetooth and don't have support for anything zigbee, pretty much everything is else official HA stuff, or some widely used ones like Plex. I don't see anything but ZHA that could interfere with zigbee.