home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
71.11k stars 29.79k forks source link

SONOFF ZigBee 3.0 Dongle E going unresponsive after a while #106185

Open zsmb13 opened 8 months ago

zsmb13 commented 8 months ago

The problem

I have a setup with a couple dozen Zigbee devices, coordinated by a SONOFF ZigBee 3.0 Dongle E. Every couple days the dongle / ZHA crash in an unrecoverable way, which I can only solve by physically unplugging and replugging the USB stick. I'm using firmware 7.3.1.0 build 176, flashed by the web-based tool.

I've captured a couple stacktraces now when this happened, hopefully there's something meaningful in there. Let me know what other details I can provide.

What version of Home Assistant Core has the issue?

core-2023.12.3

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

zha

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

Diagnostics, captured after recovering from a failure already:

zha-4a8c1b710746a49bb72205002a49a0b7-Zigbee Coordinator-8733fd8a1920e10f5420c786d6c31417.json.txt

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Example 1, going unresponsive:

2023-12-20 21:13:34.973 DEBUG (MainThread) [homeassistant.components.zha] Failed to set up ZHA
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 261, in reset
    return await self._reset_future
           ^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 163, in async_setup_entry
    zha_gateway = await ZHAGateway.async_from_config(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 182, in async_from_config
    await instance.async_initialize()
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 200, in async_initialize
    await app.startup(auto_form=True)
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 223, in startup
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 139, in connect
    await ezsp.startup_reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 119, in startup_reset
    await self.reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 145, in reset
    await self._gw.reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 260, in reset
    async with asyncio_timeout(RESET_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

-----------

Example #2, again going unresponsive:

2023-12-21 16:54:19.184 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: TimeoutError()
2023-12-21 16:54:19.184 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 68, in command
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 661, in _watchdog_loop
    await self._watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 893, in _watchdog_feed
    (res,) = await self._ezsp.readCounters()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 67, in command
    async with asyncio_timeout(EZSP_CMD_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
2023-12-21 16:54:19.264 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: TimeoutError()
2023-12-21 16:54:19.266 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: TimeoutError()
2023-12-21 16:54:19.266 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
2023-12-21 16:54:19.268 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Shutting down ZHA ControllerApplication

2023-12-21 16:54:26.946 DEBUG (MainThread) [homeassistant.components.zha] Failed to set up ZHA
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 261, in reset
    return await self._reset_future
           ^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 163, in async_setup_entry
    zha_gateway = await ZHAGateway.async_from_config(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 182, in async_from_config
    await instance.async_initialize()
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 200, in async_initialize
    await app.startup(auto_form=True)
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 223, in startup
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 139, in connect
    await ezsp.startup_reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 119, in startup_reset
    await self.reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 145, in reset
    await self._gw.reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 260, in reset
    async with asyncio_timeout(RESET_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

----

Example 2, after remounting the USB drive from the command line and restarting Home Assistant, still failing:

2023-12-21 17:02:45.033 DEBUG (MainThread) [homeassistant.components.zha] Failed to set up ZHA
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 261, in reset
    return await self._reset_future
           ^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 163, in async_setup_entry
    zha_gateway = await ZHAGateway.async_from_config(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 182, in async_from_config
    await instance.async_initialize()
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 200, in async_initialize
    await app.startup(auto_form=True)
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 223, in startup
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 139, in connect
    await ezsp.startup_reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 119, in startup_reset
    await self.reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 145, in reset
    await self._gw.reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 260, in reset
    async with asyncio_timeout(RESET_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

Additional information

No response

home-assistant[bot] commented 8 months ago

Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `zha` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign zha` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


zha documentation zha source (message by IssueLinks)

zsmb13 commented 8 months ago

Got one more example, though I don't expect there's much new info in here:

2023-12-21 17:39:59.603 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: TimeoutError()
2023-12-21 17:39:59.603 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 68, in command
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 661, in _watchdog_loop
    await self._watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 893, in _watchdog_feed
    (res,) = await self._ezsp.readCounters()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 67, in command
    async with asyncio_timeout(EZSP_CMD_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
2023-12-21 17:39:59.635 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: TimeoutError()
2023-12-21 17:39:59.638 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: TimeoutError()
2023-12-21 17:39:59.638 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
2023-12-21 17:39:59.638 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Shutting down ZHA ControllerApplication
...
2023-12-21 17:40:08.901 DEBUG (MainThread) [homeassistant.components.zha] Failed to set up ZHA
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 261, in reset
    return await self._reset_future
           ^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 163, in async_setup_entry
    zha_gateway = await ZHAGateway.async_from_config(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 182, in async_from_config
    await instance.async_initialize()
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 200, in async_initialize
    await app.startup(auto_form=True)
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 223, in startup
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 139, in connect
    await ezsp.startup_reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 119, in startup_reset
    await self.reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 145, in reset
    await self._gw.reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 260, in reset
    async with asyncio_timeout(RESET_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
zsmb13 commented 8 months ago

Seems like I found a solution that fixes this without unplugging physically, though this is still far from great:

  1. Unmount USB (command works on my Synology system at least)
    sudo sh -c "echo 0 > /sys/bus/usb/devices/usb1/authorized"
  2. Restart Home Assistant
  3. Remount USB
    sudo sh -c "echo 1 > /sys/bus/usb/devices/usb1/authorized"

Not sure if it's possible to skip the Home Assistant restart step. Perhaps it's enough to wait for a while between unmount/remount, I'll try it next time it fails.

puddly commented 8 months ago

Could you post a complete ZHA debug log from startup up until the crash?

zsmb13 commented 8 months ago

Is there a good way to redact these logs, or share them privately at least? There's a lot of data in there, and one of the logs I have is also quite large (500 MBs).

puddly commented 8 months ago

Thanks! Compressing the log usually shrinks it significantly. If you want to send it privately, you can email it/a link to it to puddly3@gmail.com.

zsmb13 commented 8 months ago

Sent the logs over email. Thank you!

zsmb13 commented 8 months ago

Had it fail again just this morning. I've sent more logs over email.

This time I tried unmounting and remounting the USB device from the command line without restarting Home Assistant. This was unsuccessful, ZHA initialization got stuck for at least 5 minutes, at which point I restarted Home Assistant. That restart after unmounting/remounting fixed things, and ZHA initialized successfully.

So it seems like the unmount/remount steps and a HA restart are both required for a fix, but the ordering of these can vary a bit.

puddly commented 8 months ago

Unfortunately, it looks like your dongle is just outright crashing. Could your hypervisor/host OS be preventing it from communicating?

Did this work in the past? Can you test an older version of ZHA? Your dongle is crashing during a quiet period so it just happens with no clear cause.

One thing to try is to downgrade to the last stable release of EZSP 6.9.x.

zsmb13 commented 8 months ago

I'll see if I can find any problems around how the device is mounted into Docker, but I have nothing that would indicate that the problem is there. I've set up things recently, so I don't really have experience with previous versions of things. I'll also see if I can perform a firmware downgrade and if that helps.

Is there any chance that this could be a hardware fault, could it be worth returning the dongle and replacing it with a new one?

sidwin9 commented 8 months ago

Hello, I have the same problems. For Zigbee, nothing works correctly since version 2023.12.0. I carried out tests with zigbee2mqtt and everything works perfectly, the problem is ZHA. I use a Sonoff zigbee 3.0 version P dongle. It is no. longer even possible to detect new hardware

jeffh2os commented 8 months ago

My Zigbee also died after the upgrade and as it's in a vm I restored to the snapshot to before the upgrade and everything worked fine again. The hardware is working normally on: Core 2023.11.2 Supervisor 2023.12.0 Operating System 11.1 Frontend 20231030.2 However when I upgrade Core to 2023.12.4 I loose Zigbee totally. I have tried this several times and each time Zigbee dies I roll it back to the before upgrade snapshot and it's fine again so I know it's not the hardware. I have not pulled logs yet as I'm not sure which ones are viable for troubleshooting this issue currently. Hardware is Sonoff 3.0 USB Dongle Plus , Model ZBDongle-E

sidwin9 commented 8 months ago

Hello, For me it doesn't work properly when restoring an earlier version like 2023.12.2. Big problems persist.

puddly commented 8 months ago

@sidwin9 You aren't using a Sonoff Dongle-E. Please don't cross-post to multiple issues.

arkovda commented 8 months ago

Hi, I have the same problem core-2023.12.4 supervisor-2023.12.0 SONOFF Zigbee 3.0 USB Dongle Plus (ZBDongle-E)

  1. Hade problems and did a restore from a previous backup
  2. The problem vanished.
  3. I tried upgrading to latest versions
  4. does not work anymore.
zsmb13 commented 8 months ago

Got a bit more on the issue, it's still happening, and what's worse is that it can seemingly take down Home Assistant entirely. After ZHA started failing today, I restarted Home Assistant (which on its own did not fix the issue), and after 15 minutes it stopped responding altogether. It's one thing that ZHA is failing, but it should not be able to make Home Assistant unaccessible.

The Docker container is still running and consuming resources:

image

However, it stopped producing logs at 21:13 with the very last message in the log being this:

2023-12-30 21:13:09.330 DEBUG (MainThread) [bellows.ezsp] Resetting EZSP
arkovda commented 8 months ago

Hi, I got it working today. I Believe it might have been Silicon Labs Multiprotocol (Current version: 2.3.2) causing the issue but I was playing around with Silicon Labs Flasher Add-on. Could be relevant.

scality-jeff commented 8 months ago

Update: For me the latest updates seemed to have resolved whatever issue there was, Zigbee working normally now. Running: Core 2024.1.2 Supervisor 2023.12.0 Operating System 11.3 Frontend 20240104.0

zsmb13 commented 8 months ago

@scality-jeff What device and firmware are you running?

scality-jeff commented 8 months ago

@zsmb13 I've got SONOFF Zigbee 3.0 USB Dongle Plus V2 (from home assistant) From product package - Zigbee 3.0 USB Model ZBDongle-E This is in vmware with the usb dongle as pass through from the vm host to the vm.

dmulcahey commented 7 months ago

Try the release later today. It contains stability improvements for SI radios.

issue-triage-workflows[bot] commented 4 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

zsmb13 commented 4 months ago

I have a restart script for this now that remounts the dongle and restarts HA (this triggers once every couple days) and I also reboot HA every night, which sort of solves the problem, but the underlying issue of the dongle going unresponsive is still very much there.

watsug commented 4 months ago

I also faced this issue and I cannot find a way to solve it. In my case, it's about the latest HA running on HA Green. I have tried ZigBee Dongle E with the pretty old FW - 6.10.3.0 build 297 and also the more up-to-date one - 7.x. In both cases after max one week the Zigbee died... ;(

zsmb13 commented 1 month ago

Still getting this, more and more recently. These are the exceptions I can find in my logs now:

2024-07-14 21:23:22.038 DEBUG (MainThread) [homeassistant.components.zha] Failed to set up ZHA
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 152, in async_setup_entry
    zha_gateway = await ZHAGateway.async_from_config(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 197, in async_from_config
    await instance.async_initialize()
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 215, in async_initialize
    await app.startup(auto_form=True)
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 233, in startup
    await self.connect()
  File "/usr/local/lib/python3.12/site-packages/bellows/zigbee/application.py", line 149, in connect
    await ezsp.startup_reset()
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 125, in startup_reset
    await self.reset()
  File "/usr/local/lib/python3.12/site-packages/bellows/ezsp/__init__.py", line 151, in reset
    await self._gw.reset()
TimeoutError
2024-07-14 21:23:22.041 INFO (MainThread) [universal_silabs_flasher.flasher] Probing ApplicationType.GECKO_BOOTLOADER at 115200 baud
2024-07-14 21:23:22.041 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to '/dev/ttyACM0' (115200 baudrate)
2024-07-14 21:23:23.042 INFO (MainThread) [universal_silabs_flasher.flasher] Probing ApplicationType.CPC at 460800 baud
2024-07-14 21:23:23.043 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to '/dev/ttyACM0' (460800 baudrate)
2024-07-14 21:23:24.044 INFO (MainThread) [universal_silabs_flasher.flasher] Probing ApplicationType.CPC at 115200 baud
2024-07-14 21:23:24.044 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to '/dev/ttyACM0' (115200 baudrate)
2024-07-14 21:23:25.046 INFO (MainThread) [universal_silabs_flasher.flasher] Probing ApplicationType.CPC at 230400 baud
2024-07-14 21:23:25.046 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to '/dev/ttyACM0' (230400 baudrate)
2024-07-14 21:23:26.047 INFO (MainThread) [universal_silabs_flasher.flasher] Probing ApplicationType.EZSP at 115200 baud
2024-07-14 21:23:26.048 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to '/dev/ttyACM0' (115200 baudrate)
2024-07-14 21:23:58.055 DEBUG (MainThread) [homeassistant.components.zha.repairs.wrong_silabs_firmware] Failed to probe application type
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/repairs/wrong_silabs_firmware.py", line 87, in probe_silabs_firmware_type
    await flasher.probe_app_type()
  File "/usr/local/lib/python3.12/site-packages/universal_silabs_flasher/flasher.py", line 235, in probe_app_type
    raise RuntimeError("Failed to probe running application type")
RuntimeError: Failed to probe running application type
zsmb13 commented 1 month ago

I'm now on Core 2024.7.2, and things definitely got worse recently. My instance can't go 24 hours without the Zigbee connection failing as described above. Sometimes it falls over every hour.

ShieldsUP78 commented 1 month ago

I'm having exactly the same issue as @zsmb13 . My Sonoff Dongele E just disconnected from the USB and doesn't show up in hardware anmore. I need to unplug and re plug it in. It seems to happen overnight mainly when everything is quiet. It onlt started doing this a couple months ago after I upgraded my verions of HAOS.

zsmb13 commented 1 month ago

I gave up and bought a Home Assistant Connect ZBT-1. No issues for 2.5 weeks now since I'm using it. Migration was a couple minutes, well worth it.

ShieldsUP78 commented 1 month ago

thanks @zsmb13 I think I'll try that also. What did you need to do to migrate? I have my HAOS installed on a mini-pc so I'm assumning I can ust plug into one of the USB ports?

zsmb13 commented 1 month ago

I run in Docker with ZHA. I pretty much followed these steps https://connectzbt1.home-assistant.io/migrate-zigbee-zha-only/

Also checked a couple other places that described this, YouTube, forum, etc., but this guide seems to be accurate. Also worth doing just a regular backup before starting the migration, so that you have a file to restore from, whatever happens.