home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
71.11k stars 29.79k forks source link

ZHA Resets Randomly - All Devices Go Offline #107334

Closed cameron686 closed 3 months ago

cameron686 commented 8 months ago

The problem

About once per day, every device on my Zigbee network will go offline. ZHA will re-initialize and the devices will eventually come back. After ZHA re-initializes, some devices will have a very delayed response the first time they're switched, and occasionally some will fail to respond alltogether.

I'm using a Sonoff Zigbee 3.0 USB dongle.

What version of Home Assistant Core has the issue?

12.4

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

ZHA

Link to integration documentation on our website

No response

Diagnostics information

I have uploaded two logs: Starting from a fresh reboot on 1/3 Starting at midnight on 1/5 (the most recent occurrence of the problem was the morning of 1/5)

Both are too big to upload to Github, so I'm hosting them on Backblaze.

Example YAML snippet

No response

Anything in the logs that might be useful for us?

The issue appears have occurred at multiple times on 1/5 (based on when devices went offline): 0110, 0419, 0427, 0950, and at 1806 (while I was downloading the logs).

The problem seems to correspond with "NCP entered failed state. Requesting APP controller restart" in the system log (caused by bellows.ezsp).

Additional information

No response

puddly commented 8 months ago

Can you upgrade to 2024.1.1?

cameron686 commented 8 months ago

Can you upgrade to 2024.1.1?

Sure. I'll upgrade and turn debug logging back on just in case.

serrnovik commented 8 months ago

Had exactly the same issue after updating yesterday (HA core + SkyConnect Silicon Labs Multiprotocol) not sure who cased this. "NCP entered failed state." and then crash that HA restart would fix. Happended once in few hours. Updated again both to swifly arrived minor updates 2024.1.2 and 2.4.1. Monitoring.

keithcroshaw commented 8 months ago

Having once daily ZHA issues as well. Stop’s respond around 10PM EST. Reloading the integration doesn’t fix it. I’ve gone into settings preparing to restart core every night but there’s an update so, killing two birds with one stone. Maybe todays update will prevent tomorrow’s failure (or failure to fail hopefully)🤞

puddly commented 8 months ago

Can you enable ZHA debug logging for an hour before it happens and then disable it once the integration is not working?

keithcroshaw commented 8 months ago

I will test around 8, make sure it’s working and then enable it.

cameron686 commented 8 months ago

Can you enable ZHA debug logging for an hour before it happens and then disable it once the integration is not working?

It usually happens in the middle of the night, but it's really kind of random, so I don't have any way of knowing one hour ahead of time before it happens. I have just updated to the newest minor update and re-enabled debug logging. I'll post logs the next time it happens.

brylee123 commented 8 months ago

For me, this happens around 7 AM EST daily for the last two or three days. I have the HUSBZB-1 by Nortek (HubZ Smart Home Controller). Running 2024.1.2 on a RPi4

Edit: Now it does this around 7 AM and 9 AM.

dmulcahey commented 8 months ago

Do any folks with seemingly reproducible times have any integrations or backups or automations scheduled at the same time? We had a user last week with the Google calendar custom component stuttering the event loop causing stability issues. We run in the event loop… and the radio stacks are latency sensitive… anything else stuttering the loop will cause stuff like this.

keithcroshaw commented 8 months ago

I have a Google calendar check at midnight but my issue is observed around 10 PM. I’ll do some more testing to see if it happens earlier. The rest of my automations are in Node-Red. Are they separate from the event loop you’re talking about? NR has many more automations to go through then.

Huug4922 commented 8 months ago

I have the same problem. Sometimes it happens in one hour, sometimes longer. I have the Conbee3 running.

2024-01-07 21:34:13.947 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 3, 'value': 31105, 'timestamp': 1704659653.928778, 'min_update_delta': 30.0}) completed
2024-01-07 21:34:13.948 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>)
2024-01-07 21:34:13.948 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>) completed
2024-01-07 21:34:13.948 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 4, 'value': 27085, 'timestamp': 1704659653.92881, 'min_update_delta': 30.0})
2024-01-07 21:34:13.948 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 4, 'value': 27085, 'timestamp': 1704659653.92881, 'min_update_delta': 30.0}) completed
2024-01-07 21:34:13.949 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>)
2024-01-07 21:34:13.949 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>) completed
2024-01-07 21:34:13.949 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 7, 'value': 397, 'timestamp': 1704659653.928837, 'min_update_delta': 30.0})
2024-01-07 21:34:13.949 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 7, 'value': 397, 'timestamp': 1704659653.928837, 'min_update_delta': 30.0}) completed
2024-01-07 21:34:13.950 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>)
2024-01-07 21:34:13.950 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>) completed
2024-01-07 21:34:14.452 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x8182040001004c0cda81
2024-01-07 21:34:14.453 WARNING (MainThread) [zigpy_deconz.api] Unknown command received: Command(command_id=<CommandId.undefined_0x81: 129>, seq=130, payload=b'\x04\x00\x01\x00L\x0c\xda\x81')
2024-01-07 21:34:15.393 DEBUG (MainThread) [zigpy.application] Feeding watchdog
2024-01-07 21:34:15.394 DEBUG (MainThread) [zigpy_deconz.api] Sending CommandId.device_state{} (seq=9)
2024-01-07 21:34:15.394 DEBUG (MainThread) [zigpy_deconz.uart] Send: 0709000800000000
2024-01-07 21:34:15.504 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x22010049004200287472616e736c6174696f6e4c617965725f696e6974506c6174666f726d29436f6e4265652053746172746564205243415553453a205b30303030303230315d0a0d
2024-01-07 21:34:15.505 WARNING (MainThread) [zigpy_deconz.api] Unknown command received: Command(command_id=<CommandId.undefined_0x22: 34>, seq=1, payload=b'\x00I\x00B\x00(translationLayer_initPlatform)ConBee Started RCAUSE: [00000201]\n\r')
2024-01-07 21:34:15.505 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x0e020007002200
2024-01-07 21:34:15.505 DEBUG (MainThread) [zigpy_deconz.api] Received command CommandId.device_state_changed{'status': <Status.SUCCESS: 0>, 'frame_length': 7, 'device_state': DeviceState(network_state=<NetworkState2.CONNECTED: 2>, device_state=<DeviceStateFlags.APSDE_DATA_REQUEST_FREE_SLOTS_AVAILABLE: 8>), 'reserved': 0} (seq 2)
2024-01-07 21:34:17.197 DEBUG (MainThread) [zigpy_deconz.api] No response to 'CommandId.device_state' command with seq 9
2024-01-07 21:34:17.197 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/api.py", line 586, in _command
    return await fut
           ^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 665, in _watchdog_loop
    await self.watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 647, in watchdog_feed
    await self._watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/zigbee/application.py", line 91, in _watchdog_feed
    await self._api.get_device_state()
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/api.py", line 898, in get_device_state
    rsp = await self.send_command(CommandId.device_state)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/api.py", line 508, in send_command
    return await self._command(cmd, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/api.py", line 585, in _command
    async with asyncio_timeout(COMMAND_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
2024-01-07 21:34:17.203 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: TimeoutError()
2024-01-07 21:34:17.204 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: TimeoutError()
2024-01-07 21:34:17.204 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
2024-01-07 21:34:17.204 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Shutting down ZHA ControllerApplication
2024-01-07 21:34:17.206 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x7f10646cb1a0>)
2024-01-07 21:34:17.206 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x7f10646cb1a0>) completed
2024-01-07 21:34:17.206 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, 'PRAGMA wal_checkpoint;', [])
2024-01-07 21:34:17.214 DEBUG (MainThread) [zigpy_deconz.api] Serial '/dev/serial/by-id/usb-dresden_elektronik_ConBee_III_DE03190147-if00-port0' connection lost unexpectedly: None
2024-01-07 21:34:17.443 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, 'PRAGMA wal_checkpoint;', []) completed
2024-01-07 21:34:17.444 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x7f1062e1b920>)
2024-01-07 21:34:17.445 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x7f1062e1b920>) completed
2024-01-07 21:34:17.446 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method close of sqlite3.Connection object at 0x7f10647bdb70>)
2024-01-07 21:34:17.448 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method close of sqlite3.Connection object at 0x7f10647bdb70>) completed
2024-01-07 21:34:17.551 WARNING (MainThread) [homeassistant.helpers.dispatcher] Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_group of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7f1073751b90>>
2024-01-07 21:34:17.576 DEBUG (MainThread) [homeassistant.components.zha.entity] light.tradfri_kast_light: stopped polling during device removal
2024-01-07 21:34:17.576 DEBUG (MainThread) [homeassistant.components.zha.entity] light.keuken_aanrecht_led: stopped polling during device removal
2024-01-07 21:34:17.577 DEBUG (MainThread) [homeassistant.components.zha.entity] light.muurspot_deur_onder_licht: stopped polling during device removal
keithcroshaw commented 8 months ago

No issue tonight if course. Maybe the latest update got it or coincidence. I can update boring success logs if anybody wants.

cameron686 commented 8 months ago

No issue tonight if course. Maybe the latest update got it or coincidence. I can update boring success logs if anybody wants.

Since I upgraded to 2024.1.2, I haven't seen this issue return either. I haven't gone this long without a ZHA reset in quite some time, so I'm hopeful that something in the most recent release fixed whatever was causing this problem.

Strangely, after I upgraded, several of my Leviton in-wall Zigbee switches stopped responding. I wasn't even able to re-pair them without flipping the breaker to power them off and back on. They're been stable since then, though. My Zigbee network also seems a lot more responsive than it was before. I'm keeping my fingers crossed that everything stays this way.

Huug4922 commented 8 months ago

Mine still keeps dropping. Anyone idea's where to look? See my debug logging above.

afbeelding

Updated to lastest Core 2024.1.2. Running on a VM at my Synology NAS.

puddly commented 8 months ago

@Huug4922 (and anyone else commenting) please enable ZHA debug logging and post the full debug log.

Huug4922 commented 8 months ago

@puddly, sure but I had to make the file much smaler. It was 125mb. Everthing before 20:00 is removed.

home-assistant_zha_2024-01-07T21-30-12.065Z_shorted.log

cameron686 commented 8 months ago

Looks like I spoke too soon when I said this was working for me. I happened again last night; I woke up this morning and found that all but 1 or 2 of me Leviton in-wall Zigbee switches and several Sonoff plug-in switches had stopped working. I had to shut off power to all of them again in order to get ZHA to re-pair (otherwise it would just get stuck on "Configuring").

Strangely, every time this happens, it's only line-powered devices that drop off the network. I've never had a battery-powered device drop.

This time, system logs show a watchdog time time-out, not an AppController restart. Unfortunately, I had turned off debug logging as I thought the problem was fixed. I've upgraded to 2024.1.3 and turned debug logging back on. Now it's just a matter of waiting.

slunat commented 8 months ago

I have been having this issue ever since updating to 2024.1.1, and I have also experienced it on 2024.1.2 and today updated to 2024.1.3. The issue still persists. I started the debug logging during the initializing loop so logs from before it started will be missing, but ZHA restarted several times after starting logging so hopefully there is something useful in here that indicates what is causing it to need to reinitialise.

home-assistant_zha_2024-01-13T15-55-37.899Z.log

dmulcahey commented 8 months ago

I have been having this issue ever since updating to 2024.1.1, and I have also experienced it on 2024.1.2 and today updated to 2024.1.3. The issue still persists. I started the debug logging during the initializing loop so logs from before it started will be missing, but ZHA restarted several times after starting logging so hopefully there is something useful in here that indicates what is causing it to need to reinitialise.

home-assistant_zha_2024-01-13T15-55-37.899Z.log

we have a fix for the looping coming soon

cameron686 commented 8 months ago

@dmulcahey @puddly

Log Files Are Here

I think I should have some good log data for you. Today has been an extremely bad day for my Zigbee network. On multiple occasions, all my Zigbee devices became unavailable. It appears as though each instance corresponds with either a "Watchdog Error" or a "Watchdog Timeout" in the system log. The network is also extremely slow to recover; some devices require multiple attempts to switch them on/off before they start responding again.

I'm also now seeing warnings for Zigbee channel utilization, but I have nothing that could be interfering. My WiFi network is not on an overlapping channel, and I don't have neighbors. I did a WiFi scan to be sure; nothing nearby is interfering with my Zigbee network.

As far as I can tell, these are the times of interest (all on 1/13): 04:32-04:39 (multiple resets during this period) 05:12-05:15 (multiple resets during this period) 13:44 14:03 14:27 15:43 16:08

stefan814 commented 7 months ago

I'm having the same issue. Will turn on debugging and share logs. Have also noticed my Bluetooth integration becomes unavailable at the same time.

Has anyone successfully moved to Z2M as a solution?

dmulcahey commented 7 months ago

We have a fix for one of the causes of the excessive reloading. It will be in the next . release.

muller119 commented 7 months ago

i hope it will fix mine too to be sure i will add my debug too home-assistant_zha_2024-01-16T19-03-07.325Z.log

stefan814 commented 7 months ago

My logs here [Uploading home-assistant_zha_2024-01-17T00-51-47.466Z.log…]()

neoback45 commented 7 months ago

Hello! I have the same problem! With sonoff and nas Synology! There is a update with this problem? Logs :

Logger: homeassistant.components.zha.core.cluster_handlers Source: components/zha/core/cluster_handlers/init.py:388 Integration: Zigbee Home Automation (documentation, issues) First occurred: 21:37:25 (1 occurrences) Last logged: 21:37:25 [0xDF2A:1:0x0001]: 'asyncinitialize' stage failed: Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 326, in request return await req.result ^^^^^^^^^^^^^^^^ asyncio.exceptions.CancelledError The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/util/async.py", line 186, in sem_task return await task ^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/init.py", line 388, in async_initialize await self._get_attributes( File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/init.py", line 490, in _getattributes read, = await self.cluster.read_attributes( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/init.py", line 524, in read_attributes result = await self.read_attributes_raw(to_read, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/init.py", line 377, in request return await self._endpoint.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 325, in request async with asyncio_timeout(timeout): File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in aexit raise TimeoutError from exc_val TimeoutError

Huug4922 commented 7 months ago

Hello! I have the same problem! With sonoff and nas Synology! There is a update with this problem? Logs :

Unfortunately updating HA to 2024.1.5 didn't change anything for me. Using Conbee3 and a VM on my Synology DS920+. See my logs above.

neoback45 commented 7 months ago

Hello! I have the same problem! With sonoff and nas Synology! There is a update with this problem? Logs :

Unfortunately updating HA to 2024.1.5 didn't change anything for me. Using Conbee3 and a VM on my Synology DS920+. See my logs above.

how often does this happen?

puddly commented 7 months ago

@Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.

neoback45 commented 7 months ago

@Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.

And for sonoff?

brylee123 commented 7 months ago

And for HUSBZB-1?

On Jan 26, 2024 1:14 AM, neoback45 @.***> wrote:

@Huug4922https://github.com/Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.

And for sonoff?

— Reply to this email directly, view it on GitHubhttps://github.com/home-assistant/core/issues/107334#issuecomment-1911540491, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFH543S2VRF364NY2ZDLELDYQNCVZAVCNFSM6AAAAABBPHH5D2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJRGU2DANBZGE. You are receiving this because you commented.Message ID: @.***>

cameron686 commented 7 months ago

And for sonoff?

I am also still having issues with my Sonoff controller. The last few updates have improved the issue; I'm not seeing frequent resets one after another, but I am still setting resets roughly once per day; each reset usually corresponds with one or more Zigbee devices needing to be other power-cycled or (more often) re-paired to the network.

I've already posted logs several times, but I can post more if needed.

neoback45 commented 7 months ago

And for sonoff?

I am also still having issues with my Sonoff controller. The last few updates have improved the issue; I'm not seeing frequent resets one after another, but I am still setting resets roughly once per day; each reset usually corresponds with one or more Zigbee devices needing to be other power-cycled or (more often) re-paired to the network.

I've already posted logs several times, but I can post more if needed.

Hello

can you explain how you do it?

r0bb10 commented 7 months ago

same here, in the last days always around the same time.. all the zigbee network goes down for about a minute while the integration reloads, and the hass log freak out with errors in the meantime.. while almost everything goes back online it happened that some devices had to be repaired (mains routers and a couple of endevices).

coordinator is a sonoff-p cc2652

image

slunat commented 7 months ago

Just wanted to say the issue is mostly fixed for me since updating to 2024.1.4, it no longer reinitializes the addon over and over again.

However I am still having the occasional restart, which normally comes with the “NPC entered failed state. Requesting APP controller restart”. From what I’ve seen this is due to loss of connection of the zigbee dongle which is common with network dongles. However I am using a SkyConnect plugged in via USB. My server does run on a VMWare Workstation VM which I pass the device through to so maybe there’s an issue with the passthrough maybe dropping connection, I’m going to move over my install to bare metal soon and see if it improves.

mirceadamian commented 7 months ago

@slunat before 2023.12.1 my HUSBZB-1 never lost connection and I have never seen "NCP entered failed state" message. Something has changed there which is causing issues with all these sticks (I count atleast SkyConnect, Nortek HUSBZB-1, Sonoff and ConBee III). I have even changed the stick from HUSBZB-1 to Sonoff Dongle-P and re-registered all my sensors one by one (I did not migrate the radio) and the situation did not change. So for now I'm back to 2023.11.3, up for 22 days and I have not seen once the message again. I'm running HA in a docker container on a Synology NAS.

I think we have posted logs for all these sticks, not sure how can we help getting to the bottom of this bug. I have a suspicion that this is either something going wrong with the commands towards the controller (something that is causing the controller to become unresponsive) or simply there is to much traffic on the serial port which hogs the communication causing the restart.

neoback45 commented 7 months ago

@slunat before 2023.12.1 my HUSBZB-1 never lost connection and I have never seen "NCP entered failed state" message. Something has changed there which is causing issues with all these sticks (I count atleast SkyConnect, Nortek HUSBZB-1, Sonoff and ConBee III).

I have even changed the stick from HUSBZB-1 to Sonoff Dongle-P and re-registered all my sensors one by one (I did not migrate the radio) and the situation did not change. So for now I'm back to 2023.11.3, up for 22 days and I have not seen once the message again.

I'm running HA in a docker container on a Synology NAS.

I think we have posted logs for all these sticks, not sure how can we help getting to the bottom of this bug.

I have a suspicion that this is either something going wrong with the commands towards the controller (something that is causing the controller to become unresponsive) or simply there is to much traffic on the serial port which hogs the communication causing the restart.

Yes I have a same problem! 2023.11.3 work no problem!

cameron686 commented 7 months ago

However I am still having the occasional restart, which normally comes with the “NPC entered failed state. Requesting APP controller restart”

@slunat I used to see that message frequently until one or two updates ago. It seems like that error has disappeared, but I'm still seeing "Watchdog failure" and "Watchdog heartbeat timeout: TimeoutError()" on a regular basis (twice today, for instance). Are you seeing the watchdog errors as well? Judging from what I'm reading here, it appears that these ZHA reset issues are fairly widespread.

I am also running HA on a VM with USB passthrough; I'm using Proxmox, though, not VMWare. I thought that perhaps the USB passthrough was causing issues, but I'm passing through several other USB devices to different VMs, and they have a much higher data rate than my Zigbee controller. They've been running for months with no interruption. My HA VM also gets USB passthrough for an UPS, and it is rock solid. I don't think USB passthrough is the problem, although I am very curious to see what happens if you decide to switch over to a bare metal installation. I'm reluctant to set up a server just for HA when I have a perfectly good hypervisor already running; the UPS in my server rack is already close to capacity, and more hardware would likely push it over the edge.

cameron686 commented 7 months ago

Hello

can you explain how you do it?

@neoback45 I mentioned a few different things in my comment. What specifically were you wondering how to do?

neoback45 commented 7 months ago

Hello

can you explain how you do it?

@neoback45 I mentioned a few different things in my comment. What specifically were you wondering how to do?

For this :

I am also still having issues with my Sonoff controller. The last few updates have improved the issue; I'm not seeing frequent resets one after another, but I am still setting resets roughly once per day; each reset usually corresponds with one or more Zigbee devices needing to be other power-cycled or (more often) re-paired to the network.

I've already posted logs several times, but I can post more if needed.

tjerkw commented 7 months ago

My ZHA crashed too, all devices are offline. I reported it here (including the log on the moment it broke): https://github.com/home-assistant/core/issues/107200#issuecomment-1919804242

They might be related, my network is also slow. I'm using a Home Assistent Yellow, with multiprotocol disabled. HA Core 2024.1.5

Huug4922 commented 7 months ago

On Jan 26, 2024 1:14 AM, neoback45 @.***> wrote: @Huug4922https://github.com/Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release. >

@brylee123 I just upgraded to 2024.1.6 which completely broke the ZHA integration. It is not loading any more and I had to restore to 2024.1.5.

Here is my debug logging when reloading the integration. Still using Conbee III. error_log.txt

neoback45 commented 7 months ago

On Jan 26, 2024 1:14 AM, neoback45 @.***> wrote: @Huug4922https://github.com/Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release. >

@brylee123 I just upgraded to 2024.1.6 which completely broke the ZHA integration. It is not loading any more and I had to restore to 2024.1.5.

I upgrade to 2024.1.6 and zha work for the moment...

r0bb10 commented 7 months ago

On Jan 26, 2024 1:14 AM, neoback45 @.***> wrote: @Huug4922https://github.com/Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release. >

@brylee123 I just upgraded to 2024.1.6 which completely broke the ZHA integration. It is not loading any more and I had to restore to 2024.1.5.

i also updated and works as before.

mirceadamian commented 7 months ago

I think it is time to think about giving up on ZHA. This is broken for me since 2023.12.1, so about 2 months and so far I did not see much progress fixing the problem. I did not see the owners posting any questions either to help making some progress. So probably it is a dying integration. Sad.

dmulcahey commented 7 months ago

I think it is time to think about giving up on ZHA. This is broken for me since 2023.12.1, so about 2 months and so far I did not see much progress fixing the problem. I did not see the owners posting any questions either to help making some progress. So probably it is a dying integration. Sad.

lol talk about dramatic. If you don’t see us working on these tickets then you clearly aren’t paying attention.

Anyway, IF you are on a SI stick we may have found something that could help. We are testing a patch now.

mirceadamian commented 7 months ago

That's encouraging. I have a Sonoff ZBDongle-P. Very happy to help if I can.

Huug4922 commented 7 months ago

On Jan 26, 2024 1:14 AM, neoback45 @.***> wrote: @Huug4922https://github.com/Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release. >

@brylee123 I just upgraded to 2024.1.6 which completely broke the ZHA integration. It is not loading any more and I had to restore to 2024.1.5.

i also updated and works as before.

Today I tried again to upgrade to 2024.1.6, still the integration wont load. Hopefully someone can take a look at my debug-logging. home-assistant_zha_2024-02-03T15-59-01.903Z.log

r0bb10 commented 7 months ago

crashes regularly almost once a day, happened in the last hour also..

home-assistant_zha_2024-02-05T10-09-14.097Z.log

i can leave zha debugging enabled if needed and try to record what happens there next time.

artinbastani commented 7 months ago

I was on a nortek husbzb-1 and was crashing like 10 times a day. So I bought and migrated over to a skyconnect. I crash less now, maybe once or twice a day, but still crashing. Ive attached my logs

zha.log

puddly commented 7 months ago

@r0bb10 Please enable debug logging for the entire duration.

@artinbastani In your log file, I also see no crashing. Just devices reporting power usage.