Automations issue/ ZHA Network busy errors after migrating to Skyconnect dongle

jason1980p commented 1 year ago

The problem

After migrating to the Home Assistant SkyConnect USB dongle I've been running into network busy errors. I currently have the dongle connected to a USB extension cable connected to a Raspberry Pi 4.

What version of Home Assistant Core has the issue?

Home Assistant 2023.1.6

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Automation

Link to integration documentation on our website

https://www.home-assistant.io/docs/automation/

Diagnostics information

No response

Example YAML snippet

alias: "Pico: Master Bathroom remote"
description: ""
use_blueprint:
  path: stephack/core-pico.yaml
  input:
    pico_remote: a58ddd4ab05559d05de8267f82dd7c49
    top_on:
      - service: light.turn_on
        data:
          brightness_step_pct: 100
        target:
          entity_id: light.light_unknown_master_bathroom_lights_zha_group_0x0006
    bottom_off_release:
      - service: light.turn_off
        data: {}
        target:
          entity_id: light.light_unknown_master_bathroom_lights_zha_group_0x0006
    up_raise:
      - service: light.turn_on
        data:
          brightness_step_pct: 20
        target:
          entity_id:
            - light.light_unknown_master_bathroom_lights_zha_group_0x0006
    down_lower:
      - service: light.turn_on
        data:
          brightness_step_pct: -20
        target:
          entity_id: light.light_unknown_master_bathroom_lights_zha_group_0x0006

Anything in the logs that might be useful for us?

Logger: homeassistant.components.automation.pico_master_bedroom_remote
Source: components/zha/light.py:292
Integration: Automation (documentation, issues)
First occurred: January 21, 2023 at 8:37:18 PM (7 occurrences)
Last logged: 7:21:23 PM

Pico: Master Bathroom remote: Choose at step 1: choice 1: Choose at step 1: choice 1: Error executing script. Unexpected error for call_service at pos 1: Failed to enqueue message after 3 attempts: <EmberStatus.NETWORK_BUSY: 161>
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 451, in _async_step
    await getattr(self, handler)()
  File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 684, in _async_call_service_step
    await service_task
  File "/usr/src/homeassistant/homeassistant/core.py", line 1755, in async_call
    task.result()
  File "/usr/src/homeassistant/homeassistant/core.py", line 1792, in _execute_service
    await cast(Callable[[ServiceCall], Awaitable[None]], handler.job.target)(
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 213, in handle_service
    await service.entity_service_call(
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 678, in entity_service_call
    future.result()  # pop exception if have
  File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 958, in async_request_call
    await coro
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 715, in _handle_entity_call
    await result
  File "/usr/src/homeassistant/homeassistant/components/light/__init__.py", line 570, in async_handle_light_on_service
    await light.async_turn_on(**filter_turn_on_params(light, params))
  File "/usr/src/homeassistant/homeassistant/components/zha/light.py", line 978, in async_turn_on
    await super().async_turn_on(**kwargs)
  File "/usr/src/homeassistant/homeassistant/components/zha/light.py", line 292, in async_turn_on
    result = await self._level_channel.move_to_level_with_on_off(
  File "/usr/local/lib/python3.10/site-packages/zigpy/zcl/__init__.py", line 324, in request
    return await self._endpoint.request(
  File "/usr/local/lib/python3.10/site-packages/zigpy/group.py", line 57, in request
    await self.application.send_packet(
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 782, in send_packet
    raise zigpy.exceptions.DeliveryError(
zigpy.exceptions.DeliveryError: Failed to enqueue message after 3 attempts: <EmberStatus.NETWORK_BUSY: 161>
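
The final line of the traceback shows the send being abandoned after three attempts with NETWORK_BUSY. Purely as an illustration, and not a description of how bellows or ZHA handle retries internally, a caller-side wrapper that waits progressively longer between attempts might look like the sketch below. `DeliveryError` is a local stand-in for `zigpy.exceptions.DeliveryError` so the example runs without zigpy installed; `call_with_backoff`, `flaky_send` and all parameter values are hypothetical.

```python
import asyncio
import random


class DeliveryError(Exception):
    """Local stand-in for zigpy.exceptions.DeliveryError (assumption for this sketch)."""


async def call_with_backoff(send, attempts=4, base_delay=0.5, max_delay=8.0):
    """Await the async callable `send`, retrying on DeliveryError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await send()
        except DeliveryError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with jitter, so retries from several
            # automations do not all reach the coordinator at the same moment.
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))


async def demo():
    calls = {"n": 0}

    async def flaky_send():
        # Hypothetical sender that reports NETWORK_BUSY twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise DeliveryError("NETWORK_BUSY")
        return "ok"

    result = await call_with_backoff(flaky_send, base_delay=0.01)
    return result, calls["n"]
```

Running `asyncio.run(demo())` exercises the wrapper against the fake sender. Retrying only hides the symptom; if the network stays busy, the underlying congestion still needs to be found.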

Additional information

No response

johnlento commented 5 months ago

@puddly First of all this firmware is pretty awesome, thank you! I was running a ZBDongle-P but no matter what firmware I used or what settings I set, I would get Watchdog failures frequently, sometimes every 10 minutes, and it would kill the stick and cause it to reinitialize. I elected to move to the ZBDongle-E just to give the other chipset a chance. When I migrated, I couldn't do anything due to Network Busy, so this firmware is the only thing that even remotely allows my coordinator to start and work.

The annoying thing, which @evelant pointed out, is that groups no longer behave like ZHA groups. With the ZBDongle-P a group command would broadcast and the group would all honor it at the same time, even if it took 60 seconds to propagate through the network, and HA would not mark the group on until it had. With the ZBDongle-E and this firmware a broadcast is sent, the switch goes to on then to off immediately, and generally a few members of the group will turn on while the rest will turn on later or not at all. Multiple off/on commands are required to get the full group of lights to turn on or off, and HA reporting always lags. On the other hand the coordinator doesn't crash every 10 minutes so I think it's a net win.

My question is whether there are any settings I can do to make that more ZBDongle-P like? Subjectively it does feel like individual devices respond hella fast but ZHA groups are all over the place so I suspect there is something with spamming group broadcasts.

I have 20 ZHA groups varying from 2-9 end devices each. They are essentially light zones that I then apply adaptive lighting to. So my groups are getting spammed hard and heavy with broadcasts. Most of my routers are Inovelli Blues which did and may still have issues with Zigbee group broadcasts.

Has anyone solved this issue with this firmware and setup? I am contemplating trying to do light groups but it's a lot of work... Also happy to take some PCAPs if I can get some quick guidance as it's been a hot minute since I did traffic analysis.

evelant commented 4 months ago

@johnlento That's a very good description of the problem. Exactly what I'm seeing as well.

@puddly I'm not sure how this group problem is happening but I am certain that it was introduced with the new build you provided. This never happened on the previous build. Groups always responded in unison or not at all due to network_busy error. Looking at the config changes you made I have no idea how they could have caused such behavior. Maybe a bug introduced upstream by silabs in the newer sdk? Any thoughts on other possible causes since it seems pretty certain that this was introduced with the new build?

evelant commented 4 months ago

Another possible clue about the group issue @puddly -- it only happens when the command is coming from the coordinator. If I address a group via a binding on an inovelli blue switch all members always respond. Only group commands from the coordinator seem to trigger this partial group response behavior.

puddly commented 4 months ago

If you have a second Silicon Labs coordinator (e.g. a HUSBZB-1 or another SkyConnect), you can use it as a packet capture tool. If you are indeed seeing a difference and can reliably replicate group commands working worse with the tweaked firmware, perform a packet capture on your ZigBee channel for about five minutes and include the group command in there, one with the old firmware and one with the new:

pip install bellows
bellows --baudrate 115200 -d /dev/serial/by-id/your-other-zigbee-stick dump -c 20 -w capture.pcap

Change 20 to your ZigBee network's channel and make sure to include your network key. Both can be found in the ZHA configuration page and in the backup JSON. I can try to take a look at the difference.

TheJulianJES commented 4 months ago

@evelant EMBER_BROADCAST_TABLE_SIZE is likely set to 15 on all your (EZSP) router devices. This value cannot be changed. If there are too many broadcasts in a short amount of time, your routers will not rebroadcast them, basically "voiding" those broadcasts.

However, Z-Stack firmware was modified a long time ago to lift the broadcast limit, like done with this EZSP firmware now. I'm running this without any issues and most Z-Stack users (unknowingly) use an even higher limit, I think. I don't have any negative impacts, only improvements.

I'd set up a network sniffer to see how much broadcast traffic there is on your network. Using adaptive lighting or constantly changing colors of Zigbee group lights just doesn't work and will cause issues. The underlying behavior between Z-Stack and EmberZNet seems to be different, but I doubt there's much we can do about this.

One thing you can try is to position your coordinator more centrally (with regard to your group lights). I'm not sure this is actually the case, but the routers might be better able to hear/honor broadcasts coming from the coordinator, even if their "broadcast slots" are already/mostly filled up. Also, make sure the coordinator is on an extension cable, away from interference like 2.4 GHz WiFi APs, USB 3 SSDs, ...
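
To make the broadcast table limit more concrete, here is a toy simulation of a router with a fixed number of broadcast slots. It is only a sketch of the idea, not EmberZNet's actual logic: the table size of 15 comes from the value mentioned above, while the time each entry occupies a slot (`hold_seconds`) and the function name are assumptions made for illustration.

```python
from collections import deque


def simulate_broadcast_table(send_times, table_size=15, hold_seconds=8.0):
    """Return (relayed, dropped) send times for a router with a fixed-size broadcast table.

    Each relayed broadcast occupies one slot for `hold_seconds` (assumed value);
    a broadcast that arrives while every slot is occupied is dropped.
    """
    active = deque()  # expiry times of occupied slots, oldest first
    relayed, dropped = [], []
    for t in sorted(send_times):
        # Free the slots whose entries have expired by time t.
        while active and active[0] <= t:
            active.popleft()
        if len(active) < table_size:
            active.append(t + hold_seconds)
            relayed.append(t)
        else:
            dropped.append(t)
    return relayed, dropped


# A burst of 20 broadcasts sent 0.1 s apart.
burst = [i * 0.1 for i in range(20)]
relayed, dropped = simulate_broadcast_table(burst)
print(len(relayed), len(dropped))  # prints: 15 5
```

With the same 20 broadcasts spaced one second apart, slots free up before the table fills and nothing is dropped in this model, which is why bursts matter more than the long-run average.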

evelant commented 4 months ago

@puddly Unfortunately I just gave my extra silabs coordinator to a friend. I still have a ti zstack dev board somewhere, maybe I can capture with that.

@TheJulianJES I'm pretty certain this group issue arose with the new firmware build and not due to any configuration in my network since it's the only thing that changed. Before the update groups always responded in unison. After the update random group members don't respond. Nothing else changed -- same channel, same devices, same location, same extension cable, same coordinator hardware, same automations. I know it's puzzling since from my understanding of zigbee groups they should either all respond or none respond. I don't know how the new build could possibly be causing this but as far as I can tell all signs point to it being something with the new build and not with my setup/configuration.

TheJulianJES commented 4 months ago

I'm pretty certain this group issue arose with the new firmware build and not due to any configuration in my network since it's the only thing that changed.

It's a combination of both. Increasing EMBER_BROADCAST_TABLE_SIZE only on the coordinator (which is what the new build does) can have an impact on timing and how many broadcasts can be sent. Your whole network configuration seems to have an issue with the increased broadcast table size, but mine does not.

Your routers need to relay the broadcasts, but are seemingly "overwhelmed" by the increased amount of broadcasts that your coordinator can send now (or the tighter timing), because of the EMBER_BROADCAST_TABLE_SIZE change.

evelant commented 4 months ago

Makes sense, thanks. I'm not sure how I could be sending an excessive amount of broadcasts however. I don't have any sort of automation that continually issues commands or otherwise seems like it could flood the network. The most chatty automation is adaptive lighting which only updates every 90 seconds and only if the lights are already on. Most of the time they're off because they're turned on when radar presence sensors in a room detect occupancy. That's why this issue is particularly annoying -- people keep walking into rooms and having only 5 out of 10 bulbs turning on. I'll have to see if I can get a packet capture for more debug information.

dmulcahey commented 4 months ago

The packet capture should show us exactly what’s going on.

Are the radar devices Tuya devices? I’m wondering if the entire network is flooded and it’s not just broadcast issues. I understand the issue wasn’t like this before the new firmware but maybe it’s the straw that broke the camel’s back so to say… that, and we have seen many instances of Tuya devices either completely spamming a network to death or introducing lots of routing issues.

It could be possible that we are overloading some of the routers causing them to crash as well.

MattWestb commented 4 months ago

As said many times before, playing with the broadcast table in the coordinator only causes problems, and Silabs has locked it in the stack so it cannot be changed in the SS GUI. One interesting old code snippet: https://github.com/yqyunjie/Zigbee-Project/blob/2bb294718ac2652fc98c5de08fb4bbd417680e1a/firmware/EmberZNet/EM35x-EZSP/stack/config/ember-configuration-defaults.h#L394-L401

Also, if controller devices are working OK in the network (certified ones, and no Tuya or Aqara), then the network is OK with all broadcasts. If tweaking the coordinator causes problems, then the network is blocking a broadcast storm, as it should to avoid killing itself. Spamming broadcasts also causes problems with unicasts, because route discovery does not work if some routers are blocked => complete network breakdown.

If sniffing, look for address 0xffff or others in the 0xfffX range; these are broadcasts to different types of network devices, and only 9 in 8 seconds can be handled; the rest are silently ignored. Also test with and without source routing and see how it looks. Some broadcasts for discovery are only one hop, so they do not spam as much, but they do not work if the network is blocking all broadcasts, and the routers lose their knowledge of the topology.
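
As a rough way to apply the "9 in 8 seconds" rule of thumb to a capture, the sketch below counts broadcasts in a sliding window. It assumes the capture has already been exported to (timestamp, destination address) pairs; that input format, the function name and the sample data are hypothetical, and the 0xfffX address match follows the description above.

```python
def find_broadcast_bursts(packets, limit=9, window=8.0):
    """Return timestamps at which more than `limit` broadcasts fall within `window` seconds.

    `packets` is an iterable of (timestamp_seconds, destination_address) pairs.
    """
    # Treat 0xFFFF and the other 0xFFFx destinations as broadcasts.
    times = sorted(t for t, dst in packets if (dst & 0xFFF0) == 0xFFF0)
    bursts = []
    start = 0
    for end, t in enumerate(times):
        # Slide the left edge forward until the window spans less than `window` seconds.
        while t - times[start] >= window:
            start += 1
        if end - start + 1 > limit:
            bursts.append(t)
    return bursts


# Twelve broadcasts 0.5 s apart plus one unicast, which is ignored.
capture = [(i * 0.5, 0xFFFF) for i in range(12)] + [(0.1, 0x1234)]
print(find_broadcast_bursts(capture))  # prints: [4.5, 5.0, 5.5]
```

Each returned timestamp marks a broadcast that pushed the window over the limit, so those are the places in the capture worth inspecting first.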

evelant commented 4 months ago

I don't have any tuya or other uncertified devices. Routers in my network are Inovelli Blue series switches, Sonoff SNZB-06P radars, Innr AE-270T bulbs, and a few ikea bulbs, all of which should be well behaved.

goncalossilva commented 4 months ago

I have also noticed a massive improvement with the new firmware. I have 81 zigbee devices overall, and one particular automation, closing all blinds at sunset, always failed. I had been struggling with it for months, and none of the typical troubleshooting helped. Some blinds would randomly not close, every single time, and now they all do. I've only noticed a hiccup or two in the past couple of weeks, so a major improvement.

Interestingly, HA still reports the automation as having failed due to EmberStatus.DELIVERY_FAILED: 102, and some blinds appear as “last seen XX hours ago” where XX is a large number (say, 12 hours). But it does work in practice.

Interestingly per the discussion above, my blinds switches are tuya. I wonder if this is related to the flakiness I've experienced. Could it make sense to force them to act as end devices? I certainly have enough routers. Is that possible?

evelant commented 3 months ago

I just ordered a skyconnect so I can use it as a sniffer. I'm wondering if https://github.com/NabuCasa/silabs-firmware-builder/pull/57 might help with the group command issues?

evelant commented 2 months ago

@puddly I switched my network to a zbdongle-p and zigbee2mqtt and the "not all lights in group turn on" problem is still happening. That rules out zha and ember controllers as sources of the problem. I must have something else in my hass install messing up the command or a device on my network causing problems.

stp-ip commented 2 months ago

For what it's worth.

I ran with a Conbee II on a Raspi 3B and around 230 devices. A few delays here and there, but overall it worked. Migrated to a Home Assistant Yellow and the migration failed, so I re-added all devices. This failed, so I started a new network from scratch instead. Couldn't reliably add new devices. Initializations kept coming up for devices multiple times, sometimes even looping between leaving and joining the network. Sensors and even routers were dropping off and almost no actual actions worked reliably. I got 3 new Sonoff Dongle E devices as I thought it might be a router issue. Lo and behold, all Extenders (3 Dongle E and 4 Aeotec Range Extender Zi) have RSSI of under 60. This is a good improvement from before, where it would periodically get worse reception etc.

Still the network was not even barely functioning. Tried forcing a reorg by leaving HA off for a few hours, but that didn't improve anything.

End result is I plugged in the old Conbee II, migrated without issues and everything is working. Thanks to the new Dongle Es no request delays anymore and everything feels a lot snappier (could also be the move to the CM4 instead of the Raspi 3B). Either way. Nothing in the network changed. Same spot, same network settings, same channel, same extenders, same devices. Only difference is moving to the Conbee II. It's a difference of night and day in a matter of a few minutes. Due to devices falling off again and again with the on board chip the Conbee II is handling a lot more devices than the chip ever did.

Not sure logs help much. But I got a small one and a 650MB one, which I can't upload, but happy to provide, if helpful. home-assistant_zha_2024-09-02T19-49-50.444Z.log

evelant commented 2 months ago

I think my problem must be these Innr AE-270T bulbs. I guess they don't respond properly to group commands since it only seems to be happening to those bulbs and it happens with two entirely different coordinators and software stacks. @johnlento any chance you're also having the group issues with Innr bulbs? Or do you get it with different devices?

johnlento commented 2 months ago

Sorry it took so long to get back to you. I swapped to a TubesZB EFR32 MGM24 PoE Coordinator 2024 which is the same chipset as my ZBDongle-E (Silicon Labs) and no matter what I do with ZHA or HA light groups, none respond at the same time. The only time I ever got a group to respond all at once was when I was on the ZBDongle-P (Texas Instruments). It would flawlessly turn on the entire group at the same time. I have all Sengled bulbs so it's not unique to your devices. I can't tell you what firmware or HA or ZHA build it was since I have been down the Silicon Labs rabbit hole for too long now. I am also on stock firmware for TubesZB now and the network is generally working. Sometimes I do get upwards of 20k messages in the queue and have to power cycle everything so that it can start responding again.

puddly commented 2 months ago

Sometimes I do get upwards of 20k messages in the queue and have to powercycle everything so that it can start responding again.

That's very unusual, you should not have 20k enqueued packets.

Can you post a ZHA debug log?

johnlento commented 2 months ago

I will start collecting one; it seems to happen once a week. I am often queued up past max concurrency and the network gets very laggy.

puddly commented 2 months ago

Open a separate issue once you do. Thanks!

evelant commented 2 months ago

@johnlento I have a lot of sengled bulbs as well and have run into similar issues with them. A couple of things that might be helpful:

IIRC sengled bulbs by default set up reporting power usage to the coordinator at a very high rate. This can clog the network easily. I'm not sure if there's a way to turn it off in ZHA but at least with z2m turning off reporting improved things a lot.
Sengled bulbs do not act as routers. Make sure you've got strong router devices to support them. (I use inovelli switches)
Sengled bulbs seem to have wonky firmware and unfortunately AFAIK nobody has had success even contacting sengled to get them to fix firmware issues. IIRC a lot of mine would try to connect directly to the coordinator even if the signal was terrible rather than use a nearby strong router. Also IIRC this behavior led to problems with the max directly connected children setting in the coordinator firmware.

Probably not helpful to your situation but I have completely resolved all of my zigbee issues and have a fast, stable network. What it took was a firmware update for my Innr AE-270T bulbs. After a lot of back and forth with Innr they released a new firmware and after updating my ~40 Innr bulbs my network problems disappeared. This IMO shows that zigbee problems can be the fault of manufacturer firmware totally out of our control. I had similar issues to yours when I had primarily sengled bulbs. Now I only use a few sengled bulbs and mostly Innr and have no problems after their new firmware. I suspect your problems may stem from bad Sengled firmware -- maybe if you're persistent (and lucky) you could prod them into releasing an update to fix them?

johnlento commented 2 months ago

@puddly So I think my network is in that state again: max concurrent requests reached and the queue slowly climbing. It was at 524 queued and is now at 3,000+. It occurred sometime after 10 sengled bulbs went offline for some reason. I can open a separate issue, but the debug log is like 2GB. Is there something I should trim out and submit? Looking for guidance on how to submit the issue.

home-assistant / core