home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Coordinator crash when adding a device #82004

Closed mguaylam closed 1 year ago

mguaylam commented 1 year ago

The problem

Issue

The coordinator crashes when a new device joins or rejoins the network.

ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart

Steps to reproduce the issue

Add a new device to the network or reapply mains power to it.

The controller fails immediately. Home-Assistant may take more or less time to reset it, and the controller can stay stuck indefinitely after a new device is added, until it is reset. Restarting Home-Assistant entirely usually helps.

Effect

Philips Hue lights start flashing in a synchronized pattern while the controller is in the failed state. IKEA devices behave similarly, but with a stroboscopic effect. The coordinator is unreachable, so it is not possible to control devices.

ezgif-4-fa36b502b4

Coordinator

Generic Aliexpress EFR32MG21 : https://www.aliexpress.com/item/1005003578599189.html?spm=a2g0o.order_list.0.0.46f01802MxU13p

Hue light

What version of Home Assistant Core has the issue?

core-2022.11.2

What type of installation are you running?

Home Assistant Container

Integration causing the issue

ZHA

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

home-assistant.log

Here’s my network backup, as I wonder whether something is wrong with it: ZHA backup 2022-11-12T16-58-07.216Z.txt

Example YAML snippet

zha:
  custom_quirks_path: /config/quirks/
  zigpy_config:
    #source_routing: true
    ezsp_config:
      #CONFIG_SOURCE_ROUTE_TABLE_SIZE: 150 // Used to have source routing but disabling it does not change the behaviour.
      CONFIG_APS_ACK_TIMEOUT: 8000
      CONFIG_ADDRESS_TABLE_SIZE: 8
      CONFIG_APS_UNICAST_MESSAGE_COUNT: 12
    ota:
      ikea_provider: true
      otau_directory: /config/OTAU/
home-assistant[bot] commented 1 year ago

Hey there @dmulcahey, @adminiuga, @puddly, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!


MattWestb commented 1 year ago

It's the first-generation Sonoff MG21 coordinator, and it should work with the standard firmware from the repo.

Has the network been in production for a long time, or was it newly formed? It sounds to me like something bad is stored in the key storage in flash, and it gets triggered when the coordinator tries to add new devices to the token storage (that should not be needed after the network is formed, since bellows uses hashed TC link keys).

Sonoff made a hotfix for this: users who ran EZSP 7.x, which extends the NVM used for token storage, and then went back to EZSP 6.10.3.0 or earlier found the coordinator crashing. They are working on updated firmware, but as a hotfix they made a GBL file that deletes the NVM (it overwrites it so the NCP creates a new, clean one), after which the 6.x firmware works OK.

If you do not have too many devices, try the fix by flashing https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/nvm3_initfile.gbl. I think it should be safe to let ZHA restore the network after flashing it.

The issue and the GBL file: https://github.com/xsp1989/zigbeeFirmware/issues/28#issuecomment-1279652738.

mguaylam commented 1 year ago

Hey there @MattWestb ! Thank you for looking into my issue! 😃 To answer your questions: my network has been in production for a long time, now over 2 years. It started with a HUSBZB-1, but that obviously did not cut it as the network grew. That’s when I moved to the EFR32MG21 chipset, about 6 months ago, and I first saw the issue 1 month ago when I purchased a new Philips Hue bulb.

For the NVM, is it something that is written by the network backup? I ask because I can observe this issue with 2 different coordinators running firmware from different providers. I can erase the NVM portion in question with the proposed firmware, but considering I see this issue with 2 different coordinators, could it really be involved? If I erase this portion, what would be the consequences?

I can do a network analysis if needed, but for now I’m not entirely sure where the issue resides. Is it definitely the coordinator, or could Home-Assistant be involved? I have overcome lots of issues on my network over time, but this one hits me quite hard as it renders my whole network unusable and I can’t pinpoint where the issue resides. It can also happen when you don’t expect it, with the added problem that the lights start to flash.

Adminiuga commented 1 year ago

You need to provide the debug logs. The very same chipset just works fine with Yellow and SkyBlue hardware. So it leaves two possibilities:

mguaylam commented 1 year ago

I posted this in the initial post but I might be misunderstanding what is meant by debug log, sorry if it is not what you are asking for.

home-assistant(1).log

This was generated with the following :

logger:
  default: info
  logs:
    homeassistant.core: debug
    homeassistant.components.zha: debug
    bellows.zigbee.application: debug
    bellows.ezsp: debug
    zigpy: debug
    zigpy_deconz.zigbee.application: debug
    zigpy_deconz.api: debug
    zigpy_xbee.zigbee.application: debug
    zigpy_xbee.api: debug
    zigpy_zigate: debug
    zigpy_znp: debug
    zhaquirks: debug

If you need other occurrences, I have several to look at. 😸

Adminiuga commented 1 year ago

The dongle crashes with RESET_ASSERT, ZHA starts initialization, and in the middle of re-initialization it gets another assert error from the dongle, which then stops responding. I don't know if they have a newer firmware; I'd try that first.

puddly commented 1 year ago

Can you try this configuration?

zha:
  zigpy_config:
    ezsp_config:
      CONFIG_ADDRESS_TABLE_SIZE:              16  # FW:  32, ZHA:  16
      CONFIG_MULTICAST_TABLE_SIZE:             8  # FW:   8, ZHA:  16
      CONFIG_PACKET_BUFFER_COUNT:            250  # FW: 250, ZHA: 255
      CONFIG_SOURCE_ROUTE_TABLE_SIZE:         16  # FW: 200, ZHA:  16
      CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE:  2  # FW:   0, ZHA:   2

Related: https://github.com/itead/Sonoff_Zigbee_Dongle_Firmware/issues/10

If this indeed fixes the problem and you can reliably reproduce the crash, it would be super helpful if you could help us narrow down which of the config options solves the problem. Something about Sonoff's build of EmberZNet is unstable.

mguaylam commented 1 year ago

So, currently I am using the https://www.aliexpress.com/item/1005003578599189.html?spm=a2g0o.order_list.0.0.46f01802MxU13p with the https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/ncp-uart-sw_v6.10.3_115200.gbl firmware.

Just tried the configuration you gave me. Unfortunately, it still did it. They mention those parameters in the firmware : https://github.com/xsp1989/zigbeeFirmware/tree/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP

Configuration Parameter | Value
-- | --
Part | EFR32MG21A020F768IM32
Version | EZSP 6.10.3.0
CTUNE value | 128
Address Table Size | 32
Child Table Size | 32
Source Routes | 200
TX | PB01
RX | PB00

Here is the log : home-assistant(config puddly).log

I can see that the NCP failed : NCP entered failed state. Requesting APP controller restart

But also that there is no memory available for this configuration : Couldn't set EzspConfigId.CONFIG_PACKET_BUFFER_COUNT=250 configuration value: EzspStatus.ERROR_OUT_OF_MEMORY
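
For anyone repeating this check on their own log, here is a minimal sketch of how the two messages quoted above could be pulled out of a log file. It only assumes the line formats shown in this comment; the function name `summarize_log` and the sample list are illustrative and not part of bellows or Home Assistant.

```python
import re

# Matches the "Couldn't set ..." line format quoted in this comment.
CONFIG_FAILURE = re.compile(
    r"Couldn't set EzspConfigId\.(?P<option>\w+)=(?P<value>\d+) "
    r"configuration value: EzspStatus\.(?P<status>\w+)"
)
NCP_FAILED = "NCP entered failed state"


def summarize_log(lines):
    """Return (rejected config writes, number of NCP failure messages)."""
    rejected = []
    ncp_failures = 0
    for line in lines:
        match = CONFIG_FAILURE.search(line)
        if match:
            rejected.append(
                (match["option"], int(match["value"]), match["status"])
            )
        if NCP_FAILED in line:
            ncp_failures += 1
    return rejected, ncp_failures


# Hypothetical sample built from the two messages quoted in this thread.
sample = [
    "ERROR (MainThread) [bellows.ezsp] NCP entered failed state. "
    "Requesting APP controller restart",
    "Couldn't set EzspConfigId.CONFIG_PACKET_BUFFER_COUNT=250 "
    "configuration value: EzspStatus.ERROR_OUT_OF_MEMORY",
]
rejected, ncp_failures = summarize_log(sample)
print(rejected)
print(ncp_failures)
```

To scan a real file, pass an open file object instead of the sample list, since iterating over it yields lines in the same way.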

As for the firmware, I can always try any of the following : https://github.com/xsp1989/zigbeeFirmware/tree/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP There is the 655 (Zigbee EmberZNet 6.5.5) which is on the EZSP v7.

Thank you for your kind help! 😺

MattWestb commented 1 year ago

If you install the linked EZSP 7.x and later want to go back to a lower version, you must erase the NVM / tokens with the GBL file I mentioned before, or the NCP will crash.

You can also try one of Gary's builds for the Sonoff ZBBridge, which has the same Zigbee module and should be pin compatible (they are made for a hacked ZBBridge without the firmware signing that only the ZBBridge uses, so they should work on the sticks too). https://github.com/grobasoz/zigbee-firmware/tree/master/Sonoff-ZBBridge.

PS: EZSP 6.5.5 works but is not recommended because it has many bad bugs, as do 6.8.x and 6.9.x. The well-working, recommended versions are 6.10.x and late 6.7.x.

TheJulianJES commented 1 year ago

Does the issue only happen with that particular bulb? Although it very likely isn't the issue, were you ever able to pair it via Bluetooth to your phone? Firmware version 1.76.11 is almost two years old and there should be newer firmware available. I wasn't able to find the firmware for your bulb, as the image_type_id : 65535 you mentioned doesn't exist. But you should be able to update it through the Philips Hue app using your phone.

mguaylam commented 1 year ago

@TheJulianJES at first I believed it was this bulb in particular, but anything I pair (e.g. an IKEA bulb) does the same thing. I was able to connect the Philips Hue bulb to my phone and update it through Bluetooth; it now says it is at the latest version, firmware 1.93.11.

The issue still persists, so I am led to believe it comes from the coordinator rather than the bulb.

How to diagnose this is a little bit harder; I'm not sure where to look.

I can capture the network activity, but I don’t know if my answers will be there.

One thing I found interesting: someone was kind enough to show me their coordinator backup, and I found one difference between mine and theirs. I should note that a while back I migrated from one coordinator to another with zigpy, since migration was not yet implemented in Home-Assistant.

I can see that mine is missing this portion :

"stack_specific": {
            "ezsp": {
                "hashed_tclk": "**redacted**"

Would that be normal?

I can see in my first ever back-up the tc_link_key was indeed present. Is it why new devices have such a hard time joining but not older ones?
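
As a quick way to check a backup file for this field, here is a small sketch that searches a parsed backup for `stack_specific` and reports whether it carries an `ezsp.hashed_tclk` entry. The nesting under `network_info` in the sample data is an assumption made for illustration; the search is recursive precisely so it does not depend on where the key sits. The helper names are hypothetical and not part of zigpy.

```python
import json


def find_key(node, key):
    """Depth-first search for `key` in nested dicts/lists; return its value or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def has_hashed_tclk(backup):
    """True if the backup contains stack_specific -> ezsp -> hashed_tclk."""
    stack_specific = find_key(backup, "stack_specific") or {}
    return "hashed_tclk" in stack_specific.get("ezsp", {})


# Illustrative documents; the "network_info" wrapper is an assumed layout.
with_key = json.loads(
    '{"network_info": {"stack_specific": {"ezsp": {"hashed_tclk": "**redacted**"}}}}'
)
without_key = json.loads('{"network_info": {"stack_specific": {}}}')

print(has_hashed_tclk(with_key))
print(has_hashed_tclk(without_key))
```

For a real file, replace the `json.loads(...)` calls with `json.load(open(path))` on the downloaded backup.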

MattWestb commented 1 year ago

I think your backup was made with normal TC link keys stored in the chip's NVM / token storage in flash, and it runs into problems when it is restored on the new coordinator, which uses hashed TC link keys.

@puddly Does restoring a backup from another (non-EZSP) coordinator onto an EZSP one work OK? And what happens if the backed-up EZSP network was formed with an EZSP that did not use hashed TC link keys (an old install with an EM35x coordinator) and the backup is restored on a new one with no network formed in the chip, which forms a hashed TC link key network?

Maybe try the ZHA Toolkit service ezsp_clear_keys to delete the saved TC link keys? (This means rejoining all devices that had TC link keys, i.e. all Zigbee 3.0 devices.)

dmulcahey commented 1 year ago

You can retroactively apply hashed link key settings. I’ve done it. I’ll dig the commands up later

puddly commented 1 year ago

All you need to do to upgrade to hashed link keys is to click the "Migrate" button and reconfigure the current radio. If you restore the most recent backup, it'll upgrade you to a hashed link key automatically when re-forming the network.

mguaylam commented 1 year ago

@dmulcahey I’d be very happy if that’s what I need to solve my issue! @puddly that is strange, because even though I migrated several times with the latest Home-Assistant, I did not see the hashed link key appear in my backup. Is it supposed to?

puddly commented 1 year ago

Ah, I forgot it won't actually perform a restore if the current settings are identical to the new settings.

You will have to leave the current network first, either by:

ZHA will auto-restore in the second scenario.

mguaylam commented 1 year ago

Hey @puddly ! Thanks for helping me out! I just did a reset with the zigpy CLI, but when it formed the network again, the key was still absent from the backup.

bash-5.1# zigpy radio --baudrate 115200 ezsp /dev/ttyUSB0 reset

I’m not sure why.

"stack_specific": {},

Since I have my original key, could I write it in the backup and restore from it?

puddly commented 1 year ago

Something isn't adding up. Can you post a full debug log of the backup and restore?

$ zigpy -vvv radio --baudrate 115200 ezsp /dev/ttyUSB0 backup -z > backup.json
$ cat backup.json
$ zigpy -vvv radio --baudrate 115200 ezsp /dev/ttyUSB0 restore backup.json
mguaylam commented 1 year ago

zigpy backup.txt backup.txt zigpy restore.txt

puddly commented 1 year ago

Thanks. According to the restore, one was written:

... stack_specific={'ezsp': {'hashed_tclk': 'a2473867c61c6c4e43e764b18dc95164'}} ...

Can you do another backup to confirm?

mguaylam commented 1 year ago

Sooo strange. Now it did :

        "stack_specific": {
            "ezsp": {
                "hashed_tclk": "a2473867c61c6c4e43e764b18dc95164"
            }
        },

Is something broken in Zigpy?

puddly commented 1 year ago

The exact same code is used by network formation, backup restoration, and the ZHA config flow so I think either the original network was never cleared or your browser may have cached the downloaded backup.

Adminiuga commented 1 year ago

Does it still crash?

mguaylam commented 1 year ago

It does. 😞 I removed the power from the newer bulb and re-applied it and the coordinator crashed : NCP entered failed state. Requesting APP controller restart

ControllerApplication reset unsuccessful: TimeoutError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 643, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 665, in _reset_controller
    await self.initialize()
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 76, in initialize
    await self.load_network_info(load_devices=False)
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 257, in load_network_info
    brd_manuf, brd_name, version = await self._get_board_info()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 117, in _get_board_info
    return await self._ezsp.get_board_info()
  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/__init__.py", line 299, in get_board_info
    (value,) = await self.getMfgToken(token)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

home-assistant_crash_2022_11_20.log

There is no crash when I do the same for older devices of the network.

I also can’t add new devices to the coordinator following the restore by zigpy cli.

@MattWestb : I’ll try erasing the NVM portion tomorrow as you recommended. You say it is used to store keys?

MattWestb commented 1 year ago

@mguaylam If Puddly does not find another way, I think flashing the NVM fix is one option. Reflashing EZSP 6.10.3 at the same time, while you have it hooked up, may also be a good idea (but flash EZSP first and then the NVM fix).

The NVM fix writes an empty file over the area of flash that holds the token storage. When the SoC boots after the flash, it creates a new, clean NVM that should be OK, and all old tokens are gone.

I can see that you wrote a new IEEE address when changing from the EM358X coordinator. That should not cause any problems as long as the old coordinator is not online within your radio range.

From a quick look at the log, I find it a little strange that the system reads the manufacturing tokens many times (the first time it has problems). Normally that only happens when initializing the coordinator, and perhaps when Puddly's code makes a new backup. There are also many timeouts, but it is not easy to tell whether the coordinator or a slow system causes them. Source routing is also not working properly: the coordinator is not communicating with 100% of the devices and cannot get route records from devices that are not online (they are in the network and have the network key, but they have not synced the frame counter with the coordinator, and their TC link key can be wrong).

mguaylam commented 1 year ago

I flashed the 6.10.3 firmware from : https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/ncp-uart-sw_v6.10.3_115200.gbl then https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/nvm3_initfile.gbl but the coordinator was not responding correctly on the serial connection. I did the inverse and it worked.

I still can’t add new devices now, which is very strange. I also confirmed the coordinator still crashes: home-assistant_after_erasing_NVM+crash.log

I did indeed come from an EM358X (HUSBZB-1), which I got rid of a while ago. My system is a full-fledged server, so I’d be surprised if performance issues came from there.

puddly commented 1 year ago

Received _reset_controller_application frame with (<NcpResetCode.RESET_ASSERT: 6>,) yet again.

Can you read out the current configuration of your adapter? bellows --baudrate 115200 -d /dev/... config -a.

mguaylam commented 1 year ago
bash-5.1# bellows --baudrate 115200 -d /dev/ttyUSB0 config -a
NOTE: Configuration changes do not persist across resets
CONFIG_PACKET_BUFFER_COUNT=75
CONFIG_NEIGHBOR_TABLE_SIZE=26
CONFIG_APS_UNICAST_MESSAGE_COUNT=20
CONFIG_BINDING_TABLE_SIZE=32
CONFIG_ADDRESS_TABLE_SIZE=32
CONFIG_MULTICAST_TABLE_SIZE=8
CONFIG_ROUTE_TABLE_SIZE=16
CONFIG_DISCOVERY_TABLE_SIZE=8
CONFIG_STACK_PROFILE=0
CONFIG_SECURITY_LEVEL=5
CONFIG_MAX_HOPS=30
CONFIG_MAX_END_DEVICE_CHILDREN=32
CONFIG_INDIRECT_TRANSMISSION_TIMEOUT=3000
CONFIG_END_DEVICE_POLL_TIMEOUT=8
CONFIG_TX_POWER_MODE=0
CONFIG_DISABLE_RELAY=0
CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE=0
CONFIG_SOURCE_ROUTE_TABLE_SIZE=200
CONFIG_FRAGMENT_WINDOW_SIZE=1
CONFIG_FRAGMENT_DELAY_MS=0
CONFIG_KEY_TABLE_SIZE=12
CONFIG_APS_ACK_TIMEOUT=1600
CONFIG_ACTIVE_SCAN_DURATION=3
CONFIG_END_DEVICE_BIND_TIMEOUT=60
CONFIG_PAN_ID_CONFLICT_REPORT_THRESHOLD=2
CONFIG_REQUEST_KEY_TIMEOUT=0
CONFIG_CERTIFICATE_TABLE_SIZE=0
CONFIG_APPLICATION_ZDO_FLAGS=0
CONFIG_BROADCAST_TABLE_SIZE=128
CONFIG_MAC_FILTER_TABLE_SIZE=0
CONFIG_SUPPORTED_NETWORKS=1
CONFIG_SEND_MULTICASTS_TO_SLEEPY_ADDRESS=0
CONFIG_ZLL_GROUP_ADDRESSES=1
CONFIG_ZLL_RSSI_THRESHOLD=128
CONFIG_MTORR_FLOW_CONTROL=1
CONFIG_RETRY_QUEUE_SIZE=16
CONFIG_NEW_BROADCAST_ENTRY_THRESHOLD=122
CONFIG_TRANSIENT_KEY_TIMEOUT_S=300
CONFIG_BROADCAST_MIN_ACKS_NEEDED=255
CONFIG_TC_REJOINS_USING_WELL_KNOWN_KEY_TIMEOUT_S=300
CONFIG_CTUNE_VALUE=128
CONFIG_ASSUME_TC_CONCENTRATOR_TYPE=1
puddly commented 1 year ago

Here are all of the changed config options:

#                                               default  =>  changed
CONFIG_ADDRESS_TABLE_SIZE:                           32  =>    16
CONFIG_MULTICAST_TABLE_SIZE:                          8  =>    16
CONFIG_PACKET_BUFFER_COUNT:                          75  =>   255
CONFIG_SOURCE_ROUTE_TABLE_SIZE:                     200  =>    16
CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE:               0  =>     2

# Timeouts, probably not affecting anything
CONFIG_INDIRECT_TRANSMISSION_TIMEOUT:              3000  =>  7680
CONFIG_TC_REJOINS_USING_WELL_KNOWN_KEY_TIMEOUT_S:   300  =>    90

Can you try resetting the five above to their default values?
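
To make the comparison repeatable, here is a minimal sketch that derives the list of changed options from two dictionaries. The default values are copied from the `bellows config -a` output posted above and the overrides from the list in this comment; `CONFIG_SECURITY_LEVEL` is included only as an unchanged control, and the helper name `changed_options` is illustrative rather than part of bellows or zigpy.

```python
# Firmware values reported by `bellows config -a` on the generic EFR32MG21.
firmware_defaults = {
    "CONFIG_ADDRESS_TABLE_SIZE": 32,
    "CONFIG_MULTICAST_TABLE_SIZE": 8,
    "CONFIG_PACKET_BUFFER_COUNT": 75,
    "CONFIG_SOURCE_ROUTE_TABLE_SIZE": 200,
    "CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE": 0,
    "CONFIG_INDIRECT_TRANSMISSION_TIMEOUT": 3000,
    "CONFIG_TC_REJOINS_USING_WELL_KNOWN_KEY_TIMEOUT_S": 300,
    "CONFIG_SECURITY_LEVEL": 5,
}

# Values applied on top of the firmware defaults, as listed in this comment.
applied_overrides = {
    "CONFIG_ADDRESS_TABLE_SIZE": 16,
    "CONFIG_MULTICAST_TABLE_SIZE": 16,
    "CONFIG_PACKET_BUFFER_COUNT": 255,
    "CONFIG_SOURCE_ROUTE_TABLE_SIZE": 16,
    "CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE": 2,
    "CONFIG_INDIRECT_TRANSMISSION_TIMEOUT": 7680,
    "CONFIG_TC_REJOINS_USING_WELL_KNOWN_KEY_TIMEOUT_S": 90,
    "CONFIG_SECURITY_LEVEL": 5,  # same as the default, so it is filtered out
}


def changed_options(defaults, overrides):
    """Map option name -> (default, override) for every value that differs."""
    return {
        name: (defaults.get(name), value)
        for name, value in overrides.items()
        if defaults.get(name) != value
    }


changed = changed_options(firmware_defaults, applied_overrides)
for name, (default, override) in changed.items():
    print(f"{name}: {default} => {override}")
```

Reverting an option then means writing its first tuple element back under `ezsp_config`, which also makes it easier to bisect which single option matters.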

Adminiuga commented 1 year ago

@puddly is your default broadcast table size the same? 128 is humongous table size for broadcasts.

And yeah, I would bump the trust center address cache to at least 2, although this is more for overlapping joins.

MattWestb commented 1 year ago

The official firmware parameters for Sonoff EZSP 6.10.3.0, with CTUNE 128 to fix the mistuned radio: https://github.com/xsp1989/zigbeeFirmware/tree/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP#versions-and-changelog.

mguaylam commented 1 year ago

I’ve just restarted Home-Assistant with those parameters, is that what you asked for? 😄

zha:
  custom_quirks_path: /config/quirks/
  zigpy_config:
    source_routing: true
    ezsp_config:
      CONFIG_ADDRESS_TABLE_SIZE: 32
      CONFIG_MULTICAST_TABLE_SIZE: 8
      CONFIG_PACKET_BUFFER_COUNT: 75
      CONFIG_SOURCE_ROUTE_TABLE_SIZE: 200
      CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE:  0

home-assistant_log_after_changing_coordinator_config.log

Still can’t add new devices. 🫤

puddly commented 1 year ago

The good news is that there is no crash!

Maybe try changing CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2 as @Adminiuga suggested?

mguaylam commented 1 year ago

Oh! It still crashes, sorry, I forgot to test it: home-assistant_log_after_changing_coordinator_config+crash.log

With CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2 :

# ZHA
zha:
  custom_quirks_path: /config/quirks/
  zigpy_config:
    source_routing: true
    ezsp_config:
      CONFIG_ADDRESS_TABLE_SIZE: 32
      CONFIG_MULTICAST_TABLE_SIZE: 8
      CONFIG_PACKET_BUFFER_COUNT: 75
      CONFIG_SOURCE_ROUTE_TABLE_SIZE: 200
      CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2

home-assistant_log_with_TC_CACHE_2.log

(Can’t add + crash)

puddly commented 1 year ago

For reference, here are the zigpy overrides and the stack defaults for every stick config I am able to find:

Option | Zigpy | HUSBZB-1 | Sonoff | Generic EFR32MG21 | SkyConnect
-- | -- | -- | -- | -- | --
ACTIVE_SCAN_DURATION |  | 3 | 3 | 3 | 3
ADDRESS_TABLE_SIZE | 16 | 16 | 32 | 32 | 16
APPLICATION_ZDO_FLAGS | ... | 0 | 0 | 0 | 0
APS_ACK_TIMEOUT |  | 1600 | 1600 | 1600 | 1600
APS_UNICAST_MESSAGE_COUNT |  | 20 | 32 | 20 | 20
ASSUME_TC_CONCENTRATOR_TYPE |  | 1 | 1 | 1 | 1
BINDING_TABLE_SIZE |  | 32 | 32 | 32 | 32
BROADCAST_MIN_ACKS_NEEDED |  | 255 | 255 | 255 | 255
BROADCAST_TABLE_SIZE |  | 15 | 35 | 128 | 15
CERTIFICATE_TABLE_SIZE |  | 0 | 0 | 0 |
CTUNE_VALUE |  | 0 | 133 | 128 | 140
DISABLE_RELAY |  | 0 | 0 | 0 | 0
DISCOVERY_TABLE_SIZE |  | 8 | 8 | 8 | 8
END_DEVICE_BIND_TIMEOUT |  | 60 | 60 | 60 | 60
END_DEVICE_POLL_TIMEOUT | 8 | 8 | 8 | 8 | 8
FRAGMENT_DELAY_MS |  | 0 | 0 | 0 | 0
FRAGMENT_WINDOW_SIZE |  | 1 | 1 | 1 | 1
GP_PROXY_TABLE_SIZE |  |  |  |  | 5
INDIRECT_TRANSMISSION_TIMEOUT | 7680 | 3000 | 3000 | 3000 | 3000
KEY_TABLE_SIZE |  | 12 | 12 | 12 | 12
MAC_FILTER_TABLE_SIZE |  | 0 | 0 | 0 | 2
MAX_END_DEVICE_CHILDREN |  | 32 | 32 | 32 | 32
MAX_HOPS |  | 30 | 30 | 30 | 30
MTORR_FLOW_CONTROL |  | 1 | 1 | 1 | 1
MULTICAST_TABLE_SIZE | 16 | 8 | 8 | 8 | 16
NEIGHBOR_TABLE_SIZE |  | 16 | 26 | 26 | 16
NEW_BROADCAST_ENTRY_THRESHOLD |  | 9 | 29 | 122 | 9
PACKET_BUFFER_COUNT | 255 | 64 | 250 | 75 | 255
PAN_ID_CONFLICT_REPORT_THRESHOLD | 2 | 2 | 2 | 2 | 2
REQUEST_KEY_TIMEOUT |  | 0 | 0 | 0 | 0
RETRY_QUEUE_SIZE |  | 16 | 16 | 16 | 16
ROUTE_TABLE_SIZE |  | 16 | 16 | 16 | 16
SECURITY_LEVEL | 5 | 5 | 5 | 5 | 5
SEND_MULTICASTS_TO_SLEEPY_ADDRESS |  | 0 | 0 | 0 | 0
SOURCE_ROUTE_TABLE_SIZE | 16 | 200 | 200 | 200 | 200
STACK_PROFILE | 2 | 0 | 0 | 0 | 0
SUPPORTED_NETWORKS | 1 | 1 | 1 | 1 | 1
TC_REJOINS_USING_WELL_KNOWN_KEY_TIMEOUT_S | 90 | 300 | 300 | 300 | 300
TRANSIENT_KEY_TIMEOUT_S |  | 300 | 300 | 300 | 300
TRUST_CENTER_ADDRESS_CACHE_SIZE | 2 | 0 | 0 | 0 | 0
TX_POWER_MODE |  | 0 | 0 | 0 | 0
ZLL_GROUP_ADDRESSES |  | 1 | 1 | 1 | 0
ZLL_RSSI_THRESHOLD |  | 128 | 128 | 128 | 216
EZSP_GP_SINK_TABLE_SIZE |  |  |  |  | 0

puddly commented 1 year ago

If it's crashing with firmware defaults, I'm kind of tempted to blame the firmware (or merely "changing" a config value to its current value manages to change something). As a last resort, can you try this configuration that uses small table sizes for everything?

CONFIG_BROADCAST_TABLE_SIZE: 15
CONFIG_ADDRESS_TABLE_SIZE: 16
CONFIG_NEIGHBOR_TABLE_SIZE: 16
CONFIG_MULTICAST_TABLE_SIZE: 16
CONFIG_PACKET_BUFFER_COUNT: 75
CONFIG_SOURCE_ROUTE_TABLE_SIZE: 16
CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2
MattWestb commented 1 year ago

I agree with our Puddly !!

One thing is RF interference, which the Sonoff module is well known for. If I remember right, many users could not join devices when this dongle was released until they moved it. Try moving it away from all USB 3 ports and other possible RF sources with a USB extension cable. You can also put some aluminum foil around the casing, but not the antenna, and connect it to ground on the USB connector.

MattWestb commented 1 year ago

Also, more than 20 devices are joining the network at once, since they have the network key, and I think that is a little too much for the NCP. And I think all of the joining devices need to update their TC link key to the hashed one.

Edit: 28 Device_annce in the log.

mguaylam commented 1 year ago

Just to make sure, I added a USB extension cable. My server only has USB 2 ports, so this was not a big worry for me from the start.

Pairing new devices still does not work, and the coordinator also crashes when new devices rejoin. What’s weird is that I still get this issue with either my (current) generic easyIoT or the SonOff-E. On the easyIoT I also tried both 6.10.3 and 6.7.9, with the same result.

home-assistant_log_small_table_cant_pair+crash.log

zha:
  custom_quirks_path: /config/quirks/
  zigpy_config:
    source_routing: true
    ezsp_config:
      CONFIG_BROADCAST_TABLE_SIZE: 15
      CONFIG_ADDRESS_TABLE_SIZE: 16
      CONFIG_NEIGHBOR_TABLE_SIZE: 16
      CONFIG_MULTICAST_TABLE_SIZE: 16
      CONFIG_PACKET_BUFFER_COUNT: 75
      CONFIG_SOURCE_ROUTE_TABLE_SIZE: 16
      CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2

I also tried re-flashing, creating a new network, then restoring the old network, but the only devices I can add are the ones that were already in the network.

MattWestb commented 1 year ago

With the SonOff-E, a USB extension cable should be more than enough. With the easyIoT it is a little trickier, but I think the problem is not there. Are all your coordinators using the same IEEE address after all the upgrading and downgrading? If not, ZHA may have 2 devices with address 0x0000; the inactive one must be deleted and ZHA restarted for things to work OK.

In this log the NCP is not restarting as much, and only one device is trying to join (before it was 28). But ZHA is making a backup at the same time, and the coordinator also needs to send unicasts to the joining device, which does not work because the routing is completely messed up.

As a result, we see this in the counters from bellows:

MAC_TX_UNICAST_SUCCESS = 436, MAC_TX_UNICAST_RETRY = 75, MAC_TX_UNICAST_FAILED = 11,
APS_DATA_TX_UNICAST_SUCCESS = 253, APS_DATA_TX_UNICAST_RETRY = 75, APS_DATA_TX_UNICAST_FAILED = 24,
ROUTE_DISCOVERY_INITIATED = 11, NEIGHBOR_ADDED = 54, NEIGHBOR_REMOVED = 22,
NWK_DECRYPTION_FAILURE = 74, APS_DECRYPTION_FAILURE = 58,
TYPE_NWK_RETRY_OVERFLOW = 28, PHY_CCA_FAIL_COUNT = 9,
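
To put the unicast counters in proportion, here is a small sketch that parses the counter text above and computes a failure rate per layer. Defining the rate as failed / (success + failed) is an assumption made for illustration, not a metric bellows reports itself.

```python
import re

# Unicast counters copied from the bellows output quoted above.
RAW_COUNTERS = """
MAC_TX_UNICAST_SUCCESS = 436, MAC_TX_UNICAST_RETRY = 75, MAC_TX_UNICAST_FAILED = 11,
APS_DATA_TX_UNICAST_SUCCESS = 253, APS_DATA_TX_UNICAST_RETRY = 75, APS_DATA_TX_UNICAST_FAILED = 24,
"""

# Turn "NAME = value" pairs into a dictionary of integers.
counters = {
    name: int(value) for name, value in re.findall(r"(\w+) = (\d+)", RAW_COUNTERS)
}


def failure_rate(counters, prefix):
    """Share of unicasts that ultimately failed (assumed definition)."""
    ok = counters[f"{prefix}_SUCCESS"]
    failed = counters[f"{prefix}_FAILED"]
    return failed / (ok + failed)


mac_rate = failure_rate(counters, "MAC_TX_UNICAST")
aps_rate = failure_rate(counters, "APS_DATA_TX_UNICAST")
print(f"MAC unicast failure rate: {mac_rate:.1%}")  # 2.5%
print(f"APS unicast failure rate: {aps_rate:.1%}")  # 8.7%
```

The APS failure rate being several times the MAC one is consistent with the point made here that end-to-end delivery, not the radio link to direct neighbors, is where most messages are lost.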

I think you first need to get all the devices already in the system to update their security, so that normal routing and source routing work OK, before adding new devices. Right now the coordinator cannot send all commands to the joining device by unicast, because the security is broken on most devices; APS is also needed for updating the routing table and for source routing commands.

Try opening the network for joining and resetting one router that is directly connected to the coordinator, and get it joined properly. If that goes OK, rejoin more routers through the first good one so that they get their TC link key properly. I think joining through other, not-OK routers can work, but it is very likely to run into problems with unicasts and route requests not working.

Another way is setting up a new HA and ZHA from scratch and moving router by router, and then all end devices, but I think you have more than a few devices, so it is not the first choice. You could also move only some devices to see that it works (I'm 100% sure it does!!).

puddly commented 1 year ago

Are you crashing with the Sonoff-E running ITEAD's stock firmware?

MattWestb commented 1 year ago

= the problem is not the coordinator; it is the network that cannot route traffic. Pairing through routers then fails because unicast is not working, so the coordinator cannot talk to the new device.

mguaylam commented 1 year ago

@puddly yes, I crash with the Itead stock firmware as well. 😣

Adminiuga commented 1 year ago

is ITEAD crashing with the same RESET ASSERT error from coordinator?

mguaylam commented 1 year ago

Yes. 😞

Adminiuga commented 1 year ago

Hrm, 🤔

What hardware are you running it on? Have you restored the network backup on ITead or did you form a new network?

mguaylam commented 1 year ago

It is a HPE Tower Server running Fedora with pods built on Podman. It only has USB 2 ports. I tried an extension cable this morning to be sure it wasn’t that. My Wi-Fi access point is on channel 11 and my ZigBee network is on channel 15. No overlapping.
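
The "no overlapping" statement can be sanity-checked from the channel numbers alone. The sketch below uses the standard 2.4 GHz center-frequency formulas for IEEE 802.15.4 (Zigbee channels 11 to 26) and Wi-Fi (channels 1 to 13); the nominal widths of 22 MHz for Wi-Fi and 2 MHz for Zigbee are simplifying assumptions, and 40 MHz Wi-Fi channels are not modeled.

```python
def zigbee_center_mhz(channel):
    """Center frequency of a 2.4 GHz IEEE 802.15.4 channel (11-26)."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels are 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel):
    """Center frequency of a 2.4 GHz Wi-Fi channel (1-13)."""
    if not 1 <= channel <= 13:
        raise ValueError("only Wi-Fi channels 1 to 13 are handled here")
    return 2407 + 5 * channel


def overlaps(zigbee_channel, wifi_channel, wifi_width_mhz=22, zigbee_width_mhz=2):
    """True if the two channels' nominal bands overlap (assumed widths)."""
    distance = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return distance < (wifi_width_mhz + zigbee_width_mhz) / 2


print(zigbee_center_mhz(15), wifi_center_mhz(11))  # 2425 2462
print(overlaps(15, 11))  # False: the setup described here
print(overlaps(22, 11))  # True: a Zigbee channel that would collide with Wi-Fi 11
```

Under these assumptions, Zigbee channel 15 sits 37 MHz away from the center of Wi-Fi channel 11, well outside its band.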

The coordinator I'm using is the easyIoT on v.6.10.3 built by xsp1989 but I tried the SonOff-E coordinator with ITEAD v.6.10.3 firmware as well in case the coordinator itself was the issue.

I also tried the easyIoT with v.6.7.9 built by xsp1989 in the hope it would be a bug in the 6.10.3 firmware.

As for restoring, I did a restore on both of these coordinators and never started from a new network. I did try forming a new network on the easyIoT before restoring, but the result was the same.

mguaylam commented 1 year ago

I think I'm onto something. I need to analyze more; I can also send the capture file in private. The first thing I see is when I unscrew and re-screw the bulb as shown in the GIF, after which it tries to join the network (0xbd0f) : image

Then I can see an insane amount of network conflicts coming from my 4 Sinope thermostats, and they come very fast. I’m wondering if that’s what crashes my coordinator : image

The thing is, now I’m wondering if my issues started exactly after I added those thermostats to my network. But I have never heard of anyone having such an issue with their Sinope thermostats. What is also strange is that those address conflicts only come from my Sinope thermostats, nothing else.

Then I see a lot of beacons and I’m not sure where they are coming from. I see some from my electricity company's smart meters (but only a few) : image

Then a non-tree link failure from my unscrewed bulb : image

Then the network seems to start : image image

Do Philips bulbs flash like that when there is an address conflict? Would the address conflicts broadcast by the thermostats be the reason the coordinator crashes?

MattWestb commented 1 year ago

If I remember right, old Hue bulbs blink when they are rejoining the network. (They are on the black list of bad Zigbee devices because they do not delete children that have moved to another parent or left the network.)

I don't think the coordinator is the problem at the moment. I think the network is stalling because so many devices are out of sync (all of them have the network key, but the frame counter is not OK and they cannot update everything, because the TC link key is new for all devices). With so many address and routing problems, the mesh network cannot sync or heal. Are the TRVs routers, sleepy end devices, or non-sleepy end devices? If they are end devices, they do no routing at all and only talk with their parents. But as with my Hue issue: if they have moved and the Hue lights report them as belonging to more than one router, then you have a real address conflict (caused by the Hues).

If I were you, I would set up a new network with a different PAN ID, extended PAN ID and network key on another channel (changing the channel is not a must but is safer), then add / move only some devices and see if they work OK. If you sniff, you can see whether you get strange routing problems or network conflicts in the mesh.

Another variant is shutting down all devices, powering on only a few at a time, and trying to get them working / syncing. I think it is the same amount of work if all goes well, but much more if it does not, compared with making a new network.

Also, all devices need to get the new TC link key and must be reset to get it = the same work as making a new network.

MattWestb commented 1 year ago

https://community.silabs.com/s/question/0D51M00007xeFzISAU/what-is-nontree-link-failure-0x02?language=en_US