Closed mguaylam closed 1 year ago
Hey there @dmulcahey, @adminiuga, @puddly, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
It's the first-gen Sonoff MG21 coordinator, and it should work with the standard firmware from the repo.
Has the network been in production for a long time, or is it newly formed? It sounds to me like something bad is stored in the key storage in flash, and it gets triggered when the coordinator tries to add new devices to the token storage (which should not be needed or done after forming the network, since bellows uses hashed TC link keys).
Sonoff made a hotfix for the case where you have run EZSP 7.x, which extends the NVM used for token storage, and then go back to EZSP 6.10.3.0 or earlier: the coordinator crashes. They are working on updated firmware, but as a hotfix they made a GBL file that erases the NVM (overwriting it so the NCP creates a new, clean one), after which the 6.x firmware works OK.
If you do not have too many devices, try the fix by flashing https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/nvm3_initfile.gbl. I think it should be safe to let ZHA restore the network after flashing it.
The issue and the GBL file: https://github.com/xsp1989/zigbeeFirmware/issues/28#issuecomment-1279652738.
Hey there @MattWestb ! Thank you for looking into my issue! 😃 To answer your questions: my network has been in production for a long time, now over 2 years. It started with a HUSBZB-1 but that obviously did not cut it as the network grew. That’s when I moved to the EFR32MG21 chipset, about 6 months ago, and I saw the issue first appear 1 month ago when I purchased a new Philips Hue bulb.
For the NVM, is it something that is written by the network backup? I ask because I can observe this issue with 2 different coordinators running firmware from different providers. I can erase the NVM portion in question with the proposed firmware but, considering I see this issue with 2 different coordinators, could it really be involved in the issue? If I erase this portion, what would be the consequences?
I can do a network analysis if needed but for now, I’m not entirely sure where the issue resides. Would it definitely be the coordinator, or could Home-Assistant be involved? I overcame lots of issues on my network over time, but this one hits me quite hard as it renders my whole network unusable and I can’t pinpoint where the issue resides. It can also happen when you don’t expect it, with the added problem that lights start to flash as well.
You need to provide the debug logs. The very same chipset just works fine with Yellow and SkyBlue hardware. So it leaves two possibilities:
I posted this in the initial post but I might be misunderstanding what is meant by debug log, sorry if it is not what you are asking for.
This was generated with the following :
logger:
  default: info
  logs:
    homeassistant.core: debug
    homeassistant.components.zha: debug
    bellows.zigbee.application: debug
    bellows.ezsp: debug
    zigpy: debug
    zigpy_deconz.zigbee.application: debug
    zigpy_deconz.api: debug
    zigpy_xbee.zigbee.application: debug
    zigpy_xbee.api: debug
    zigpy_zigate: debug
    zigpy_znp: debug
    zhaquirks: debug
If you need other occurrences, I have several to look at. 😸
The dongle crashes with RESET_ASSERT, zha starts initialization, and in the middle of re-initialization it gets another assert error from the dongle and stops responding.
Don't know if they have a newer firmware, I'd try that first.
Can you try this configuration?
zha:
  zigpy_config:
    ezsp_config:
      CONFIG_ADDRESS_TABLE_SIZE: 16 # FW: 32, ZHA: 16
      CONFIG_MULTICAST_TABLE_SIZE: 8 # FW: 8, ZHA: 16
      CONFIG_PACKET_BUFFER_COUNT: 250 # FW: 250, ZHA: 255
      CONFIG_SOURCE_ROUTE_TABLE_SIZE: 16 # FW: 200, ZHA: 16
      CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2 # FW: 0, ZHA: 2
Related: https://github.com/itead/Sonoff_Zigbee_Dongle_Firmware/issues/10
If this indeed fixes the problem and you can reliably reproduce the crash, it would be super helpful if you could help us narrow down which of the config options solves the problem. Something about Sonoff's build of EmberZNet is unstable.
So, currently I am using the https://www.aliexpress.com/item/1005003578599189.html?spm=a2g0o.order_list.0.0.46f01802MxU13p with the https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/ncp-uart-sw_v6.10.3_115200.gbl firmware.
Just tried the configuration you gave me. Unfortunately, it still did it. They mention those parameters in the firmware : https://github.com/xsp1989/zigbeeFirmware/tree/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP
Configuration Parameter | Value
---|---
Part | EFR32MG21A020F768IM32
Version | EZSP 6.10.3.0
CTUNE value | 128
Address Table Size | 32
Child Table Size | 32
Source Routes | 200
TX | PB01
RX | PB00

Here is the log : home-assistant(config puddly).log
I can see that the NCP failed :
NCP entered failed state. Requesting APP controller restart
But also that there is no memory available for this configuration :
Couldn't set EzspConfigId.CONFIG_PACKET_BUFFER_COUNT=250 configuration value: EzspStatus.ERROR_OUT_OF_MEMORY
As for the firmware, I can always try any of the following : https://github.com/xsp1989/zigbeeFirmware/tree/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP There is the 655 (Zigbee EmberZNet 6.5.5) which is on the EZSP v7.
Thank you for your kind help! 😺
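When comparing several logs, it can help to count how often each failure message occurs. The following is a minimal Python sketch, not part of ZHA or bellows: the marker strings are copied from the log excerpts quoted in this thread, while the function name and the marker labels are made up for illustration.

```python
# Substrings copied from log lines quoted in this thread.
# The dictionary keys are hypothetical labels chosen for this sketch.
MARKERS = {
    "ncp_failed": "NCP entered failed state",
    "reset_assert": "RESET_ASSERT",
    "out_of_memory": "ERROR_OUT_OF_MEMORY",
    "reset_timeout": "ControllerApplication reset unsuccessful",
}


def count_crash_markers(lines):
    """Return how many lines contain each marker substring."""
    counts = {name: 0 for name in MARKERS}
    for line in lines:
        for name, needle in MARKERS.items():
            if needle in line:
                counts[name] += 1
    return counts


sample = [
    "ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart",
    "Couldn't set EzspConfigId.CONFIG_PACKET_BUFFER_COUNT=250 configuration value: EzspStatus.ERROR_OUT_OF_MEMORY",
    "Received _reset_controller_application frame with (<NcpResetCode.RESET_ASSERT: 6>,)",
]
result = count_crash_markers(sample)
print(result)
```

For a real log, pass an open file object instead of the sample list, for example `count_crash_markers(open("home-assistant.log"))`.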
If you install the linked EZSP 7.x and later want to go back to a lower version, you must erase the NVM / tokens with the GBL file I mentioned before, or the NCP will crash.
You can also try one of Gary's builds for the Sonoff ZBB, which has the same Zigbee module and should be pin-compatible (they are for hacked ZBBs, without the signing that only the ZBB uses, and should work on the sticks too). https://github.com/grobasoz/zigbee-firmware/tree/master/Sonoff-ZBBridge.
PS: EZSP 6.5.5 works but is not recommended because it has many bad bugs, as do 6.8.x and 6.9.x. The versions that work well and are recommended are 6.10.x and late 6.7.x.
Does the issue only happen with that particular bulb? Although it very likely isn't the issue, were you ever able to pair it via Bluetooth to your phone? Firmware version 1.76.11 is almost two years old and there should be newer firmware available.
I wasn't able to find the firmware for your bulb, as the image_type_id : 65535 you mentioned doesn't exist.
But you should be able to update it through the Philips Hue app using your phone.
@TheJulianJES at first I believed it was this bulb in particular, but anything I pair (e.g. an IKEA bulb) does the same thing. I was able to connect the Philips Hue bulb to my phone and update it through Bluetooth; it now says it is at the latest version. It now has the firmware : 1.93.11.
The issue still persists, so I am led to believe the issue is not from the bulb but rather the coordinator.
Now, how I can diagnose this, it’s a little bit harder, I'm not sure where to look.
I can capture the network activity, but I don’t know if my answers will be there.
One thing I found interesting, someone was kind enough to show me their coordinator backup and I found one difference between mine and his, I have to note that a while back, I did migrate from one coordinator to another with zigpy since it was not yet implemented in Home-Assistant.
I can see that mine is missing this portion :
"stack_specific": {
    "ezsp": {
        "hashed_tclk": "**redacted**"
Would that be normal?
I can see in my first ever back-up the tc_link_key was indeed present. Is it why new devices have such a hard time joining but not older ones?
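Checking a backup for the key can be scripted. This is a minimal sketch that assumes only the `stack_specific` / `ezsp` / `hashed_tclk` nesting shown in the excerpt above; the helper name is hypothetical, and the rest of the backup file format is not modeled here.

```python
import json


def hashed_tclk(backup):
    """Return the hashed TC link key from a parsed backup, or None if absent."""
    return backup.get("stack_specific", {}).get("ezsp", {}).get("hashed_tclk")


# A backup fragment with the key present (value redacted, as in the excerpt).
with_key = json.loads('{"stack_specific": {"ezsp": {"hashed_tclk": "**redacted**"}}}')
# A backup fragment with an empty "stack_specific" object, as reported in this thread.
without_key = json.loads('{"stack_specific": {}}')

print(hashed_tclk(with_key))
print(hashed_tclk(without_key))
```

Chaining `.get()` with empty-dict defaults keeps the lookup from raising when any level of the nesting is missing.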
I think your backup was made with normal TC link keys stored in the chip's NVM / token storage in flash, and it runs into problems when restored on the new coordinator, which uses hashed TC link keys.
@puddly Does restoring from another (non-EZSP) coordinator to an EZSP one work OK? And what happens if the backed-up EZSP network was formed on an EZSP that was not using hashed TC link keys (an old install with an EM35X coordinator) and the backup is restored on a new one with no network formed in the chip, which means forming a hashed TC link key network?
Maybe try the ZHA Toolkit service ezsp_clear_keys to delete the saved TC link keys?
(That means rejoining all devices that had TC link keys, i.e. all ZB3 devices.)
You can retroactively apply hashed link key settings. I’ve done it. I’ll dig the commands up later
All you need to do to upgrade to hashed link keys is to click the "Migrate" button and reconfigure the current radio. If you restore the most recent backup, it'll upgrade you to a hashed link key automatically when re-forming the network.
@dmulcahey i’d be very happy if that’s what I need to solve my issue! @puddly that is strange, because even if I did migrate several times with the latest Home-Assistant, I did not see the hashed link key appear in my back-up. Is it supposed to?
Ah, I forgot it won't actually perform a restore if the current settings are identical to the new settings.
You will have to leave the current network first, either by:
pip install git+https://github.com/zigpy/zigpy-cli.git && zigpy radio --baudrate 115200 ezsp /dev/... reset
pip install bellows && bellows --baudrate 115200 -d /dev/... leave
ZHA will auto-restore in the second scenario.
Hey @puddly ! Thanks for helping me out!
I just did a reset with zigpy cli but when it formed the network again, the key is still absent from the backup.
bash-5.1# zigpy radio --baudrate 115200 ezsp /dev/ttyUSB0 reset
I’m not sure why.
"stack_specific": {},
Since I have my original key, could I write it in the backup and restore from it?
Something isn't adding up. Can you post a full debug log of the backup and restore?
$ zigpy -vvv radio --baudrate 115200 ezsp /dev/ttyUSB0 backup -z > backup.json
$ cat backup.json
$ zigpy -vvv radio --baudrate 115200 ezsp /dev/ttyUSB0 restore backup.json
Thanks. According to the restore, one was written:
... stack_specific={'ezsp': {'hashed_tclk': 'a2473867c61c6c4e43e764b18dc95164'}} ...
Can you do another backup to confirm?
Sooo strange. Now it did :
"stack_specific": {
    "ezsp": {
        "hashed_tclk": "a2473867c61c6c4e43e764b18dc95164"
    }
},
Is something broken in Zigpy?
The exact same code is used by network formation, backup restoration, and the ZHA config flow so I think either the original network was never cleared or your browser may have cached the downloaded backup.
Does it still crash?
It does. 😞 I removed the power from the newer bulb and re-applied it and the coordinator crashed :
NCP entered failed state. Requesting APP controller restart
ControllerApplication reset unsuccessful: TimeoutError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 643, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 665, in _reset_controller
    await self.initialize()
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 76, in initialize
    await self.load_network_info(load_devices=False)
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 257, in load_network_info
    brd_manuf, brd_name, version = await self._get_board_info()
  File "/usr/local/lib/python3.10/site-packages/bellows/zigbee/application.py", line 117, in _get_board_info
    return await self._ezsp.get_board_info()
  File "/usr/local/lib/python3.10/site-packages/bellows/ezsp/__init__.py", line 299, in get_board_info
    (value,) = await self.getMfgToken(token)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
home-assistant_crash_2022_11_20.log
There is no crash when I do the same for older devices of the network.
I also can’t add new devices to the coordinator following the restore by zigpy cli.
@MattWestb : I’ll try erasing the NVM portion tomorrow as you recommended. You say it is used to store keys?
@mguaylam If puddly does not find another way, I think flashing the NVM fix is one option. Reflashing EZSP 6.10.3 at the same time, while you have it hooked up, could also be good (but flash EZSP first and then the NVM fix).
The NVM fix writes an empty file over the area of flash where the token storage lives. When the SoC boots after the flash, it creates a new, clean NVM that should be OK, and all the old tokens are gone.
I can see that you wrote a new IEEE address when changing from the EM358X coordinator; that should not cause any problems as long as the old coordinator is not online within your radio range.
From a quick look at the log, I find it a little strange that the system reads the manufacturing tokens many times (and has problems the first time), when normally that is only done when initializing the coordinator and perhaps when puddly's code makes a new backup. There are also many timeouts, but it is hard to tell whether the coordinator or a slow system is causing them. Source routing is also not working properly, because the coordinator is not communicating fully with all devices and cannot get route records from devices that are not online (they are in the network and have the network key, but have not synced their frame counter with the coordinator, and their TC link key may be wrong).
I flashed the 6.10.3 firmware from : https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/ncp-uart-sw_v6.10.3_115200.gbl then https://github.com/xsp1989/zigbeeFirmware/blob/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP/nvm3_initfile.gbl but the coordinator was not responding correctly on the serial connection. I did the inverse and it worked.
I still can’t add new devices now, it’s very strange. I also confirmed the coordinator still crashes : home-assistant_after_erasing_NVM+crash.log
I indeed came from an EM358X (HUSBZB-1), which I got rid of a while ago. My system is a full-fledged server so I’d be surprised if there were performance issues there.
Received _reset_controller_application frame with (<NcpResetCode.RESET_ASSERT: 6>,) yet again.
Can you read out the current configuration of your adapter? Run: bellows --baudrate 115200 -d /dev/... config -a
bash-5.1# bellows --baudrate 115200 -d /dev/ttyUSB0 config -a
NOTE: Configuration changes do not persist across resets
CONFIG_PACKET_BUFFER_COUNT=75
CONFIG_NEIGHBOR_TABLE_SIZE=26
CONFIG_APS_UNICAST_MESSAGE_COUNT=20
CONFIG_BINDING_TABLE_SIZE=32
CONFIG_ADDRESS_TABLE_SIZE=32
CONFIG_MULTICAST_TABLE_SIZE=8
CONFIG_ROUTE_TABLE_SIZE=16
CONFIG_DISCOVERY_TABLE_SIZE=8
CONFIG_STACK_PROFILE=0
CONFIG_SECURITY_LEVEL=5
CONFIG_MAX_HOPS=30
CONFIG_MAX_END_DEVICE_CHILDREN=32
CONFIG_INDIRECT_TRANSMISSION_TIMEOUT=3000
CONFIG_END_DEVICE_POLL_TIMEOUT=8
CONFIG_TX_POWER_MODE=0
CONFIG_DISABLE_RELAY=0
CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE=0
CONFIG_SOURCE_ROUTE_TABLE_SIZE=200
CONFIG_FRAGMENT_WINDOW_SIZE=1
CONFIG_FRAGMENT_DELAY_MS=0
CONFIG_KEY_TABLE_SIZE=12
CONFIG_APS_ACK_TIMEOUT=1600
CONFIG_ACTIVE_SCAN_DURATION=3
CONFIG_END_DEVICE_BIND_TIMEOUT=60
CONFIG_PAN_ID_CONFLICT_REPORT_THRESHOLD=2
CONFIG_REQUEST_KEY_TIMEOUT=0
CONFIG_CERTIFICATE_TABLE_SIZE=0
CONFIG_APPLICATION_ZDO_FLAGS=0
CONFIG_BROADCAST_TABLE_SIZE=128
CONFIG_MAC_FILTER_TABLE_SIZE=0
CONFIG_SUPPORTED_NETWORKS=1
CONFIG_SEND_MULTICASTS_TO_SLEEPY_ADDRESS=0
CONFIG_ZLL_GROUP_ADDRESSES=1
CONFIG_ZLL_RSSI_THRESHOLD=128
CONFIG_MTORR_FLOW_CONTROL=1
CONFIG_RETRY_QUEUE_SIZE=16
CONFIG_NEW_BROADCAST_ENTRY_THRESHOLD=122
CONFIG_TRANSIENT_KEY_TIMEOUT_S=300
CONFIG_BROADCAST_MIN_ACKS_NEEDED=255
CONFIG_TC_REJOINS_USING_WELL_KNOWN_KEY_TIMEOUT_S=300
CONFIG_CTUNE_VALUE=128
CONFIG_ASSUME_TC_CONCENTRATOR_TYPE=1
Here are all of the changed config options:
# default => changed
CONFIG_ADDRESS_TABLE_SIZE: 32 => 16
CONFIG_MULTICAST_TABLE_SIZE: 8 => 16
CONFIG_PACKET_BUFFER_COUNT: 75 => 255
CONFIG_SOURCE_ROUTE_TABLE_SIZE: 200 => 16
CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 0 => 2
# Timeouts, probably not affecting anything
CONFIG_INDIRECT_TRANSMISSION_TIMEOUT: 3000 => 7680
CONFIG_TC_REJOINS_USING_WELL_KNOWN_KEY_TIMEOUT_S: 300 => 90
Can you try resetting the five above to their default values?
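The comparison above can be reproduced mechanically. Below is a minimal Python sketch that parses `KEY=VALUE` lines like the ones printed by `bellows ... config -a` and reports which overrides differ from the adapter's values. The helper names are hypothetical, the dump is shortened to a few lines from the output above, and `CONFIG_ROUTE_TABLE_SIZE` is included in the overrides only to show that an unchanged option is skipped.

```python
def parse_config_dump(text):
    """Parse CONFIG_*=<int> lines; any other line (such as the NOTE) is ignored."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.startswith("CONFIG_") and value.isdigit():
            config[key] = int(value)
    return config


def diff_overrides(firmware, overrides):
    """Return {option: (firmware_value, override_value)} for values that differ."""
    return {
        key: (firmware[key], value)
        for key, value in overrides.items()
        if key in firmware and firmware[key] != value
    }


# Shortened copy of the adapter output quoted in this thread.
dump = """\
NOTE: Configuration changes do not persist across resets
CONFIG_PACKET_BUFFER_COUNT=75
CONFIG_ADDRESS_TABLE_SIZE=32
CONFIG_MULTICAST_TABLE_SIZE=8
CONFIG_ROUTE_TABLE_SIZE=16
CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE=0
CONFIG_SOURCE_ROUTE_TABLE_SIZE=200
"""

# The five overrides listed above, plus one unchanged option for illustration.
overrides = {
    "CONFIG_ADDRESS_TABLE_SIZE": 16,
    "CONFIG_MULTICAST_TABLE_SIZE": 16,
    "CONFIG_PACKET_BUFFER_COUNT": 255,
    "CONFIG_SOURCE_ROUTE_TABLE_SIZE": 16,
    "CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE": 2,
    "CONFIG_ROUTE_TABLE_SIZE": 16,
}

changed = diff_overrides(parse_config_dump(dump), overrides)
for name, (default, new) in sorted(changed.items()):
    print(f"{name}: {default} => {new}")
```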
@puddly is your default broadcast table size the same? 128 is humongous table size for broadcasts.
And yeah, I would bump the trust center address cache to at least 2, although this is more for overlapping joins.
The official firmware parameters for Sonoff EZSP 6.10.3.0, with CTUNE 128 to fix the mis-tuned radio : https://github.com/xsp1989/zigbeeFirmware/tree/master/firmware/Zigbee3.0_Dongle-NoSigned/EZSP#versions-and-changelog.
I’ve just restarted Home-Assistant with those parameters, is that what you asked for? 😄
zha:
  custom_quirks_path: /config/quirks/
  zigpy_config:
    source_routing: true
    ezsp_config:
      CONFIG_ADDRESS_TABLE_SIZE: 32
      CONFIG_MULTICAST_TABLE_SIZE: 8
      CONFIG_PACKET_BUFFER_COUNT: 75
      CONFIG_SOURCE_ROUTE_TABLE_SIZE: 200
      CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 0
home-assistant_log_after_changing_coordinator_config.log
Still can’t add new devices. 🫤
The good news is that there is no crash!
Maybe try changing CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2 as @Adminiuga suggested?
Oh! It still crashes, sorry, I forgot to test it : home-assistant_log_after_changing_coordinator_config+crash.log
With CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2 :
# ZHA
zha:
  custom_quirks_path: /config/quirks/
  zigpy_config:
    source_routing: true
    ezsp_config:
      CONFIG_ADDRESS_TABLE_SIZE: 32
      CONFIG_MULTICAST_TABLE_SIZE: 8
      CONFIG_PACKET_BUFFER_COUNT: 75
      CONFIG_SOURCE_ROUTE_TABLE_SIZE: 200
      CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2
home-assistant_log_with_TC_CACHE_2.log
(Can’t add + crash)
For reference, here are the zigpy overrides and the stack defaults for every stick config I am able to find:
Option | Zigpy | HUSBZB-1 | Sonoff | Generic EFR32MG21 | SkyConnect |
---|---|---|---|---|---|
ACTIVE_SCAN_DURATION | 3 | 3 | 3 | 3 | |
ADDRESS_TABLE_SIZE | 16 | 16 | 32 | 32 | 16 |
APPLICATION_ZDO_FLAGS | ... | 0 | 0 | 0 | 0 |
APS_ACK_TIMEOUT | 1600 | 1600 | 1600 | 1600 | |
APS_UNICAST_MESSAGE_COUNT | 20 | 32 | 20 | 20 | |
ASSUME_TC_CONCENTRATOR_TYPE | 1 | 1 | 1 | 1 | |
BINDING_TABLE_SIZE | 32 | 32 | 32 | 32 | |
BROADCAST_MIN_ACKS_NEEDED | 255 | 255 | 255 | 255 | |
BROADCAST_TABLE_SIZE | 15 | 35 | 128 | 15 | |
CERTIFICATE_TABLE_SIZE | 0 | 0 | 0 | ||
CTUNE_VALUE | 0 | 133 | 128 | 140 | |
DISABLE_RELAY | 0 | 0 | 0 | 0 | |
DISCOVERY_TABLE_SIZE | 8 | 8 | 8 | 8 | |
END_DEVICE_BIND_TIMEOUT | 60 | 60 | 60 | 60 | |
END_DEVICE_POLL_TIMEOUT | 8 | 8 | 8 | 8 | 8 |
FRAGMENT_DELAY_MS | 0 | 0 | 0 | 0 | |
FRAGMENT_WINDOW_SIZE | 1 | 1 | 1 | 1 | |
GP_PROXY_TABLE_SIZE | 5 | ||||
INDIRECT_TRANSMISSION_TIMEOUT | 7680 | 3000 | 3000 | 3000 | 3000 |
KEY_TABLE_SIZE | 12 | 12 | 12 | 12 | |
MAC_FILTER_TABLE_SIZE | 0 | 0 | 0 | 2 | |
MAX_END_DEVICE_CHILDREN | 32 | 32 | 32 | 32 | |
MAX_HOPS | 30 | 30 | 30 | 30 | |
MTORR_FLOW_CONTROL | 1 | 1 | 1 | 1 | |
MULTICAST_TABLE_SIZE | 16 | 8 | 8 | 8 | 16 |
NEIGHBOR_TABLE_SIZE | 16 | 26 | 26 | 16 | |
NEW_BROADCAST_ENTRY_THRESHOLD | 9 | 29 | 122 | 9 | |
PACKET_BUFFER_COUNT | 255 | 64 | 250 | 75 | 255 |
PAN_ID_CONFLICT_REPORT_THRESHOLD | 2 | 2 | 2 | 2 | 2 |
REQUEST_KEY_TIMEOUT | 0 | 0 | 0 | 0 | |
RETRY_QUEUE_SIZE | 16 | 16 | 16 | 16 | |
ROUTE_TABLE_SIZE | 16 | 16 | 16 | 16 | |
SECURITY_LEVEL | 5 | 5 | 5 | 5 | 5 |
SEND_MULTICASTS_TO_SLEEPY_ADDRESS | 0 | 0 | 0 | 0 | |
SOURCE_ROUTE_TABLE_SIZE | 16 | 200 | 200 | 200 | 200 |
STACK_PROFILE | 2 | 0 | 0 | 0 | 0 |
SUPPORTED_NETWORKS | 1 | 1 | 1 | 1 | 1 |
TC_REJOINS_USING_WELL_KNOWN_KEY_TIMEOUT_S | 90 | 300 | 300 | 300 | 300 |
TRANSIENT_KEY_TIMEOUT_S | 300 | 300 | 300 | 300 | |
TRUST_CENTER_ADDRESS_CACHE_SIZE | 2 | 0 | 0 | 0 | 0 |
TX_POWER_MODE | 0 | 0 | 0 | 0 | |
ZLL_GROUP_ADDRESSES | 1 | 1 | 1 | 0 | |
ZLL_RSSI_THRESHOLD | 128 | 128 | 128 | 216 | |
EZSP_GP_SINK_TABLE_SIZE | 0 |
If it's crashing with firmware defaults, I'm kind of tempted to blame the firmware (or merely "changing" a config value to its current value manages to change something). As a last resort, can you try this configuration that uses small table sizes for everything?
CONFIG_BROADCAST_TABLE_SIZE: 15
CONFIG_ADDRESS_TABLE_SIZE: 16
CONFIG_NEIGHBOR_TABLE_SIZE: 16
CONFIG_MULTICAST_TABLE_SIZE: 16
CONFIG_PACKET_BUFFER_COUNT: 75
CONFIG_SOURCE_ROUTE_TABLE_SIZE: 16
CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2
I agree with puddly!
One thing to consider is RF interference, which the Sonoff module is notorious for. If I remember correctly, when this device was released many users could not join devices until they moved it. Try placing it away from all USB 3 ports and other possible RF sources using a USB extension cable. You can also wrap some aluminum foil around the casing (but not the antenna) and connect it to the USB connector for grounding.
Also, more than 20 devices are joining the network at once, since they already have the network key, and I think that is a bit too much for the NCP. I also think all of the joining devices need to update their TC link key to the hashed one.
Edit: 28 Device_annce in the log.
Just to make sure, I added a USB extension cable. My server only has USB 2 anyway, so it was not a big worry for me from the start.
Trying to pair new devices still does not work, and the coordinator also crashes when devices rejoin. What’s weird is that I still get this issue with either my (currently used) generic easyIoT or the SonOff-E. On the easyIoT I also tried both 6.10.3 and 6.7.9 with the same result.
home-assistant_log_small_table_cant_pair+crash.log
zha:
  custom_quirks_path: /config/quirks/
  zigpy_config:
    source_routing: true
    ezsp_config:
      CONFIG_BROADCAST_TABLE_SIZE: 15
      CONFIG_ADDRESS_TABLE_SIZE: 16
      CONFIG_NEIGHBOR_TABLE_SIZE: 16
      CONFIG_MULTICAST_TABLE_SIZE: 16
      CONFIG_PACKET_BUFFER_COUNT: 75
      CONFIG_SOURCE_ROUTE_TABLE_SIZE: 16
      CONFIG_TRUST_CENTER_ADDRESS_CACHE_SIZE: 2
I also tried re-flashing, creating a new network and then restoring the old network, but the only devices I can add are the ones that were already in the network.
With the Sonoff-E, a USB extension cable should be more than enough; with the easyIoT it is a little trickier, but I do not think the problem is there. Are all your coordinators using the same IEEE address after all the upgrading and downgrading? If not, ZHA may have two devices with address 0x0000, and the inactive one must be deleted and ZHA restarted for things to work OK.
In this log the NCP is not restarting as much and only one device is trying to join (before it was 28), but ZHA is doing a backup at the same time, and the coordinator also needs to send unicasts to the joining device, which does not work because the routing is completely messed up.
As a result, we see this in the counters from bellows:
MAC_TX_UNICAST_SUCCESS = 436, MAC_TX_UNICAST_RETRY = 75, MAC_TX_UNICAST_FAILED = 11,
APS_DATA_TX_UNICAST_SUCCESS = 253, APS_DATA_TX_UNICAST_RETRY = 75, APS_DATA_TX_UNICAST_FAILED = 24,
ROUTE_DISCOVERY_INITIATED = 11, NEIGHBOR_ADDED = 54, NEIGHBOR_REMOVED = 22,
NWK_DECRYPTION_FAILURE = 74, APS_DECRYPTION_FAILURE = 58,
TYPE_NWK_RETRY_OVERFLOW = 28, PHY_CCA_FAIL_COUNT = 9,
I think you first need to get all the devices already in the system to update their security, so that normal routing and source routing work OK, before adding new devices. The coordinator cannot send all commands to the joining device by unicast while security is broken on most devices; APS is also needed for updating the routing table and for source-routing commands.
Try opening the network for joining and resetting one router that is directly connected to the coordinator, and get it joined properly. If that goes OK, rejoin more routers through that first good one so that they get their TC link key correctly. Joining through other routers that are not yet fixed may work, but is very likely to run into problems with unicasts and route requests failing.
Another way is to set up a new HA and ZHA from scratch and move the devices over router by router, then all the end devices, but I think you have more than a few devices, so it is not the first choice. You could also move just a few devices to see that it works (I am 100% sure it does!).
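To put the bellows counters quoted above into perspective, here is a minimal Python sketch that parses them and computes a rough unicast failure ratio per layer. The ratio definition, failed / (success + failed), is an assumption made for this sketch; how bellows or the EmberZNet stack accounts for retries inside these counters is not stated in this thread.

```python
import re

# Counter values copied from the comment above.
counters_text = """
MAC_TX_UNICAST_SUCCESS = 436, MAC_TX_UNICAST_RETRY = 75, MAC_TX_UNICAST_FAILED = 11,
APS_DATA_TX_UNICAST_SUCCESS = 253, APS_DATA_TX_UNICAST_RETRY = 75, APS_DATA_TX_UNICAST_FAILED = 24,
ROUTE_DISCOVERY_INITIATED = 11, NEIGHBOR_ADDED = 54, NEIGHBOR_REMOVED = 22,
NWK_DECRYPTION_FAILURE = 74, APS_DECRYPTION_FAILURE = 58,
TYPE_NWK_RETRY_OVERFLOW = 28, PHY_CCA_FAIL_COUNT = 9,
"""


def parse_counters(text):
    """Turn 'NAME = 123' pairs into a {name: int} dictionary."""
    return {name: int(value) for name, value in re.findall(r"([A-Z_]+) = (\d+)", text)}


def failure_ratio(counters, prefix):
    """Assumed definition: failed / (success + failed) for one counter family."""
    ok = counters[prefix + "_SUCCESS"]
    failed = counters[prefix + "_FAILED"]
    return failed / (ok + failed)


counters = parse_counters(counters_text)
mac_ratio = failure_ratio(counters, "MAC_TX_UNICAST")
aps_ratio = failure_ratio(counters, "APS_DATA_TX_UNICAST")
print(f"MAC unicast failure ratio: {mac_ratio:.1%}")
print(f"APS unicast failure ratio: {aps_ratio:.1%}")
```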
Are you crashing with the Sonoff-E running ITEAD's stock firmware?
In other words, the problem is not the coordinator: the network cannot route traffic, so pairing through routers fails, and because unicast does not work the coordinator cannot talk to the new device.
@puddly yes, I crash with the Itead stock firmware as well. 😣
Is ITEAD crashing with the same RESET ASSERT error from the coordinator?
Yes. 😞
Hrm, 🤔
What hardware are you running it on? Have you restored the network backup on ITead or did you form a new network?
It is a HPE Tower Server running Fedora with pods built on Podman. It only has USB 2 ports. I tried an extension cable this morning to be sure it wasn’t that. My Wi-Fi access point is on channel 11 and my ZigBee network is on channel 15. No overlapping.
The coordinator I'm using is the easyIoT on v.6.10.3 built by xsp1989 but I tried the SonOff-E coordinator with ITEAD v.6.10.3 firmware as well in case the coordinator itself was the issue.
I also tried the easyIoT with v.6.7.9 built by xsp1989 in the hope it would be a bug in the 6.10.3 firmware.
As for restoring, I did a restore on both of these coordinators, never a new network. I tried forming a new network on the easyIoT before restoring, but the result was the same.
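The "no overlapping" claim for Wi-Fi channel 11 and Zigbee channel 15 can be checked with the standard 2.4 GHz channel formulas. This is a minimal Python sketch with hypothetical function names; the nominal widths (20 MHz for Wi-Fi, 2 MHz for Zigbee) are simplifying assumptions, and real-world interference also depends on transmit power, sidebands and distance.

```python
def zigbee_center_mhz(channel):
    """Center frequency of an IEEE 802.15.4 2.4 GHz channel (11 to 26)."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels are 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel):
    """Center frequency of a 2.4 GHz Wi-Fi channel (1 to 13)."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch covers Wi-Fi channels 1 to 13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel, wifi_channel, wifi_width_mhz=20.0, zigbee_width_mhz=2.0):
    """True if the nominal channel bands intersect (widths are assumptions)."""
    distance = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return distance < (wifi_width_mhz + zigbee_width_mhz) / 2


print(zigbee_center_mhz(15), wifi_center_mhz(11), overlaps(15, 11))
```

With these formulas, Zigbee channel 15 is centered at 2425 MHz and Wi-Fi channel 11 at 2462 MHz, so the nominal bands are well separated.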
I think I'm onto something. I need to analyze more; I can send also the capture file in private. But first thing I see, when I unscrew and re-screw the bulb as shown in the GIF after it tries to join the network (0xbd0f) :
Then I can see an insane number of network address conflicts coming from my 4 Sinope thermostats, and they go by very fast. I’m wondering if that’s what crashes my coordinator :
The thing is, now I’m wondering if my issues started exactly after I added those thermostats to my network. But I never heard of anyone having such issue with their Sinope thermostats. What is also strange is that those address conflicts only comes from my Sinope thermostats, nothing else.
Then I see a lot of beacons and I’m not sure where they are coming from, I see some from smart meters from my electricity company (but only a few) :
Then a non-tree link failure from my unscrewed bulb :
Then the network seems to start :
Do Philips bulbs flash like that when there is an address conflict? Would the address conflicts broadcast by the thermostats be the reason the coordinator crashes?
If I remember correctly, old Hue bulbs blink when they are rejoining the network. (They belong in the black box of bad Zigbee devices, because they do not delete children that have moved to another parent or left the network.)
I do not think the coordinator is the problem at the moment; I think the network is stalling because so many devices are out of sync (they all have the network key, but their frame counters are not OK and they cannot update everything because the TC link key is new for all devices). With so many address and routing problems, the mesh network cannot sync or heal. Are the TRVs routers, or sleepy or non-sleepy end devices? If they are end devices, they do no routing at all and only talk to their parents. But, as in my Hue issue, if they have moved to another parent and the Hue lights report them as belonging to more than one router, then you have a real address conflict (caused by the Hues).
If I were you, I would set up a new network with a different PAN ID, extended PAN ID and network key on another channel (a different channel is not a must, but it is safer), and add or move only a few devices to see if they work OK. If you sniff the traffic, you can see whether you get strange routing problems or network conflicts in the mesh.
Another option is to shut down all devices and power on only a few at a time, trying to get them working and synced. I think that is the same amount of work as a new network if all goes well, but much more if it does not.
Also, all devices need to get the new TC link key and must be reset to get it, which is the same work as making a new network.
The problem
Issue
Coordinator crashes when a new device joins or rejoins the network.
ERROR (MainThread) [bellows.ezsp] NCP entered failed state. Requesting APP controller restart
Steps to reproduce the issue
Add a new device to the network or reapply mains power to it.
The controller fails immediately. It may take more or less time for Home-Assistant to be able to reset the controller. When adding a new device, it can end up stuck indefinitely until it is reset. Restarting Home-Assistant entirely usually helps.
Effect
Philips Hue lights start a flashing pattern in sync while the controller is failed. IKEA devices behave similarly, but with a stroboscopic effect. The coordinator is not reachable, so it is not possible to control devices.
Coordinator
Generic Aliexpress EFR32MG21 : https://www.aliexpress.com/item/1005003578599189.html?spm=a2g0o.order_list.0.0.46f01802MxU13p
Hue light
What version of Home Assistant Core has the issue?
core-2022.11.2
What type of installation are you running?
Home Assistant Container
Integration causing the issue
ZHA
Link to integration documentation on our website
https://www.home-assistant.io/integrations/zha/
Diagnostics information
home-assistant.log Here’s my network backup as I wonder if something is not wrong with it : ZHA backup 2022-11-12T16-58-07.216Z.txt
Example YAML snippet