home-assistant / addons

:heavy_plus_sign: Docker add-ons for Home Assistant
https://home-assistant.io/hassio/
Apache License 2.0
1.52k stars 1.48k forks source link

Silicon Labs Multiprotocol addon failing intermittently causing HAP, Thread and Zigbee issues #3192

Open evakq8r opened 1 year ago

evakq8r commented 1 year ago

Describe the issue you are experiencing

Every few days to a week, the Silicon Labs Multiprotocol addon will stop communicating and indicate 'resource temporarily unavailable'. yet does not make mention of which resource this is. When this addon stops, it breaks my light integration (Nanoleaf via Thread using HAP) and Zigbee sensors via ZHA.

Restarting the addon doesn't fix the issue, nor does restarting home assistant. Usually requires a complete reboot of the host, and even then sometimes it will repeatedly indicate 'resource temporarily unavailable'.

What type of installation are you running?

Home Assistant OS

Which operating system are you running on?

Home Assistant Operating System

Which add-on are you reporting an issue with?

Silicon Labs Multiprotocol

What is the version of the add-on?

2.3.2

Steps to reproduce the issue

Wish I knew, as it fails whenever it wants to (sometimes 3am in the morning, sometimes 5pm in the afternoon).

System Health information

System Information

version core-2023.8.4
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.11.4
os_name Linux
os_version 6.1.37
arch x86_64
timezone Australia/Adelaide
config_dir /config
Home Assistant Community Store GitHub API | ok -- | -- GitHub Content | ok GitHub Web | ok GitHub API Calls Remaining | 5000 Installed Version | 1.32.1 Stage | running Available Repositories | 1335 Downloaded Repositories | 37
AccuWeather can_reach_server | ok -- | -- remaining_requests | 14
Home Assistant Cloud logged_in | false -- | -- can_reach_cert_server | ok can_reach_cloud_auth | ok can_reach_cloud | failed to load: unreachable
Home Assistant Supervisor host_os | Home Assistant OS 11.0.dev20230705 -- | -- update_channel | beta supervisor_version | supervisor-2023.08.3 agent_version | 1.5.1 docker_version | 23.0.6 disk_total | 30.8 GB disk_used | 23.3 GB healthy | true supported | true board | ova supervisor_api | ok version_api | ok installed_addons | Samba share (10.0.2), Network UPS Tools (0.12.0), Matter Server (4.9.0), Silicon Labs Multiprotocol (2.3.2), Mosquitto broker (6.2.1), Zigbee2MQTT (1.32.2-1), Custom deps deployment (1.3.3), Home Assistant Google Drive Backup (0.111.1), ESPHome (2023.8.2), Terminal & SSH (9.7.1), Studio Code Server (5.10.1), PS5 MQTT (1.3.1), Whisper (1.0.0), Piper (1.3.2), Advanced SSH & Web Terminal (15.0.7), Silicon Labs Flasher (0.2.0)
Dashboards dashboards | 3 -- | -- resources | 24 views | 11 mode | storage
Recorder oldest_recorder_run | 16 August 2023 at 08:32 -- | -- current_recorder_run | 25 August 2023 at 20:34 estimated_db_size | 933.92 MiB database_engine | sqlite database_version | 3.41.2

Anything in the Supervisor logs that might be useful for us?

Nothing relating to these issues.

Anything in the add-on logs that might be useful for us?

Logger: bellows.zigbee.application
Source: /usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py:643
First occurred: 10:49:25 (4040 occurrences)
Last logged: 17:00:02

ControllerApplication reset unsuccessful: ConnectionRefusedError(111, "Connect call failed ('172.30.32.1', 9999)")
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 640, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 661, in _reset_controller
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 133, in connect
    self._ezsp = await bellows.ezsp.EZSP.initialize(self.config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 164, in initialize
    await ezsp.connect(use_thread=zigpy_config[conf.CONF_USE_THREAD])
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 181, in connect
    self._gw = await bellows.uart.connect(self._config, self, use_thread=use_thread)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 414, in connect
    protocol, _ = await _connect(config, application)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 385, in _connect
    transport, protocol = await zigpy.serial.create_serial_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/serial.py", line 31, in create_serial_connection
    transport, protocol = await loop.create_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1085, in create_connection
    raise exceptions[0]
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1069, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 973, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 628, in sock_connect
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 668, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)

Additional information

HAOS is hosted in a VM on my Unraid server. My Unraid server is still able to see and interact with the USB device(s) when HAOS fails to. For context:

image

I have the SkyConnect USB as well as a Sonoff Zigbee 3.0 USB Dongle Plus V2” (model “ZBDongle-E”). I mainly use the Skyconnect for everything, and the Sonoff is a recent purchase. Both are flashed with the latest version of the MultiPAN firmware.

I have tried both the stable and beta version of HAOS, no change to the outcome. Sometimes it works for several days, sometimes it fails > 5 times a day.

evakq8r commented 1 year ago

Update:

Only way to restore Home Assistant to a functional state was to unplug the SkyConnect from its USB port, replug, then reboot Unraid. Simply turning the VM for HAOS off and on nor just rebooting the host without the replug was not enough.

FWIW, I have a PCIe USB Hub card in the server, however the SkyConnect USB is in the motherboards USB2.0 port, rather than the hub, to try and limit the hub being a cause.

spartandrew18 commented 1 year ago

@agners I would like to report some similar experience that may not be entirely related to the same error you are receiving. However, I am having similar symptoms. Recently I added my HA SkyConnect to the Apple Border Router thread network via using the TLC Dataset provided from Nanoleaf. I have my devices all on one mesh network and things work great when they are up. For some reason my SkyConnect will just shut off. I believe it is being overloaded somehow. The only solution to the problem is to unplug and replug the device. Below is what Silicon Labs Multiprotocol addon logs look like.

core_silabs_multiprotocol_2023-09-01T00-59-20.370Z.log

BEFORE CRASH OF SKYCONNECT image

AFTER CRASH OF SKYCONNECT image

agners commented 1 year ago

@spartandrew18 it seems that the otbr-agent is continuing to restart in a loop?

Do you maybe have the logs from when things first failed? You can get more logs through the CLI:

ha host logs --boot 0 --lines 100000 -t addon_core_silabs_multiprotocol

You can also redirect into a location where you have access to so you can upload the logs:

ha host logs --boot 0 --lines 100000 -t addon_core_silabs_multiprotocol > /config/multiprotocol.log
evakq8r commented 1 year ago

New Update:

Silicon Labs Multiprotocol addon just crashed:

WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #176 : Property-get::PROP_ENDPOINT_STATE aborted
WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #210 : Could not read endpoint state on the secondary
WARNING in function 'system_send_open_endpoint_ack' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #122 : Failed to acknowledge the open request for endpoint #12. Bad file descriptor

Repeats itself every 100ms, which caused my HA VM to hang in responsiveness. Had to force the VM to reboot from Unraid. The addon attempts to startup after the VM/HAOS reboot, but same thing occurs:

[23:30:56:630555] Info : New client connection using library v4.3.1.0
otbr-agent[212293]: 53d.04:20:11.933 [C] Platform------: CPC endpoint open failed
otbr-agent[212293]: 53d.04:20:11.933 [C] Platform------: mCpcBusSpeed = 115200
otbr-agent[212293]: 53d.04:20:11.933 [C] Platform------: Init() at radio.cpp:121: Failure
close cpc endpoint: Resource temporarily unavailable
[23:30:57] INFO: otbr-agent ended with exit code 1 (signal 0)...
OTBR_FORWARD_INGRESS  all opt    in * out wpan0  ::/0  -> ::/0  
Chain OTBR_FORWARD_INGRESS (0 references)
target     prot opt source               destination         
DROP       all      anywhere             anywhere             PKTTYPE = unicast
DROP       all      anywhere             anywhere             match-set otbr-ingress-deny-src src
ACCEPT     all      anywhere             anywhere             match-set otbr-ingress-allow-dst dst
DROP       all      anywhere             anywhere             PKTTYPE = unicast
ACCEPT     all      anywhere             anywhere            
otbr-ingress-deny-src
otbr-ingress-deny-src-swap
otbr-ingress-allow-dst
otbr-ingress-allow-dst-swap
OTBR_FORWARD_EGRESS  all opt    in wpan0 out *  ::/0  -> ::/0  
Chain OTBR_FORWARD_EGRESS (0 references)
target     prot opt source               destination         
ACCEPT     all      anywhere             anywhere            
[23:30:57] INFO: OTBR firewall teardown completed.
[23:30:57:782940] Info : Client disconnected

Thread and Zigbee (as expected) have failed:-

ZHA:

Logger: zigpy.application
Source: /usr/local/lib/python3.11/site-packages/zigpy/application.py:198
First occurred: 23:25:23 (8 occurrences)
Last logged: 23:30:49

Couldn't start application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 195, in startup
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 138, in connect
    await ezsp.connect(use_thread=self.config[CONF_USE_THREAD])
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 179, in connect
    self._gw = await bellows.uart.connect(self._config, self, use_thread=use_thread)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 414, in connect
    protocol, _ = await _connect(config, application)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 385, in _connect
    transport, protocol = await zigpy.serial.create_serial_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/serial.py", line 36, in create_serial_connection
    transport, protocol = await loop.create_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1085, in create_connection
    raise exceptions[0]
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1069, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 973, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 628, in sock_connect
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 668, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)

HomeKit Bridge (for the Thread lights via HAP) do not show any useful logs, however all 30 integrations are in constant reload mode.

Rebooting the VM did not resolve the issue. Unraid can still see the NabuCasa Skyconnect plugged in and responding:

image

lsusb -d 10c4:ea60 -v

Bus 005 Device 002: ID 10c4:ea60 Silicon Labs CP210x UART Bridge
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0        64
  idVendor           0x10c4 Silicon Labs
  idProduct          0xea60 CP210x UART Bridge
  bcdDevice            1.00
  iManufacturer           1 Nabu Casa
  iProduct                2 SkyConnect v1.0
  iSerial                 3 <redacted>
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength       0x0020
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0 
    bmAttributes         0x80
      (Bus Powered)
    MaxPower              100mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x82  EP 2 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               0
Device Status:     0x0000
  (Bus Powered)

The fix for this at the moment (this time around anyway) was a full server reboot, but it took two attempts and several manual reboots of the addon to make it work and talk with the USB, which is less than ideal.

For what it's worth, the SkyConnect USB is plugged directly into the motherboards USB port (ASRock Rack X570-2L2T) using the included extension cable that come with the unit, to avoid interference.

Any help with this would be appreciated. Thanks.

spartandrew18 commented 1 year ago

@spartandrew18 it seems that the otbr-agent is continuing to restart in a loop?

Do you maybe have the logs from when things first failed? You can get more logs through the CLI:

ha host logs --boot 0 --lines 100000 -t addon_core_silabs_multiprotocol

You can also redirect into a location where you have access to so you can upload the logs:

ha host logs --boot 0 --lines 100000 -t addon_core_silabs_multiprotocol > /config/multiprotocol.log

multiprotocol.log

agners commented 1 year ago

Thanks for the logs. Some interesting snippets:

Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: radio tx timeout
Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: RCP failure detected
Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: Trying to recover (1/100)
Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: Wait for response timeout
Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: RCP failure detected
Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: Trying to recover (2/100)
Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: Wait for response timeout
Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: RCP failure detected
Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: Trying to recover (3/100)
Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: Wait for response timeout
Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: RCP failure detected
Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: Trying to recover (4/100)
Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: Wait for response timeout
Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: RCP failure detected
Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: Trying to recover (5/100)
Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: Wait for response timeout
Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: RCP failure detected
Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: Trying to recover (6/100)
Aug 30 04:22:55 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12
Aug 30 04:22:55 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE
Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:394752] WARNING : In function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12
Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:394768] WARNING : In function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE
Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:495051] Info : Client disconnected

It seems the OTBR detects that the radio isn't communicating anymore (it can't receive frames any longer). The CPC daemon similarly detects a "retransmit timeout"

WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12

The OTBR then ends up in a restart loop, presumably for the same reason (the radio stopped responding):

Aug 30 04:32:58 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:58] INFO: Starting otbr-agent...
...
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:58:732529] Info : New client connection using library v4.3.1.0
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #176 : Property-get::PROP_ENDPOINT_STATE aborted
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #210 : Could not read endpoint state on the secondary
...
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: close cpc endpoint: Success
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:59] INFO: otbr-agent ended with exit code 1 (signal 0)...

The second error in the log looks very similar.

spartandrew18 commented 1 year ago

Thanks for the logs. Some interesting snippets:


Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: radio tx timeout

Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: RCP failure detected

Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: Trying to recover (1/100)

Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: Wait for response timeout

Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: RCP failure detected

Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: Trying to recover (2/100)

Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: Wait for response timeout

Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: RCP failure detected

Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: Trying to recover (3/100)

Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: Wait for response timeout

Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: RCP failure detected

Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: Trying to recover (4/100)

Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: Wait for response timeout

Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: RCP failure detected

Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: Trying to recover (5/100)

Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: Wait for response timeout

Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: RCP failure detected

Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: Trying to recover (6/100)

Aug 30 04:22:55 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12

Aug 30 04:22:55 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE

Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:394752] WARNING : In function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12

Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:394768] WARNING : In function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE

Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:495051] Info : Client disconnected

It seems the OTBR detects that the radio isn't communicating anymore (it can't receive frames any longer). The CPC daemon similarly detects a "retransmit timeout"


WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12

The OTBR then ends up in a restart loop, presumably for the same reason (the radio stopped responding):


Aug 30 04:32:58 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:58] INFO: Starting otbr-agent...

...

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:58:732529] Info : New client connection using library v4.3.1.0

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #176 : Property-get::PROP_ENDPOINT_STATE aborted

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #210 : Could not read endpoint state on the secondary

...

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: close cpc endpoint: Success

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:59] INFO: otbr-agent ended with exit code 1 (signal 0)...

The second error in the log looks very similar.

Is it something to do with my hardware setup? Or is it possible it can fixed via software?

evakq8r commented 1 year ago

Further update; new crash, same outcome but slight variation in the logs:

ZHA:-

Logger: bellows.zigbee.application
Source: runner.py:186
First occurred: 10:25:19 (5 occurrences)
Last logged: 10:26:19

Watchdog heartbeat timeout: TimeoutError()

Silicon Labs Multiprotocol:-

[10:54:39] INFO: Starting zigbeed...
[10:54:39] INFO: Setup OTBR firewall...
[10:54:39] INFO: Starting otbr-agent...
otbr-agent[95918]: [NOTE]-AGENT---: Running 0.3.0
otbr-agent[95918]: [NOTE]-AGENT---: Thread version: 1.3.0
otbr-agent[95918]: [NOTE]-AGENT---: Thread interface: wpan0
otbr-agent[95918]: [NOTE]-AGENT---: Radio URL: spinel+cpc://cpcd_0?iid=2&iid-list=0
otbr-agent[95918]: [NOTE]-ILS-----: Infra link selected: enp5s0
otbr-agent[95918]: [INFO]-NCP-----: OpenThread log level changed to 5
otbr-agent[95918]: 50d.15:48:26.792 [C] Platform------: CPC init failed. Ensure radio-url argument has the form 'spinel+cpc://cpcd_0?iid=<1..3>'
otbr-agent[95918]: 50d.15:48:26.792 [C] Platform------: Init() at cpc_interface.cpp:96: Failure
close cpc endpoint: Resource temporarily unavailable
[10:54:41] INFO: otbr-agent ended with exit code 1 (signal 0)...

Specifically the following line:

otbr-agent[95918]: 50d.15:48:26.792 [C] Platform------: CPC init failed. Ensure radio-url argument has the form 'spinel+cpc://cpcd_0?iid=<1..3>

I don't believe have seen this specific error, but everything else is the same.

Restarting the SLM addon shows the following logs:

s6-rc: info: service mdns: starting
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service mdns successfully started
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/check-cpcd-shm.sh
[11:20:14] INFO: Starting mDNS Responder...
Default: mDNSResponder (Engineering Build) (Aug 23 2023 18:27:14) starting
cont-init: info: /etc/cont-init.d/check-cpcd-shm.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service banner: starting
-----------------------------------------------------------
 Add-on: Silicon Labs Multiprotocol
 Zigbee and OpenThread multiprotocol add-on
-----------------------------------------------------------
 Add-on version: 2.3.2
 You are running the latest version of this add-on.
 System: Home Assistant OS 11.0.dev20230705  (amd64 / qemux86-64)
 Home Assistant Core: 2023.9.0
 Home Assistant Supervisor: 2023.08.3
-----------------------------------------------------------
 Please, share the above information when looking for help
 or support in, e.g., GitHub, forums or the Discord chat.
-----------------------------------------------------------
s6-rc: info: service banner successfully started
s6-rc: info: service universal-silabs-flasher: starting
[11:20:15] INFO: Checking /dev/ttyUSB0 identifying SkyConnect v1.0 from Nabu Casa.
[11:20:15] INFO: Starting universal-silabs-flasher with /dev/ttyUSB0
2023-09-07 11:20:15 homeassistant universal_silabs_flasher.flash[185] INFO Extracted GBL metadata: NabuCasaMetadata(metadata_version=1, sdk_version='4.3.1', ezsp_version=None, ot_rcp_version=None, fw_type=<FirmwareImageType.RCP_UART_802154: 'rcp-uart-802154'>, baudrate=460800)
2023-09-07 11:20:15 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.GECKO_BOOTLOADER at 115200 baud
2023-09-07 11:20:17 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.CPC at 460800 baud
2023-09-07 11:20:21 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.CPC at 115200 baud
2023-09-07 11:20:26 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.CPC at 230400 baud
2023-09-07 11:20:30 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.EZSP at 115200 baud
2023-09-07 11:20:35 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.SPINEL at 460800 baud
Error: Failed to probe running application type
s6-rc: warning: unable to start service universal-silabs-flasher: command exited 1
/run/s6/basedir/scripts/rc.init: warning: s6-rc failed to properly bring all the services up! Check your logs (in /run/uncaught-logs/current if you have in-container logging) for more information.
/run/s6/basedir/scripts/rc.init: fatal: stopping the container.
s6-rc: info: service mdns: stopping
s6-rc: info: service banner: stopping
Default: mDNSResponder (Engineering Build) (Aug 23 2023 18:27:14) stopping
s6-rc: info: service banner successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
[11:20:40] INFO: mDNS ended with exit code 4 (signal 0)...
s6-rc: info: service mdns successfully stopped

The fix now is a physical unplug and replug of the USB into the server, then a full host restart. Simple reboot of the VM did not change the outcome.

It also looks like others are reporting similar issues (#3193) so it is not an isolated incident.

Devs, please advise what further information you require to resolve/investigate this. Physical host restarts to resolve a VM issue are far from ideal (in my case anyway) and we'd like to be able to assist to resolve this problem as soon as practical.

MattWestb commented 1 year ago

Have some with this problem one ESP module and can try using it as serial port over Ethernet ? It was working before but have not testing latest versions but it shall being fixed if having one separate RCP module connected to the ESP.

puddly commented 1 year ago

If you would like to try experimental firmware for the SkyConnect that should eliminate the need for physical resets (source https://github.com/NabuCasa/silabs-firmware-builder/pull/33), I've attached a pre-compiled copy here: skyconnect_rcp-uart-802154_4.3.1_wd.gbl.zip (zipped so it can be uploaded to GitHub, unzip it first).

If you have any step-by-step information for how to reproduce this issue, it would be very useful!

(CC @kylev, from #3193)

kylev commented 1 year ago

Not much interesting about my setup (city setting, very busy 2.4Ghz band). I'm on 2023.9.2 with a mix of Wifi (Sengled, Cync, generic), zigbee (misc mfg), and 3 Nanoleaf bulbs.

Flashing with the web tool seemed to work, but the probed GBL metadata seems identical during boot. Should your build appear differently?

NabuCasaMetadata(metadata_version=1, sdk_version='4.3.1', ezsp_version=None, ot_rcp_version=None, fw_type=<FirmwareImageType.RCP_UART_802154: 'rcp-uart-802154'>, baudrate=460800)

simulti_firmware_3192.log

puddly commented 1 year ago

Thanks!

The metadata will be identical, especially when probing. This also makes sure the addon doesn't re-flash the bundled firmware.

evakq8r commented 1 year ago

If you would like to try experimental firmware for the SkyConnect that should eliminate the need for physical resets (source NabuCasa/silabs-firmware-builder#33), I've attached a pre-compiled copy here: skyconnect_rcp-uart-802154_4.3.1_wd.gbl.zip (zipped so it can be uploaded to GitHub, unzip it first).

If you have any step-by-step information for how to reproduce this issue, it would be very useful!

(CC @kylev, from #3193)

I've installed the experimental firmware and so far so good. Unfortunately I haven't had the failures again since my last batch of logs, but nothing has changed... so the inconsistencies aren't helping resolve the issue.

As for steps on replicating the issue, there-in lies the problem; I don't actually do anything to trigger the problems, they just happen. I've setup additional logging from the unRAID host in the event the registered USB ID disconnects or does something weird from its perspective, but HA doesn't give any more indication of what caused the failure other than what I provided on original report.

kylev commented 1 year ago

Overnight, the container went into the cpcdcrash-loop with Property-get::PROP_ENDPOINT_STATE aborted and Property-get::PROP_ENDPOINT_STATE aborted errors. I restarted the container, and it entered the probing instead of detection stage and failed.

The new firmware didn't seem to have an effect.

puddly commented 1 year ago

Perfect, thank you for the feedback. The watchdog isn't integrated tightly into CPC so it looks like just the CPC part is crashing, not the whole firmware. I'll post an updated one later next week.

kylev commented 1 year ago

Just to ensure clarity: I had to un-plug and re-plug the SkyConnect. The new firmware didn't seem to have an effect.

Whatever initially breaks with the dongle triggers the cpcd crash loop but leaves the container up. Subsequently restarting the whole container results in the firmware flasher attempting to probe the device at multiple baudrates, failing, and crashing the container (and triggering watchdog restart).

After re-plugging the dongle, the container comes up on the next watchdog restart.

I suspect if you tweak the cpcd finish script to crash the container we'd see 1 cpcd crash followed by many probe failures.

kylev commented 1 year ago

Seems reproducible: same symptoms this morning (crash loop cpcd, restarting container results in firmware flasher probe failures, replug SkyConnect fixes it).

I've just re-flashed the SkyConnect with your build again, just to ensure I didn't make an error last time. Will comment if anything changes.

spartandrew18 commented 1 year ago

Any progress made on the fix?

theblackhole commented 11 months ago

To everyone having this problem, are you all using Nanoleaf devices? Because in my case, it's been a while since I had to restart the addon + unplug-replug my skyconnect. I don't know if it's pure luck because it's random, if it's the latest Nanoleaf firmware update (3.5.41) or the various HA updates I installed but it's stable a the moment.

DunklerPhoenix commented 11 months ago

Nope. Not a single nanoleaf. Only hue, aqara, ikea and some thread devices.

evakq8r commented 11 months ago

To everyone having this problem, are you all using Nanoleaf devices? Because in my case, it's been a while since I had to restart the addon + unplug-replug my skyconnect. I don't know if it's pure luck because it's random, if it's the latest Natoleaf firmware update (3.5.41) or the various HA updates I installed but it's stable a the moment.

Unfortunately that firmware only applies to the Matter over Thread Nanoleaf bulbs. The HomeKit over Thread (non-Matter) bulbs latest firmware is 1.6.49.

I've started to add Matter over Thread Nanoleaf bulbs to my mix of things, and out of the box they are 3.2.0, which the Android Nanoleaf indicates has a 'critical firmware upgrade', pointing to 3.5.37 as the FW in question. They all default to 3.5.41 once upgraded though.

theblackhole commented 11 months ago

it's been a while since I had to restart the addon + unplug-replug my skyconnect.

I spoke too soon, I had to restart/unplug-replug today.

evakq8r commented 11 months ago

it's been a while since I had to restart the addon + unplug-replug my skyconnect.

I spoke too soon, I had to restart/unplug-replug today.

Mine also failed yesterday as well. I've now split my Skyconnect to do just OTBR work (Sonoff ZBDongle-E to do the Zigbee work), as well as put both USB devices on extension cables to minimise interference.

Skyconnect crashed again about 6 hours after the split duties, and needed a replug to a different USB port to come back to life.

puddly commented 11 months ago

Can you describe the Zigbee and Thread devices you have on your networks? Any Zigbee Green Power?

I've been able to replicate a crash and will try to get a firmware out that possibly mitigates it but there may be multiple concurrent bugs here causing issues.

DunklerPhoenix commented 11 months ago

I have the same crashes and have:

Zigbee (values from ZHA): 1x FYRTUR block-out roller blind (IKEA of Sweden) 1x LCT012 (Hue) 3x LCT015 1x lumi.airmonitor.acn01 (aqara) 2x lumi.curtain.acn002 (aqara) 5x lumi.magnet.acn001 (aqara) 1x lumi.remote.b1acn01 (aqara) 1x lumi.sen_ill.mgl01 (xiaomi) 1x lumi.sensor_cube (aqara) 3x lumi.sensor_motion.aq2 (aqara) 2x lumi.vibration.aq1 (aqara) 6x lumi.weather (aqara) 1x Remote Control N2 (IKEA of Sweden) 1x RWL021 (hue) 4x TRADFRI bulb GU10 CWS 345lm (IKEA of Sweden) 2x TRADFRI on/off switch (IKEA of Sweden) 1x TRADFRI open/close remote (IKEA of Sweden) 1x TRADFRI remote control (IKEA of Sweden) 4x TRADFRI Signal Repeater (IKEA of Sweden) 12x TRADFRIbulbE14WWclear250lm (since 10 months not at power, but paired) (IKEA of Sweden) 1x TRADFRIbulbE27WSglobeopal1055lm (IKEA of Sweden) 1x TRADFRIbulbE27WWclear250lm (IKEA of Sweden)

Thread (via Homekit Controller): 8x Eve Thermo 20EBP1701 (eve systems) -> Takes sometimes hours to reconnect after restart of sky connect

Wifi (via Homekit Controller): 1x Netatmo Main Module 1x Indoor Module 1x Outdoor Module

evakq8r commented 11 months ago

Think I'll need to close this issue out as my network topology has changed substantially since first logging. I've now split OTBR and Zigbee duties into two devices (OTBR being Skyconnect, Zigbee being a Sonoff ZBDongle-E until my second Skyconnect turns up). Skyconnect is now OpenThreadRCP only, Sonoff is now Zigbee only.

To answer your question though;

Zigbee: 18x Aqara motion sensors 1x Door lock 2x Zigbee magnetic door sensors 5x Zigbee USB Repeaters

Thread: 5x Nanoleaf Matter over Thread bulbs 26x Nanoleaf Homekit over Thread bulbs

I'm in the midst of replacing the HK bulbs with MoT, for various reasons.

I've removed the multiprotocol addon and reconfigured my Zigbee channel to be elsewhere (multiprotocol required the channels to be the same), and reinstalled the standalone OTBR addon in HACS.

MattWestb commented 11 months ago

@DunklerPhoenix Id the Eve Thermo 20EBP1701 one sleeper / reduced thread device that can only being one child or one full future thread device that can being router / leader ?

DunklerPhoenix commented 11 months ago

I'm not sure if I understand you correctly.

The Eve Thermo is just a child and cant be used as a router or leader. It's battery powered and so I think a reduced thread device. But I'm not sure. sry

https://www.evehome.com/de/eve-thermo

MattWestb commented 11 months ago

@DunklerPhoenix OK do you have FTD device that can "holding" the network alive (leader) then the RCP is not online ? If not the RCP / addon must being the leader but its taking time for the network being OK after it was offline and getting the child getting back in it.

DunklerPhoenix commented 11 months ago

Ahhh. No they cant hold the network alive. And I dont have any other leader than sky Connect

MattWestb commented 11 months ago

Then its working as expected then the MTDs have lost the network and need coming back and syncing to the network and its takes time then its sleepers. By one Nanaoleaf light for 20€ and you have the network living also then SC is offline and they is not losing the network and the SC only need joining the network and its taking some seconds.

Its the same with Zigbee need one backbone of routers for getting the end device working OK and also getting it working then the coordinator is offline (was 3 weeks on holiday with the laptop and all device was online after 60 min then coming back and connecting the coordinator then the network was living all the time only rejoining the coordinator).

DunklerPhoenix commented 11 months ago

but this doesn't explain the random crashes. the sky connect is then not responding and the multiprotocol plugin can't connect to it until the skyconnect gets a fresh start (plug out plug in)

That the network is not working when the sky connect is not there is understandable, but the main problem is before this

puddly commented 11 months ago

@DunklerPhoenix it's not working because there are firmware bugs with the multiprotocol firmware that need to be fixed by Silicon Labs.

If you can describe your setup, what devices you have, and what you can do to reliably trigger the crash, it would really help with fixing the problem much more quickly. I've been able to replicate one bug and Silicon Labs is aware of the issue but there may be other, independent ones.

DunklerPhoenix commented 11 months ago

my devices are listed here: https://github.com/home-assistant/addons/issues/3192#issuecomment-1761824139 But I can't say what triggers the crash. It just happens. I'll think I wait now a bit how silicon labs reacts on the replicated bug and try to find a trigger.

puddly commented 11 months ago

I'll think I wait now a bit how silicon labs reacts on the replicated bug and try to find a trigger.

This can take months, unfortunately, and the one I reported may not be the only one. Any observations are helpful and you can help everyone significantly if you can help us track your bug down!

I'm going to try to compile a firmware over the weekend that omits a feature (Zigbee Green Power) responsible for the bug I noted. If you want to test that out, that would also be helpful!

MattWestb commented 11 months ago

@puddly If you can triggering the bug try only lower the baudrate in the build if the chip is not liking the baudrate: 460800 and the HW flow control is not working OK its can being stalled.

4.3.2 have some interesting fixes ;-)

puddly commented 11 months ago

Here is a build of 4.3.1 for the SkyConnect with the watchdog and without Zigbee Green Power support. Try it out and see if it still crashes for you: skyconnect_wd_nogp_4.3.1.gbl.zip

@MattWestb I will try raising the baudrate to see if that causes it to become more unstable then!

MattWestb commented 11 months ago

@puddly Did you looking on the latest release of EZSP and GSDK 7/4.3.2 then its some bug fixes for the CPC part that can making some problems that is being fixed.

DunklerPhoenix commented 11 months ago

@puddly I flashed your file. The first interesting observation is that the networks (zigbee & thread) started really fast. Normaly thread took over 3 hours minimum. Now it didn't take 20 min.

I will let this firmware run for one or two days to look if its crashing again.

neffs commented 11 months ago

Here is a build of 4.3.1 for the SkyConnect with the watchdog and without Zigbee Green Power support.

@puddly: I flashed it yesterday and it just crashed, unfortunately no improvement for me. I don't have any GP devices, mostly IKEA and some Hue, Aqara, Ledvance and Tuya devices. Crashes 1-2 times per day.

EDIT: Thread is used for 2 Eve Energy devices.

puddly commented 11 months ago

@neffs to make sure we're on the same page, what do you mean by "crashes" here? Does rebooting fix the crash? Or do you physically have to unplug the SkyConnect to fix it?

On the Thread side, what devices do you have?

neffs commented 11 months ago

@neffs to make sure we're on the same page, what do you mean by "crashes" here? Does rebooting fix the crash? Or do you physically have to unplug the SkyConnect to fix it?

I usually reboot the host (HA OS on RPI4), but unplugging Skyconnect + restarting the Addon works as well. I have an automation in place which restarts the host when a certain Zigbee device is reported offline.

On the Thread side, what devices do you have?

2x Eve Energy

puddly commented 11 months ago

Can you upload debug logs from the multiprotocol addon?

Rebooting the host should not allow the firmware to recover so I think your instability may be unrelated to the problem in this specific issue.

neffs commented 11 months ago

Can you upload debug logs from the multiprotocol addon?

First crash last night. I removed the first ~70 MB, until a few minutes before errors appeared. Coprocessor tracing was enabled. multiprotocol_with_coprocessor_tracing.txt

After rebooting the host I disabled tracing and set otbr loglevel to debug. It crashed again a few hours later. This should be the interesting part (I changed some IP addresses):

Oct 20 04:41:51 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:09.920 [I] MeshForwarder-:     src:[fe80:0:0:0:xxxx:xxxx:xxxx:xxxx]:19788
Oct 20 04:41:51 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:09.920 [I] MeshForwarder-:     dst:[ff02:0:0:0:0:0:0:1]:19788
Oct 20 04:41:51 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:09.920 [I] Mle-----------: Receive Advertisement (fe80:0:0:0:xxxx:xxxx:xxxx:xxxx,0x2000)
Oct 20 04:41:52 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:10.847 [I] MeshForwarder-: Received IPv6 UDP msg, len:89, chksum:443b, ecn:no, from:0x2000, sec:yes, prio:normal, rss:-76.0
Oct 20 04:41:52 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:10.848 [I] MeshForwarder-:     src:[fd47:91d7:ce31:1:zzzz:zzzz:zzzz:zzzz]:5540
Oct 20 04:41:52 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:10.848 [I] MeshForwarder-:     dst:[aaaa:bbbb::cccc]:51036
Oct 20 04:41:55 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:13.629 [I] RoutingManager: Sent Neighbor Solicitation to fe80:0:0:0: yyyy: yyyy: yyyy:yyyy - attempt:1/5
Oct 20 04:41:55 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:13.629 [I] RoutingManager: Received NA from router fe80:0:0:0:yyyy:yyyy:yyyy:yyyy
Oct 20 04:41:57 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:15.855 [W] Platform------: radio tx timeout
Oct 20 04:41:57 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:15.855 [W] Platform------: RCP failure detected
Oct 20 04:41:57 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:15.855 [W] Platform------: Trying to recover (1/100)
Oct 20 04:41:57 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:15.855 [I] Platform------: RCP reset: RESET_SOFTWARE
Oct 20 04:41:59 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:17.884 [I] Platform------: Software reset RCP successfully
Oct 20 04:42:01 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:19.900 [W] Platform------: Wait for response timeout
Oct 20 04:42:01 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:19.900 [W] Platform------: RCP failure detected
Oct 20 04:42:01 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:19.900 [W] Platform------: Trying to recover (2/100)
Oct 20 04:42:01 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:19.900 [I] Platform------: RCP reset: RESET_SOFTWARE
Oct 20 04:42:03 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:21.916 [I] Platform------: Software reset RCP successfully
Oct 20 04:42:05 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:23.932 [W] Platform------: Wait for response timeout
Oct 20 04:42:05 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:23.932 [W] Platform------: RCP failure detected
Oct 20 04:42:05 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:23.932 [W] Platform------: Trying to recover (3/100)
Oct 20 04:42:05 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:23.932 [I] Platform------: RCP reset: RESET_SOFTWARE
Oct 20 04:42:07 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:25.948 [I] Platform------: Software reset RCP successfully
Oct 20 04:42:09 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:27.964 [W] Platform------: Wait for response timeout
Oct 20 04:42:09 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:27.964 [W] Platform------: RCP failure detected
Oct 20 04:42:09 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:27.964 [W] Platform------: Trying to recover (4/100)
Oct 20 04:42:09 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:27.965 [I] Platform------: RCP reset: RESET_SOFTWARE
Oct 20 04:42:11 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:29.980 [I] Platform------: Software reset RCP successfully
Oct 20 04:42:13 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:31.996 [W] Platform------: Wait for response timeout
Oct 20 04:42:13 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:31.996 [W] Platform------: RCP failure detected
Oct 20 04:42:13 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:31.996 [W] Platform------: Trying to recover (5/100)
Oct 20 04:42:13 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:31.997 [I] Platform------: RCP reset: RESET_SOFTWARE
Oct 20 04:42:15 homeassistant addon_core_silabs_multiprotocol[562]: otbr-agent[306]: 03:08:34.012 [I] Platform------: Software reset RCP successfully
Oct 20 04:42:16 homeassistant addon_core_silabs_multiprotocol[562]: WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12
Oct 20 04:42:16 homeassistant addon_core_silabs_multiprotocol[562]: WARNING in function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE
Oct 20 04:42:17 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:16:874628] WARNING : In function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12
Oct 20 04:42:17 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:16:874796] WARNING : In function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE
Oct 20 04:42:17 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:16:976274] Info : Client disconnected
Oct 20 04:42:17 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:16:976530] Info : New client connection using library v4.3.1.0
Oct 20 04:42:17 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:16:976772] Info : Client disconnected
Oct 20 04:42:17 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:16:977548] Info : New client connection using library v4.3.1.0
Oct 20 04:42:36 homeassistant addon_core_silabs_multiprotocol[562]: WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #0
Oct 20 04:42:36 homeassistant addon_core_silabs_multiprotocol[562]: WARNING in function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1122 : System endpoint in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE. Restarting it.
Oct 20 04:42:36 homeassistant addon_core_silabs_multiprotocol[562]: WARNING in function 'sl_cpc_system_reset_system_endpoint' in file /usr/src/cpc-daemon/server_core/system_endpoint/system.c at line #498 : Dropping system command id #3 seq#12
Oct 20 04:42:36 homeassistant addon_core_silabs_multiprotocol[562]: WARNING in function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE
Oct 20 04:42:36 homeassistant addon_core_silabs_multiprotocol[562]: WARNING in function 'on_disconnect_notification' in file /usr/src/cpc-daemon/server_core/core/core.c at line #225 : Failed to receive disconnection notification for ep#12
Oct 20 04:42:37 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:36:980025] WARNING : In function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #0
Oct 20 04:42:37 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:36:980077] WARNING : In function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1122 : System endpoint in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE. Restarting it.
Oct 20 04:42:37 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:36:980088] WARNING : In function 'sl_cpc_system_reset_system_endpoint' in file /usr/src/cpc-daemon/server_core/system_endpoint/system.c at line #498 : Dropping system command id #3 seq#12
Oct 20 04:42:37 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:36:980098] WARNING : In function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE
Oct 20 04:42:37 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:36:980109] WARNING : In function 'on_disconnect_notification' in file /usr/src/cpc-daemon/server_core/core/core.c at line #225 : Failed to receive disconnection notification for ep#12
Oct 20 04:42:38 homeassistant addon_core_silabs_multiprotocol[562]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #176 : Property-get::PROP_ENDPOINT_STATE aborted
Oct 20 04:42:38 homeassistant addon_core_silabs_multiprotocol[562]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #210 : Could not read endpoint state on the secondary
Oct 20 04:42:39 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:38:980366] WARNING : In function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #176 : Property-get::PROP_ENDPOINT_STATE aborted
Oct 20 04:42:39 homeassistant addon_core_silabs_multiprotocol[562]: [06:42:38:980454] WARNING : In function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #210 : Could not read endpoint state on the secondary
theblackhole commented 11 months ago

FYI the 4.3.2/7.3.2 firmware has just been released: https://github.com/NabuCasa/silabs-firmware/commit/b89f14925ddd83ac62df34cb715afc75c476b644

puddly commented 11 months ago

4.3.2 crashes so frequently (i.e. within a minute) that it is completely unusable, unfortunately.

The newly introduced bug seems unrelated so we will have to wait until 4.3.3 is released, hopefully with a fix for both.

theblackhole commented 11 months ago

Oh that too bad. Thanks for the info!

I couldn't even try myself because if I uncheck the Automatically flash firmware option (to avoid a downgrade to 4.3.1) I get this error in an endless loop

Info : Failed to connect, secondary seems unresponsive
Info : Connecting to Secondary...

(I already had this behavior with previous firmwares)

MattWestb commented 11 months ago

@neffs and @puddly Try disabling the OTBR and see if its working better and if yes disabling the Zigbeed and see if its working OK with only OTBR. Looks like its coming from EP12 = Zigbee but the message that is crashing it can also coming from the Thread endpoint but its not so intense used as the Zigbee ones. But if the OTBR handles is crashing the RCP firmware bot endpoints is blocked then its not one real multi taking firmware only task switching future (ala old MACs).

neffs commented 11 months ago

@MattWestb Zigbee works without issues for almost 4 days now. I'll try "OTBR only" next weekend.

MattWestb commented 11 months ago

Great @neffs then i think ts one very strange combination that is doing it and not only some GP frames that is triggering it but Stefan and Puddly is our hope finding and fixing it if not Silabs is doing in the firmware / master of the addon.