home-assistant / addons

Docker add-ons for Home Assistant
https://home-assistant.io/hassio/
Apache License 2.0

Silicon Labs Multiprotocol: all of a sudden, increased CPU usage on HAOS #3139

Closed · Anto79-ops closed 1 year ago

Anto79-ops commented 1 year ago

Describe the issue you are experiencing

Hello,

I just recently enabled multi-pan on my SkyConnect on HAOS/RPi and ZHA.

Home Assistant 2023.7.2
Supervisor 2023.07.1
Operating System 10.3
Frontend 20230705.1 - latest
Silicon Labs Multiprotocol: 2.2.0
ZHA, Open Thread Border Router, Thread
Matter 4.8.3

I don't actually have any Thread or Matter devices (I tried to install a Nanoleaf, but it simply would not work); I have 2 Thread devices now.

I do have 5 Google Nest Hubs acting as Thread border routers, and I added my SkyConnect (firmware 4.3) to the same network, so I have a single Thread network with 6 Thread border routers.

I had to change my ZHA channel from 20 to 16.

Every once in a while (almost every 2 days), I notice increased CPU usage of HA, and Glances tells me it's coming from the Silicon Labs Multiprotocol add-on (see towards the end of this CPU plot).

[image: CPU usage plot]

The logs for the add-on (or at least what I could copy and paste) are full of this:

otbr-agent[314]: 02:17:15.577 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:17:15.577 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:17:16.569 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:125, chksum:2edc, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:17:16.570 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:17:16.570 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:17:17.397 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:17:17.525 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:17:17.536 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:17:17.585 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:125, chksum:2bff, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:17:17.585 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:17:17.585 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:18:41.026 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:96, chksum:cda1, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:18:41.026 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:18:41.026 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:18:47.238 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:18:47.276 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:18:47.283 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:18:47.310 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:18:47.400 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:18:47.439 [W] Platform------: Handle transmit done failed: ChannelAccessFailure
otbr-agent[314]: 02:18:47.439 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:125, chksum:6a5f, ecn:no, to:86139233a83bba9e, sec:no, error:ChannelAccessFailure, prio:net
otbr-agent[314]: 02:18:47.440 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:18:47.440 [N] MeshForwarder-:     dst:[fe80:0:0:0:8413:9233:a83b:ba9e]:19788
otbr-agent[314]: 02:18:48.427 [W] Platform------: Handle transmit done failed: ChannelAccessFailure
otbr-agent[314]: 02:18:48.427 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:125, chksum:5a7a, ecn:no, to:86139233a83bba9e, sec:no, error:ChannelAccessFailure, prio:net
otbr-agent[314]: 02:18:48.427 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:18:48.427 [N] MeshForwarder-:     dst:[fe80:0:0:0:8413:9233:a83b:ba9e]:19788
otbr-agent[314]: 02:22:54.556 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:22:54.561 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:22:59.183 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:25:02.201 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:107, chksum:2d52, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:25:02.201 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:25:02.201 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:25:02.379 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:107, chksum:ee99, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:25:02.379 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:25:02.379 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:26:19.104 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:26:19.602 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:26:43.206 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:107, chksum:082b, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:26:43.206 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:26:43.206 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:26:43.395 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:107, chksum:7b2e, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:26:43.395 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:26:43.396 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:26:47.596 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:26:47.733 [W] Platform------: Handle transmit done failed: ChannelAccessFailure
otbr-agent[314]: 02:26:47.733 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:125, chksum:bf46, ecn:no, to:1a8c68c672e778cf, sec:no, error:ChannelAccessFailure, prio:net
otbr-agent[314]: 02:26:47.734 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:26:47.734 [N] MeshForwarder-:     dst:[fe80:0:0:0:188c:68c6:72e7:78cf]:19788
otbr-agent[314]: 02:28:00.302 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:28:00.492 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:125, chksum:55b6, ecn:no, to:86139233a83bba9e, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:28:00.492 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:28:00.492 [N] MeshForwarder-:     dst:[fe80:0:0:0:8413:9233:a83b:ba9e]:19788
otbr-agent[314]: 02:28:01.444 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:125, chksum:50de, ecn:no, to:86139233a83bba9e, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:28:01.444 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:28:01.445 [N] MeshForwarder-:     dst:[fe80:0:0:0:8413:9233:a83b:ba9e]:19788
otbr-agent[314]: 02:28:54.121 [W] Platform------: Handle transmit done failed: ChannelAccessFailure
otbr-agent[314]: 02:28:54.121 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:96, chksum:9d71, ecn:no, to:ea21177b8c9e79e0, sec:no, error:ChannelAccessFailure, prio:net
otbr-agent[314]: 02:28:54.121 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:28:54.121 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:28:55.096 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:96, chksum:b308, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:28:55.096 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:28:55.096 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:28:56.112 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:96, chksum:e24e, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:28:56.112 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:28:56.113 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:29:21.663 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:21.663 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:21.843 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:138, chksum:cecc, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:29:21.843 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:29:21.844 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:29:23.562 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:23.575 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:23.597 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:23.608 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:23.616 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:23.634 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:23.644 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:23.659 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:23.668 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:29:56.014 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:95, chksum:0817, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:29:56.014 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:29:56.015 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:30:23.118 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:96, chksum:cf0a, ecn:no, to:86139233a83bba9e, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:30:23.118 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:30:23.118 [N] MeshForwarder-:     dst:[fe80:0:0:0:8413:9233:a83b:ba9e]:19788
otbr-agent[314]: 02:30:24.131 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:96, chksum:7b94, ecn:no, to:86139233a83bba9e, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:30:24.131 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:30:24.131 [N] MeshForwarder-:     dst:[fe80:0:0:0:8413:9233:a83b:ba9e]:19788
otbr-agent[314]: 02:30:48.952 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:95, chksum:01a3, ecn:no, to:ea21177b8c9e79e0, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:30:48.952 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:30:48.952 [N] MeshForwarder-:     dst:[fe80:0:0:0:e821:177b:8c9e:79e0]:19788
otbr-agent[314]: 02:31:03.200 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:31:03.220 [W] Mle-----------: Failed to process UDP: Duplicated
otbr-agent[314]: 02:31:04.029 [N] MeshForwarder-: Failed to send IPv6 UDP msg, len:138, chksum:03b2, ecn:no, to:023c48ca8451ecb7, sec:no, error:NoAck, prio:net
otbr-agent[314]: 02:31:04.030 [N] MeshForwarder-:     src:[fe80:0:0:0:dca7:fa1:377c:5f11]:19788
otbr-agent[314]: 02:31:04.030 [N] MeshForwarder-:     dst:[fe80:0:0:0:3c:48ca:8451:ecb7]:19788
otbr-agent[314]: 02:31:04.940 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:31:05.902 [W] Mle-----------: Failed to process Link Request: Already
otbr-agent[314]: 02:31:05.943 [W] Mle-----------: Failed to process Link Request: Already

Stopping the add-on and restarting it solves the issue.

What type of installation are you running?

Home Assistant OS

Which operating system are you running on?

Home Assistant Operating System

Which add-on are you reporting an issue with?

Silicon Labs Multiprotocol

What is the version of the add-on?

2.2.0

Steps to reproduce the issue

  1. Not sure; it just happens.
  2. ...

System Health information

System Information

version core-2023.7.2
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.11.4
os_name Linux
os_version 6.1.21-v8
arch aarch64
timezone America/Edmonton
config_dir /config
Home Assistant Community Store
GitHub API | ok
GitHub Content | ok
GitHub Web | ok
GitHub API Calls Remaining | 5000
Installed Version | 1.32.1
Stage | running
Available Repositories | 1337
Downloaded Repositories | 20

Home Assistant Cloud
logged_in | true
subscription_expiration | August 21, 2023 at 6:00 PM
relayer_connected | true
relayer_region | us-east-1
remote_enabled | true
remote_connected | true
alexa_enabled | false
google_enabled | true
remote_server | us-east-1-7.ui.nabu.casa
certificate_status | ready
can_reach_cert_server | ok
can_reach_cloud_auth | ok
can_reach_cloud | ok

Home Assistant Supervisor
host_os | Home Assistant OS 10.3
update_channel | beta
supervisor_version | supervisor-2023.07.1
agent_version | 1.5.1
docker_version | 23.0.6
disk_total | 457.7 GB
disk_used | 23.7 GB
healthy | true
supported | true
board | rpi4-64
supervisor_api | ok
version_api | ok
installed_addons | Advanced SSH & Web Terminal (15.0.3), Home Assistant Google Drive Backup (0.111.1), File editor (5.6.0), Samba share (10.0.2), AdGuard Home (4.8.11), Z-Wave JS UI (1.14.0), Glances (0.19.2), AppDaemon (0.13.1), Grafana (9.0.2), ESPHome (2023.6.5), Whisper (0.2.1), Piper (0.1.3), SQLite Web (3.9.0), Matter Server (4.8.3), Silicon Labs Multiprotocol (2.2.0)

Dashboards
dashboards | 11
resources | 14
views | 24
mode | storage

Recorder
oldest_recorder_run | July 5, 2023 at 1:12 PM
current_recorder_run | July 16, 2023 at 11:03 PM
estimated_db_size | 1271.98 MiB
database_engine | sqlite
database_version | 3.41.2

Spotify
api_endpoint_reachable | ok

Anything in the Supervisor logs that might be useful for us?

No response

Anything in the add-on logs that might be useful for us?

No response

Additional information

No response

MattWestb commented 1 year ago

Restart the add-on and very likely it will start OK. It's a known problem that makes the mDNS eat 100% of one CPU; it has always been in the code, but the cause has still not been found.

Anto79-ops commented 1 year ago

Yes, definitely restarting the add-on solves the issue. Thanks.

Anto79-ops commented 1 year ago

As expected, restarting the add-on is only a temporary fix.

It's kind of strange: it seems to happen every time other add-ons are being updated. I don't know if that makes sense.

Anto79-ops commented 1 year ago

I just wanted to provide an update that I'm still seeing this issue from time to time. However, it's kind of strange: it's like every time I start or restart other add-ons, this add-on all of a sudden has elevated CPU usage.

For example, today I was reconfiguring the Piper add-on (changing the voice settings): I stopped and restarted Piper, and I noticed my CPU was elevated even after several minutes, and it wasn't because of Piper, it was because of the Silicon Labs Multiprotocol add-on. I had to restart the Silicon Labs Multiprotocol add-on and the CPU calmed down again.

Again, it just seems to be a bizarre link with other add-ons that is causing this add-on to have elevated CPU.

Why would restarting other add-ons affect the Silicon Labs Multiprotocol add-on?

Anto79-ops commented 1 year ago

Just wanted to mention another instance of this happening: there were updates for 4 of my other add-ons (i.e. Z-Wave, AdGuard, SQLite, etc.), and again I noticed that after those updates were done my CPU was elevated, and it was the Silicon Labs Multiprotocol add-on that was elevated. After a restart of the add-on, it went back down.

[image: Screenshot_20230807_110617_Home Assistant.jpg]

Anybody know why this is happening? Is there some communication that breaks during updates that the multiprotocol add-on does not recover from?

MattWestb commented 1 year ago

With CLI access to the Linux OS, you can see it's the mDNS in the OS that is having problems, as if something is being locked. Every process that uses it can trigger it, and it looks like this add-on uses it the most of all, so it is the most likely to run into the problem.
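
One way to check this from the OS shell is to look at per-container CPU usage and then at the processes inside the add-on's container. A rough sketch, assuming the container is named addon_core_silabs_multiprotocol (an assumption on my part; docker ps shows the actual name):

# List per-container CPU usage to see which container is busy
docker stats --no-stream

# Show the processes inside the multiprotocol container
# (container name is an assumption -- verify with: docker ps);
# a runaway mdnsd would be pinned near 100% CPU here
docker top addon_core_silabs_multiprotocol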

Anto79-ops commented 1 year ago

That's an interesting point. An idea might be:

"maybe supervisor is doing some type of USB scan which is causing the adapter to malfunction"

I do have logs, but I suspect these logs were created because I restarted the Silicon Labs add-on. Next time I will keep an eye on the logs before restarting that add-on to confirm.

Logger: bellows.uart
Source: runner.py:179
First occurred: August 4, 2023 at 6:45:10 PM (4 occurrences)
Last logged: 11:06:20 AM

Lost serial connection: IndexError('index out of range')
Lost serial connection: ConnectionResetError('Remote server closed connection')
Logger: bellows.ezsp
Source: runner.py:179
First occurred: August 4, 2023 at 6:45:10 PM (3 occurrences)
Last logged: 11:06:20 AM

NCP entered failed state. Requesting APP controller restart
Logger: bellows.zigbee.application
Source: /usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py:643
First occurred: August 6, 2023 at 9:08:21 AM (10 occurrences)
Last logged: 11:06:42 AM

ControllerApplication reset unsuccessful: ConnectionRefusedError(111, "Connect call failed ('172.30.32.1', 9999)")
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 640, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 661, in _reset_controller
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 133, in connect
    self._ezsp = await bellows.ezsp.EZSP.initialize(self.config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 164, in initialize
    await ezsp.connect(use_thread=zigpy_config[conf.CONF_USE_THREAD])
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 181, in connect
    self._gw = await bellows.uart.connect(self._config, self, use_thread=use_thread)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 414, in connect
    protocol, _ = await _connect(config, application)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 385, in _connect
    transport, protocol = await zigpy.serial.create_serial_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/serial.py", line 31, in create_serial_connection
    transport, protocol = await loop.create_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1085, in create_connection
    raise exceptions[0]
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1069, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 973, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 628, in sock_connect
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 668, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)
Logger: bellows.ezsp.protocol
Source: runner.py:179
First occurred: August 6, 2023 at 9:08:49 AM (2 occurrences)
Last logged: 11:06:48 AM

Couldn't set EzspConfigId.CONFIG_PACKET_BUFFER_COUNT=255 configuration value: EzspStatus.ERROR_INVALID_ID
MattWestb commented 1 year ago

I can see you have updated the add-on, since the last error is new in the latest version, but it's not critical. The latest Zigbee stack has more fixes for communication and for resetting the RCP, which can help the communication with the RCP work better. I don't think it's directly connected with USB problems, since I was having it on a NanoPi Neo where I was not using USB, only internal COM ports.

Anto79-ops commented 1 year ago

You're right. I found this upcoming fix for the last error:

https://github.com/home-assistant/core/issues/97967

Hopefully this will solve the issue. Thanks.

agners commented 1 year ago

@MattWestb @Anto79-ops I have a build of the add-on with a new mDNSResponder version (1790.80.10). The mDNSResponder needed quite some patching, since the new upstream version of mDNSResponder was buggy, so I am not 100% confident. Also, the bump to this new version in upstream OpenThread hasn't been merged yet (see https://github.com/openthread/ot-br-posix/pull/1919).

However, I have it running in my production for a couple of days stable here.

You can try this "preview" with access to the OS shell (on port 22222, or the keyboard/screen terminal) using:

docker pull agners/aarch64-addon-silabs-multiprotocol:2.3.2
docker tag agners/aarch64-addon-silabs-multiprotocol:2.3.2 homeassistant/aarch64-addon-silabs-multiprotocol:2.3.1

Then stop and start the add-on (do not use restart, as that doesn't recreate the container).

Note: this is a hack; essentially it overwrites the current version with a different one. The image will get overwritten the next time the add-on receives an upgrade, but it works for testing.
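
For completeness, a sketch of the stop/start cycle from the same shell, and of restoring the stock image afterwards; the slug core_silabs_multiprotocol is an assumption, so verify it on your system:

# Stop and start (not restart) so the container gets recreated
# (slug is an assumption -- check it in the add-on's URL)
ha addons stop core_silabs_multiprotocol
ha addons start core_silabs_multiprotocol

# Later, re-pulling the official tag restores the stock image
# (an add-on update will also overwrite it automatically)
docker pull homeassistant/aarch64-addon-silabs-multiprotocol:2.3.1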

Would be interesting to see if this resolves the mDNS CPU spikes for you.

Anto79-ops commented 1 year ago

Hi @agners

Thank you very much for this opportunity to try out the patch.

I'd be happy to install it.

Just a quick question about installation: I have root access both from SSH (port 2222) and from the keyboard/screen terminal add-on.

Do I simply type the 2 lines in whatever directory I'm in the moment I log in, i.e. pwd is /root?

Thanks

agners commented 1 year ago

Do I simply type the 2 lines in whatever directory I'm in the moment I log in, i.e. pwd is /root?

Yes, it doesn't matter in which directory.

Anto79-ops commented 1 year ago

Thanks!

The add-on still says it's 2.3.1, but I suspect that is expected; just wanted to confirm that.

Here is the CLI output for the process:

[image: CLI output]

I stopped and started the add-on.

MattWestb commented 1 year ago

The 64-bit system is only running zigbeed, but the logs look like:

[18:22:02] INFO: The otbr-agent is disabled.
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/check-cpcd-shm.sh
cont-init: info: /etc/cont-init.d/check-cpcd-shm.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service banner: starting

-----------------------------------------------------------
 Add-on: Silicon Labs Multiprotocol
 Zigbee and OpenThread multiprotocol add-on
-----------------------------------------------------------
 Add-on version: 2.3.1
 You are running the latest version of this add-on.
 System: Armbian 23.08.0-trunk jammy  (aarch64 / raspberrypi3-64)
 Home Assistant Core: 2023.8.0
 Home Assistant Supervisor: 2023.07.1
-----------------------------------------------------------
 Please, share the above information when looking for help
 or support in, e.g., GitHub, forums or the Discord chat.
-----------------------------------------------------------
s6-rc: info: service banner successfully started
s6-rc: info: service universal-silabs-flasher: starting
[18:22:11] INFO: Flashing firmware is disabled
s6-rc: info: service universal-silabs-flasher successfully started
s6-rc: info: service cpcd-config: starting
[18:22:15] INFO: Generating cpcd configuration.
s6-rc: info: service cpcd-config successfully started
s6-rc: info: service cpcd: starting
[18:22:17] INFO: Starting cpcd...
WARNING in function 'main' in file /usr/src/cpc-daemon/main.c at line #186 : Running CPCd as 'root' is not recommended. Proceed at your own risk.
s6-rc: info: service cpcd successfully started
s6-rc: info: service zigbeed: starting
s6-rc: info: service zigbeed successfully started
s6-rc: info: service legacy-services: starting
s6-rc: info: service legacy-services successfully started
[18:22:18] INFO: Starting zigbeed...
[18:22:17:286153] Info : [CPCd v4.3.1.0] [Library API v3] [RCP Protocol v4]
[18:22:17:286342] Info : Git commit: 133b29678b3d0bc7578e098d2f46b4d5bcd2ebb4 / branch: 
[18:22:17:286349] Info : Sources hash: ff8300587e7e4ab1def7a89a272c0baef32f9eb3bff9b0ba06b94e655d652367
[18:22:17:286379] WARNING : In function 'main' in file /usr/src/cpc-daemon/main.c at line #186 : Running CPCd as 'root' is not recommended. Proceed at your own risk.
[18:22:17:286431] Info : Reading cli arguments
[18:22:17:286443] Info : /usr/local/bin/cpcd 
[18:22:17:287730] Info : Reading configuration
[18:22:17:287752] Info :   file_path = /usr/local/etc/cpcd.conf
[18:22:17:287757] Info :   instance_name = cpcd_0
[18:22:17:287763] Info :   socket_folder = /dev/shm
[18:22:17:287768] Info :   operation_mode = MODE_NORMAL
[18:22:17:287771] Info :   use_encryption = false
[18:22:17:287776] Info :   binding_key_file = /etc/binding-key.key
[18:22:17:287781] Info :   stdout_tracing = false
[18:22:17:287786] Info :   file_tracing = false
[18:22:17:287791] Info :   lttng_tracing = false
[18:22:17:287796] Info :   enable_frame_trace = false
[18:22:17:287800] Info :   traces_folder = /dev/shm/cpcd-traces
[18:22:17:287806] Info :   bus = UART
[18:22:17:287811] Info :   uart_baudrate = 230400
[18:22:17:287816] Info :   uart_hardflow = true
[18:22:17:287821] Info :   uart_file = /dev/ttyUSB0
[18:22:17:287826] Info :   fu_recovery_pins_enabled = false
[18:22:17:287831] Info :   fu_connect_to_bootloader = false
[18:22:17:287836] Info :   fu_enter_bootloader = false
[18:22:17:287841] Info :   restart_cpcd = false
[18:22:17:287846] Info :   application_version_validation = false
[18:22:17:287850] Info :   print_secondary_versions_and_exit = false
[18:22:17:287855] Info :   use_noop_keep_alive = false
[18:22:17:287860] Info :   reset_sequence = true
[18:22:17:287864] Info :   stats_interval = 0
[18:22:17:287869] Info :   rlimit_nofile = 2000
[18:22:17:287873] Info : ENCRYPTION IS DISABLED 
[18:22:17:287878] Info : Starting daemon in normal mode
[18:22:17:300198] Info : Connecting to Secondary...
[18:22:17:383456] Info : RX capability is 256 bytes
[18:22:17:383488] Info : Connected to Secondary
[18:22:17:388492] Info : Secondary Protocol v4
[18:22:17:398550] Info : Secondary CPC v4.3.1
[18:22:17:403432] Info : Secondary bus bitrate is 230400
[18:22:17:413512] Info : Secondary APP vUNDEFINED
[18:22:17:413777] Info : Daemon startup was successful. Waiting for client connections
[18:22:19:159791] Info : New client connection using library v4.3.1.0
[18:22:19:165697] Info : Opened connection socket for ep#12
[18:22:19:165935] Info : Endpoint socket #12: Client connected. 1 connections
Listening on port 9999 for connection...
Accepting connection.
Accepted connection 7.

I shall try updating the two 32-bit systems as well, but I must find the URL for doing it, so I need to look in Docker for it.

Edit: I had done a restart, so it was not OK; I have since done a stop/start and updated the log.

MattWestb commented 1 year ago

I was trying:

root@rk322x-box:~# docker pull agners/armv7-addon-silabs-multiprotocol:2.3.2
Error response from daemon: manifest for agners/armv7-addon-silabs-multiprotocol:2.3.2 not found: manifest unknown: manifest unknown

So either I have done it wrong, or the armv7 image is not built in the test repo.

agners commented 1 year ago

The add-on still says it's 2.3.1, but I suspect that is expected; just wanted to confirm that.

Yes, that is expected. The add-on itself reads the version from the container image tag.
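
To confirm which image is actually behind the identical tag, you can compare image IDs; a sketch, again assuming the container name addon_core_silabs_multiprotocol:

# The image ID used by the running container...
# (container name is an assumption -- verify with: docker ps)
docker inspect --format '{{.Image}}' addon_core_silabs_multiprotocol

# ...should match the ID of the freshly pulled test image
docker images --digests | grep silabs-multiprotocol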

Anto79-ops commented 1 year ago

The add-on still says it's 2.3.1, but I suspect that is expected; just wanted to confirm that.

Yes, that is expected. The add-on itself reads the version from the container image tag.

Awesome. By the way, there was an update for Z-Wave JS UI today (another add-on that I have), and I did not experience the elevated CPU after updating that add-on.

So far so good 👍

agners commented 1 year ago

So either I have done it wrong, or the armv7 image is not built in the test repo.

The armv7 image was missing. I just built and pushed it.

MattWestb commented 1 year ago

The RK3228 on Armbian Ubuntu (32-bit) is up and running and looks to be working OK:

[07:28:03] INFO: Starting cpcd...
WARNING in function 'main' in file /usr/src/cpc-daemon/main.c at line #186 : Running CPCd as 'root' is not recommended. Proceed at your own risk.
s6-rc: info: service cpcd successfully started
s6-rc: info: service zigbeed: starting
s6-rc: info: service otbr-agent: starting
s6-rc: info: service zigbeed successfully started
[07:28:03:151076] Info : [CPCd v4.3.1.0] [Library API v3] [RCP Protocol v4]
[07:28:03:151533] Info : Git commit: 133b29678b3d0bc7578e098d2f46b4d5bcd2ebb4 / branch: 
[07:28:03:151555] Info : Sources hash: ff8300587e7e4ab1def7a89a272c0baef32f9eb3bff9b0ba06b94e655d652367
[07:28:03:151647] WARNING : In function 'main' in file /usr/src/cpc-daemon/main.c at line #186 : Running CPCd as 'root' is not recommended. Proceed at your own risk.
[07:28:03:152535] Info : Reading cli arguments
[07:28:03:152621] Info : /usr/local/bin/cpcd 
[07:28:03:163435] Info : Reading configuration
[07:28:03:163494] Info :   file_path = /usr/local/etc/cpcd.conf
[07:28:03:163510] Info :   instance_name = cpcd_0
[07:28:03:163522] Info :   socket_folder = /dev/shm
[07:28:03:163533] Info :   operation_mode = MODE_NORMAL
[07:28:03:163543] Info :   use_encryption = false
[07:28:03:163553] Info :   binding_key_file = /etc/binding-key.key
[07:28:03:163564] Info :   stdout_tracing = false
[07:28:03:163574] Info :   file_tracing = false
[07:28:03:163584] Info :   lttng_tracing = false
[07:28:03:163593] Info :   enable_frame_trace = false
[07:28:03:163603] Info :   traces_folder = /dev/shm/cpcd-traces
[07:28:03:163613] Info :   bus = UART
[07:28:03:163623] Info :   uart_baudrate = 230400
[07:28:03:163633] Info :   uart_hardflow = true
[07:28:03:163643] Info :   uart_file = /dev/ttyUSB0
[07:28:03:163654] Info :   fu_recovery_pins_enabled = false
[07:28:03:163664] Info :   fu_connect_to_bootloader = false
[07:28:03:163675] Info :   fu_enter_bootloader = false
[07:28:03:163684] Info :   restart_cpcd = false
[07:28:03:163694] Info :   application_version_validation = false
[07:28:03:163704] Info :   print_secondary_versions_and_exit = false
[07:28:03:163714] Info :   use_noop_keep_alive = false
[07:28:03:163723] Info :   reset_sequence = true
[07:28:03:163733] Info :   stats_interval = 0
[07:28:03:163743] Info :   rlimit_nofile = 2000
[07:28:03:163754] Info : ENCRYPTION IS DISABLED 
[07:28:03:163763] Info : Starting daemon in normal mode
[07:28:03:178122] Info : Connecting to Secondary...
[07:28:03:267244] Info : RX capability is 256 bytes
[07:28:03:267319] Info : Connected to Secondary
[07:28:03:272287] Info : Secondary Protocol v4
[07:28:03:282216] Info : Secondary CPC v4.3.1
[07:28:03:287313] Info : Secondary bus bitrate is 230400
[07:28:03:297293] Info : Secondary APP vUNDEFINED
[07:28:03:297758] Info : Daemon startup was successful. Waiting for client connections
[07:28:04] INFO: Starting zigbeed...
[07:28:05:469119] Info : New client connection using library v4.3.1.0
[07:28:05:475796] Info : Opened connection socket for ep#12
[07:28:05:476362] Info : Endpoint socket #12: Client connected. 1 connections
[07:28:07] INFO: Setup OTBR firewall...
[07:28:07] INFO: Starting otbr-agent...
otbr-agent[328]: [NOTE]-AGENT---: Running 0.3.0
otbr-agent[328]: [NOTE]-AGENT---: Thread version: 1.3.0
otbr-agent[328]: [NOTE]-AGENT---: Thread interface: wpan0
otbr-agent[328]: [NOTE]-AGENT---: Radio URL: spinel+cpc://cpcd_0?iid=2&iid-list=0
otbr-agent[328]: [NOTE]-ILS-----: Infra link selected: eth0
otbr-agent[328]: 55d.12:58:13.748 [C] Platform------: mCpcBusSpeed = 115200
[07:28:07:838472] Info : New client connection using library v4.3.1.0
[07:28:07:844997] Info : Endpoint socket #12: Client connected. 2 connections
Listening on port 9999 for connection...
Accepting connection.
otbr-agent[328]: 00:00:00.127 [N] RoutingManager: BR ULA prefix: fd3e:8467:1471::/48 (loaded)
otbr-agent[328]: 00:00:00.127 [N] RoutingManager: Local on-link prefix: fdff:1111:1122:22ee::/64
otbr-agent[328]: 00:00:00.252 [N] Mle-----------: Role disabled -> detached
otbr-agent[328]: 00:00:00.277 [N] Platform------: [netif] Changing interface state to up.
s6-rc: info: service otbr-agent successfully started
s6-rc: info: service otbr-agent-rest-discovery: starting
s6-rc: info: service otbr-web: starting
s6-rc: info: service otbr-web successfully started
[07:28:10] INFO: Starting otbr-web...
otbr-web[413]: [INFO]-WEB-----: Running 0.3.0
otbr-web[413]: [INFO]-WEB-----: Border router web started on wpan0
otbr-agent[328]: 00:00:00.998 [N] Mle-----------: Role detached -> router
otbr-agent[328]: 00:00:01.001 [N] Mle-----------: Partition ID 0x3eb383d7
otbr-agent[328]: [NOTE]-BBA-----: BackboneAgent: Backbone Router becomes Primary!
otbr-agent[328]: 00:00:01.621 [W] Platform------: [netif] Failed to process request#9: Unknown error -95
otbr-agent[328]: 00:00:01.625 [W] Platform------: [netif] Failed to process request#10: Unknown error -95
Accepted connection 7.
otbr-agent[328]: 00:00:03.581 [N] MeshForwarder-: Dropping IPv6 UDP msg, len:138, chksum:14c7, ecn:no, sec:no, prio:net
otbr-agent[328]: 00:00:03.582 [N] MeshForwarder-:     src:[fe80:0:0:0:9847:1086:d2e0:5228]:19788
otbr-agent[328]: 00:00:03.582 [N] MeshForwarder-:     dst:[ff02:0:0:0:0:0:0:1]:19788
[07:28:15] INFO: Successfully sent discovery information to Home Assistant.
s6-rc: info: service otbr-agent-rest-discovery successfully started
s6-rc: info: service legacy-services: starting
s6-rc: info: service legacy-services successfully started
otbr-agent[328]: 00:00:07.976 [W] Platform------: radio tx timeout
otbr-agent[328]: 00:00:07.976 [W] Platform------: RCP failure detected
otbr-agent[328]: 00:00:07.977 [W] Platform------: Trying to recover (1/100)
otbr-agent[328]: 00:00:09.806 [N] Platform------: RCP recovery is done
otbr-agent[328]: 00:00:09.883 [N] MeshForwarder-: Dropping IPv6 UDP msg, len:138, chksum:929c, ecn:no, sec:no, prio:net
otbr-agent[328]: 00:00:09.884 [N] MeshForwarder-:     src:[fe80:0:0:0:9847:1086:d2e0:5228]:19788
otbr-agent[328]: 00:00:09.885 [N] MeshForwarder-:     dst:[ff02:0:0:0:0:0:0:1]:19788
otbr-agent[328]: 00:00:10.002 [W] Platform------: [netif] Failed to process request#11: Unknown error -17
otbr-agent[328]: 00:00:12.133 [W] Platform------: [netif] Failed to process request#12: Unknown error -17
otbr-agent[328]: 00:00:13.926 [N] MeshForwarder-: Dropping (reassembly queue) IPv6 UDP msg, len:171, chksum:6882, ecn:no, sec:yes, error:ReassemblyTimeout, prio:net, rss:-50.0
otbr-agent[328]: 00:00:13.928 [N] MeshForwarder-:     src:[fe80:0:0:0:bcf1:e126:7fd8:ce92]:19788
otbr-agent[328]: 00:00:13.929 [N] MeshForwarder-:     dst:[ff02:0:0:0:0:0:0:1]:19788

I am updating the NanoPi Neo after a restart of my laptop; it's like doing a Windows update. Shall I activate the OTBR on it and test? I had disabled it so as not to break the routing in HA, which was not working OK when more OTBRs were up and running.

Edit: Bringing up one more OTBR makes HA lose contact with the Thread devices, as expected, so it's a no-go.

Anto79-ops commented 1 year ago

Hi

I just wanted to provide an update that since using the "hacked" version of this add-on, I have not experienced elevated CPU readings, including when a few add-ons have been updated.

I'd be willing to close this issue, but perhaps give it a few more days or a few more add-on updates just to make sure. So far this is going really well. 👍 Thanks @agners

agners commented 1 year ago

On my production instance, I've noticed that the mDNS announcements for the _meshcop._udp (OpenThread) service stopped suddenly. This made the "Silicon Labs Multiprotocol" Thread border router instance disappear from the Thread panel. There was nothing unusual in the logs other than the occasional Failed to send IPv6 UDP msg, len:90, chksum:cba6, ecn:no, to:0xffff, sec:no, error:NoAck, prio:net every couple of hours (it doesn't seem to be a problem for the Thread communication, as my Thread devices all still worked just fine). However, my logs only went back ~3 days or so, so maybe a more serious issue was reported earlier and I did not notice.

So while I did not have the 100% CPU issue, this "lost discovery" is a bit concerning to me. I've restarted the add-on, the Thread instance appears again, and the DNS-SD messages are sent out again. I'll keep monitoring it.

Anyone else seen this?
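
For anyone who wants to check whether the _meshcop._udp announcements are still going out, you can browse for them from another machine on the same LAN; a sketch using avahi-utils, assuming it is installed there:

# Browse and resolve OpenThread border router services on the LAN;
# a healthy add-on shows up with its _meshcop._udp service instance
avahi-browse --resolve --terminate _meshcop._udp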

Anto79-ops commented 1 year ago

Hi,

I don't think I've experienced this. Checking my logs, I do see the Failed to send IPv6 UDP msg, len:90, chksum:cba6, ecn:no, to:0xffff, sec:no, error:NoAck, prio:net, but my SkyConnect has always been listed as a border router in the Thread panel. Unless it's intermittent?

let me know if you would like me to dig deeper.

Anto79-ops commented 1 year ago

Just wanted to provide another update, and again more good news: there have been more add-on updates and more time has passed, and I have not experienced elevated CPU usage since installing the hacked version of the Silicon Labs Multiprotocol add-on.

As far as the CPU utilization concern goes, I'd consider it solved. However, I have not been able to see the issue that you've experienced, so I will leave it up to you whether to close this issue or keep it open.

Thanks again @agners for your help, and please let me know how I can contribute more if needed.

itpeters commented 1 year ago

Hi @agners -- just found this report after some googling while trying to understand why my border router kept disappearing from my Thread panel in HA. I've been experiencing the exact same cluster of symptoms (the border router isn't showing up in the Thread panel, currently configured Matter-over-Thread devices continue functioning, and the multiprotocol add-on is consuming one full CPU core).

Is there any benefit to you having another person pull your custom docker image for testing? I see that you marked #3169 as ready for review a few hours ago so if there's no need for additional testing I'm happy to just wait.

Thanks!

Anto79-ops commented 1 year ago

Hi all, I just wanted to confirm the details of the disappearing Thread border router device issue. When it happens, do you have to take any action to get it back, like a reload or something like that?

Maybe the closest thing I've experienced to this is that when I open the Thread panel, it takes a second or two for the border router devices to populate, but they have always repopulated the times I've done it. The SkyConnect was always on the Thread panel.

agners commented 1 year ago

@itpeters I just merged that commit, so it will soon appear as an add-on update.

As for disappearing Thread border router from the Thread panel: This is a different issue it seems. I first thought it was a regression of #3169 (since I was running a version with that change), but since you were running the version before (and also others reported they observed the same while running the old version), it doesn't seem to be a regression.

The disappearing Thread border router warrants another issue. If anyone catches the add-on behave like that, please post a new issue along with the add-on logs.
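
The add-on logs can be grabbed from the add-on's Log tab or from a shell; a sketch using the ha CLI, with the slug again being an assumption:

# Dump the add-on's current log
# (slug is an assumption -- check it in the add-on's URL)
ha addons logs core_silabs_multiprotocol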

I am closing this issue, since it is about the 100% CPU issue, which is expected to be resolved by #3169.

itpeters commented 1 year ago

@agners opened #3189 with what info I have, as I'm able to reproduce the disappearing border router constantly.

agners commented 1 year ago

I've just had the 100% CPU case in the pure OpenThread Border Router add-on. It seems that in this case the mdnsd is really out of order:

otbr-agent[15369]: [ERR ]-MDNS----: Failed to publish service Home Assistant OpenThread Border Router #5129._meshcop._udp for mdnssd error: Timeout!
otbr-agent[15369]: [WARN]-BA------: Result of publish meshcop service Home Assistant OpenThread Border Router #5129._meshcop._udp.local: MDNS error
MattWestb commented 1 year ago

Is the add-on using its own mDNS, or is it using the OS one? If I remember right, htop was showing only the mDNS thread/process using 100% CPU. Or is it a mismatch between the add-on's version and the OS version that is doing it?

agners commented 1 year ago

The add-on brings its own mDNS support; it uses the mDNSResponder from Apple.

If htop was showing mdnsd, then that was the mDNSResponder.

Home Assistant OS itself has systemd-resolved, which performs mDNS and DNS-SD operations, but we don't really make use of that.
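
If you want to see what systemd-resolved is doing with mDNS on the host, resolvectl can show it; a sketch, assuming resolvectl is available on the OS shell:

# Full resolver status; the "MulticastDNS" lines show whether
# systemd-resolved has mDNS enabled globally and per link
resolvectl status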

Home Assistant Core brings it's own mDNS implementation provided by https://github.com/python-zeroconf/python-zeroconf.