home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
72.81k stars 30.51k forks source link

ZHA Repeated ConnectionRefused Error (Silabs Multiprotocol) #101432

Closed Californian closed 8 months ago

Californian commented 1 year ago

The problem

Upon restart, and maybe at other times, ZHA fails to start repeatedly. Sometimes, this causes HA to crash entirely, possibly even with a reboot. This issue started when I started using silabs-multiprotocol and, specifically, started using the border router and a thread/matter device. I'm using the SkyConnect on a 1-2 m USB extension cord, far from any major RF interference.

After restarting a few times, it works sometimes. This used to be a reliable fix, but upon upgrading to 2023.10 today, it seems to fail almost every time, and after a period of time.

What version of Home Assistant Core has the issue?

core-2023.10.0

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

ZHA

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2023-10-04 18:15:26.765 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to 'socket://core-silabs-multiprotocol:9999' (115200 baudrate)
2023-10-04 18:15:26.769 WARNING (MainThread) [bellows.zigbee.application] ControllerApplication reset unsuccessful: ConnectionRefusedError(111, "Connect call failed ('172.30.32.1', 9999)")
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 681, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 702, in _reset_controller
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 137, in connect
    await ezsp.connect(use_thread=self.config[CONF_USE_THREAD])
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 179, in connect
    self._gw = await bellows.uart.connect(self._config, self, use_thread=use_thread)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 414, in connect
    protocol, _ = await _connect(config, application)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 385, in _connect
    transport, protocol = await zigpy.serial.create_serial_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/serial.py", line 42, in create_serial_connection
    transport, protocol = await loop.create_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1085, in create_connection
    raise exceptions[0]
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1069, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 973, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 628, in sock_connect
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 668, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)

Additional information

No response

AndreGoncalvesWave commented 1 year ago

Atualizei o meu e ocorreu a mesma coisa, voltei o backup parou tudo por aqui, utilizo ZHA + /dev/serial/by-id/usb-ITEAD_SONOFF_Zigbee_3.0_USB_Dongle_Plus_V2_20220718181812-if00

Californian commented 1 year ago

I disabled thread, openthreadborderrouter, and matter and enabled them after boot and now it seems to work more reliably.

Californian commented 1 year ago

This is still an issue. Enabling matter, otbr, and thread reliably causes zha to fail; I cannot use any of my Matter devices with HA.

home-assistant[bot] commented 1 year ago

Hey there @dmulcahey, @adminiuga, @puddly, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `zha` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign zha` Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


zha documentation zha source (message by IssueLinks)

puddly commented 1 year ago

Can you describe what Thread and Zigbee devices you have and if there is any procedure you can do to trigger this crash?

Californian commented 1 year ago

I have ~100 zigbee devices, mostly hue bulbs, smart plugs, and battery-powered buttons (I'd estimate half of the devices are routers).

I have an Aqara P2 door sensor to test out Matter-over-thread.

I can reliably trigger this by:

I can reliably resolve it after triggering by:

puddly commented 1 year ago

Thanks.

When you restart the Pi, do you completely power cycle it (i.e. shut off power)? Or just a software reboot?

Californian commented 1 year ago

Thanks.

When you restart the Pi, do you completely power cycle it (i.e. shut off power)? Or just a software reboot?

The last couple times I've just been restarting it from the "dev tools" -> "restart" -> "reboot system" UI, not a full power cycle (though I've power cycled in the past when some issue, maybe this one, prevented startup or shutdown).

jtbandes commented 11 months ago

I am experiencing the same issue with:

I had successfully configured and used ZHA with the multiprotocol add-on and it was talking to my devices. Then it randomly stopped working. At first I noticed that it was not even recognizing the dongle in the list of available serial devices according to ZHA — a full system reboot fixed that. Then both OTBR and ZHA would fail to start. After trying to reconfigure and restart things a couple of times, OTBR seems happy but the ZHA integration was still showing an error (it unhelpfully told me to check the logs, but did not say where / which logs to check).

Going deeper down the rabbit hole, I tried deleting the ZHA integration and reconfiguring it. Now I see a generic "Unknown error occurred" message:

image

The raw HTTP response is not helpful either:

500 Internal Server Error

Server got itself in trouble

The silabs multiprotocol add-on itself doesn't look too unhappy in its logs:

...
Listening on port 9999 for connection...
Accepting connection.
Accepted connection 7.
...

However in home-assistant.log I see:

2023-11-08 23:52:10.785 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to 'socket://core-silabs-multiprotocol:9999' (115200 baudrate)
2023-11-08 23:52:10.787 ERROR (MainThread) [aiohttp.server] Error handling request
...long traceback...
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)

Inside the "Terminal & SSH" terminal I am able to ping 172.30.32.1 but cannot connect on port 9999:

[core-ssh ~]$ nslookup core-silabs-multiprotocol
Server:  127.0.0.11
Address:  127.0.0.11#53
Name: core-silabs-multiprotocol
Address: 172.30.32.1
[core-ssh ~]$ busybox-extras telnet 172.30.32.1 9999
telnet: can't connect to remote host (172.30.32.1): Connection refused
[core-ssh ~]$ ping 172.30.32.1
PING 172.30.32.1 (172.30.32.1): 56 data bytes
64 bytes from 172.30.32.1: seq=0 ttl=64 time=0.041 ms
64 bytes from 172.30.32.1: seq=1 ttl=64 time=0.059 ms
^С
--- 172.30.32.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.041/0.050/0.059 ms

More complete logs from this point in time here


After rebooting again, I got past the "Unknown error occurred" and was able to add the integration again. Sadly I now have to rename my devices that have reset to default names, but at least it is working again.

Strangely I now see a "discovered" integration option alongside the one I already set up (I didn't see this before):

image

I do see some odd things in the multiprotocol add-on logs, but I'm not sure if any of these are actually problematic:

s6-rc: info: service legacy-services successfully started
Listening on port 9999 for connection...
Accepting connection.
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
otbr-agent[299]: 00:00:25.497 [N] Mle-----------: RLOC16 3c00 -> fffe
otbr-agent[299]: 00:00:25.499 [W] Platform------: [netif] Failed to process request#5: Unknown error -95
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
otbr-agent[299]: 00:00:26.080 [N] Mle-----------: Attach attempt 1, AnyPartition reattaching with Active Dataset
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
otbr-agent[299]: 00:00:32.580 [N] RouterTable---: Allocate router id 15
otbr-agent[299]: 00:00:32.580 [N] Mle-----------: RLOC16 fffe -> 3c00
otbr-agent[299]: 00:00:32.582 [N] Mle-----------: Role detached -> leader
otbr-agent[299]: 00:00:32.583 [N] Mle-----------: Partition ID 0x46464847
otbr-agent[299]: 00:00:32.595 [W] Platform------: [netif] Failed to process request#6: Unknown error -17
otbr-agent[299]: [NOTE]-BBA-----: BackboneAgent: Backbone Router becomes Primary!
Accepted connection 7.
Socket connection has been closed, restarting...
Listening on port 9999 for connection...
Accepting connection.
Accepted connection 7.
Restarting
[00:09:35:788202] Info : Endpoint socket #12: Client disconnected. 1 connections
[00:09:35:788223] Info : Client disconnected
[00:09:36:788913] Info : New client connection using library v4.3.1.0
[00:09:36:791482] Info : Endpoint socket #12: Client connected. 2 connections
Reusing socket from previous instance.
Socket connection has been closed, restarting...
Listening on port 9999 for connection...
Accepting connection.
Accepted connection 8.
Restarting
[00:09:46:931643] Info : Endpoint socket #12: Client disconnected. 1 connections
[00:09:46:931708] Info : Client disconnected
[00:09:47:933246] Info : New client connection using library v4.3.1.0
[00:09:47:935584] Info : Endpoint socket #12: Client connected. 2 connections
Reusing socket from previous instance.

Thanks for coming to my TED talk. I will report back if it breaks again...

Other things I may try next time:

Based on what I'm reading elsewhere, it sounds like the multiprotocol stuff may just not be working all that well and perhaps I should go back to the single protocol firmware if I want better reliability.

Californian commented 11 months ago

I am experiencing the same issue with:

  • Sonoff Zigbee Dongle E flashed with multiprotocol firmware
  • 3 Zigbee devices (2 philips hue bulbs + 1 inovelli blue switch)
  • OTBR and Matter/Thread are configured, but not used (I previously tried to set up a nanoleaf matter bulb and was unsuccessful, and gave up)

I had successfully configured and used ZHA with the multiprotocol add-on and it was talking to my devices. Then it randomly stopped working. At first I noticed that it was not even recognizing the dongle in the list of available serial devices according to ZHA — a full system reboot fixed that. Then both OTBR and ZHA would fail to start. After trying to reconfigure and restart things a couple of times, OTBR seems happy but the ZHA integration was still showing an error (it unhelpfully told me to check the logs, but did not say where / which logs to check).

Going deeper down the rabbit hole, I tried deleting the ZHA integration and reconfiguring it. Now I see a generic "Unknown error occurred" message:

image

The raw HTTP response is not helpful either:

500 Internal Server Error

Server got itself in trouble

The silabs multiprotocol add-on itself doesn't look too unhappy in its logs:

...
Listening on port 9999 for connection...
Accepting connection.
Accepted connection 7.
...

However in home-assistant.log I see:

2023-11-08 23:52:10.785 DEBUG (MainThread) [zigpy.serial] Opening a serial connection to 'socket://core-silabs-multiprotocol:9999' (115200 baudrate)
2023-11-08 23:52:10.787 ERROR (MainThread) [aiohttp.server] Error handling request
...long traceback...
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)

Inside the "Terminal & SSH" terminal I am able to ping 172.30.32.1 but cannot connect on port 9999:

[core-ssh ~]$ nslookup core-silabs-multiprotocol
Server:  127.0.0.11
Address:  127.0.0.11#53
Name: core-silabs-multiprotocol
Address: 172.30.32.1
[core-ssh ~]$ busybox-extras telnet 172.30.32.1 9999
telnet: can't connect to remote host (172.30.32.1): Connection refused
[core-ssh ~]$ ping 172.30.32.1
PING 172.30.32.1 (172.30.32.1): 56 data bytes
64 bytes from 172.30.32.1: seq=0 ttl=64 time=0.041 ms
64 bytes from 172.30.32.1: seq=1 ttl=64 time=0.059 ms
^С
--- 172.30.32.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.041/0.050/0.059 ms

More complete logs from this point in time here

After rebooting again, I got past the "Unknown error occurred" and was able to add the integration again. Sadly I now have to rename my devices that have reset to default names, but at least it is working again.

Strangely I now see a "discovered" integration option alongside the one I already set up (I didn't see this before):

image

I do see some odd things in the multiprotocol add-on logs, but I'm not sure if any of these are actually problematic:

s6-rc: info: service legacy-services successfully started
Listening on port 9999 for connection...
Accepting connection.
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
otbr-agent[299]: 00:00:25.497 [N] Mle-----------: RLOC16 3c00 -> fffe
otbr-agent[299]: 00:00:25.499 [W] Platform------: [netif] Failed to process request#5: Unknown error -95
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
otbr-agent[299]: 00:00:26.080 [N] Mle-----------: Attach attempt 1, AnyPartition reattaching with Active Dataset
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
Default: mDNSPlatformSendUDP got error 99 (Cannot assign requested address) sending packet to ff02::fb on interface fe80::804a:bfff:fe2a:cfdc/veth2790237/25
otbr-agent[299]: 00:00:32.580 [N] RouterTable---: Allocate router id 15
otbr-agent[299]: 00:00:32.580 [N] Mle-----------: RLOC16 fffe -> 3c00
otbr-agent[299]: 00:00:32.582 [N] Mle-----------: Role detached -> leader
otbr-agent[299]: 00:00:32.583 [N] Mle-----------: Partition ID 0x46464847
otbr-agent[299]: 00:00:32.595 [W] Platform------: [netif] Failed to process request#6: Unknown error -17
otbr-agent[299]: [NOTE]-BBA-----: BackboneAgent: Backbone Router becomes Primary!
Accepted connection 7.
Socket connection has been closed, restarting...
Listening on port 9999 for connection...
Accepting connection.
Accepted connection 7.
Restarting
[00:09:35:788202] Info : Endpoint socket #12: Client disconnected. 1 connections
[00:09:35:788223] Info : Client disconnected
[00:09:36:788913] Info : New client connection using library v4.3.1.0
[00:09:36:791482] Info : Endpoint socket #12: Client connected. 2 connections
Reusing socket from previous instance.
Socket connection has been closed, restarting...
Listening on port 9999 for connection...
Accepting connection.
Accepted connection 8.
Restarting
[00:09:46:931643] Info : Endpoint socket #12: Client disconnected. 1 connections
[00:09:46:931708] Info : Client disconnected
[00:09:47:933246] Info : New client connection using library v4.3.1.0
[00:09:47:935584] Info : Endpoint socket #12: Client connected. 2 connections
Reusing socket from previous instance.

Thanks for coming to my TED talk. I will report back if it breaks again...

Other things I may try next time:

  • disabling OTBR since I'm not using it
  • if that doesn't work, re-flashing multiprotocol firmware on the dongle
  • if that doesn't work, re-flashing back to the single protocol zigbee firmware :(

Based on what I'm reading elsewhere, it sounds like the multiprotocol stuff may just not be working all that well and perhaps I should go back to the single protocol firmware if I want better reliability.

I disabled just OTBR and now it's working for me. Obviously that's not a real solution long-term.

issue-triage-workflows[bot] commented 8 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.