home-assistant / addons

:heavy_plus_sign: Docker add-ons for Home Assistant
https://home-assistant.io/hassio/
Apache License 2.0
1.52k stars 1.47k forks source link

OTBR Addon Exits with Code 5 #3676

Closed scyto closed 3 months ago

scyto commented 3 months ago

Describe the issue you are experiencing

The OTBR addon no longer stays running as the otbr-agent exits with the following logged in the add-on log. This was working previously in add on version 2.8.0

s6-rc: info: service banner successfully started
s6-rc: info: service universal-silabs-flasher: starting
[20:40:15] INFO: Flashing firmware is disabled
s6-rc: info: service universal-silabs-flasher successfully started
s6-rc: info: service socat-otbr-tcp successfully started
s6-rc: info: service otbr-agent: starting
[20:40:15] INFO: Starting otbr-agent...
[NOTE]-AGENT---: Running 0.3.0-41474ce-dirty
[NOTE]-AGENT---: Thread version: 1.3.0
[NOTE]-AGENT---: Thread interface: wpan0
[NOTE]-AGENT---: Radio URL: spinel+hdlc+uart:///tmp/ttyOTBR?uart-baudrate=460800&uart-flow-control
[NOTE]-AGENT---: Radio URL: trel://enp0s18
[NOTE]-ILS-----: Infra link selected: enp0s18
[INFO]-NCP-----: OpenThread log level changed to 5
57d.20:59:38.622 [C] Platform------: Init() at hdlc_interface.cpp:153: No such file or directory
[20:40:15] WARNING: otbr-agent exited with code 5 (by signal 0).

What type of installation are you running?

Home Assistant OS

Which operating system are you running on?

Home Assistant Operating System

Which add-on are you reporting an issue with?

OpenThread Border Router

What is the version of the add-on?

2.8.0

Steps to reproduce the issue

  1. install 2.6.0 addon, it worked
  2. do some updates
  3. notice that 2.8.0 is not working as expected
  4. this seems to be a regression ...

System Health information

System Information

version core-2024.7.0b6
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.12.4
os_name Linux
os_version 6.6.33-haos
arch x86_64
timezone America/Los_Angeles
config_dir /config
Home Assistant Community Store GitHub API | ok -- | -- GitHub Content | ok GitHub Web | ok GitHub API Calls Remaining | 5000 Installed Version | 1.34.0 Stage | running Available Repositories | 1459 Downloaded Repositories | 42 HACS Data | ok
AccuWeather can_reach_server | ok -- | -- remaining_requests | 31
Home Assistant Cloud logged_in | false -- | -- can_reach_cert_server | ok can_reach_cloud_auth | ok can_reach_cloud | ok
Home Assistant Supervisor host_os | Home Assistant OS 12.4 -- | -- update_channel | beta supervisor_version | supervisor-2024.06.2 agent_version | 1.6.0 docker_version | 26.1.4 disk_total | 125.3 GB disk_used | 20.9 GB healthy | true supported | true host_connectivity | true supervisor_connectivity | true ntp_synchronized | true virtualization | kvm board | ova supervisor_api | ok version_api | ok installed_addons | Studio Code Server (5.15.0), Advanced SSH & Web Terminal (18.0.0), TasmoAdmin (0.30.3), Glances (0.21.1), Network UPS Tools (dev), Spotify Connect (0.13.0), AirCast (4.2.1), Frigate (Full Access) Beta (0.14.0-beta3), ESPHome (2024.6.4), Samba share (12.3.1), Samba Backup (5.2.0), Network UPS Tools (0.13.0), Let's Encrypt (5.0.22), Log Viewer (0.17.0), Matter Server (6.1.2), OpenThread Border Router (2.8.0), SQLite Web (4.1.2)
Universal Devices ISY/IoX host_reachable | ok -- | -- device_connected | true last_heartbeat | June 30, 2024 at 8:44 PM websocket_status | connected
Dashboards dashboards | 3 -- | -- resources | 26 views | 15 mode | storage
Recorder oldest_recorder_run | January 24, 2021 at 11:46 PM -- | -- current_recorder_run | June 30, 2024 at 6:57 PM estimated_db_size | 6.02 MiB database_engine | sqlite database_version | 3.45.3

Anything in the Supervisor logs that might be useful for us?

2024-06-30 20:39:29.529 INFO (SyncWorker_2) [supervisor.docker.manager] Cleaning addon_core_openthread_border_router application
2024-06-30 20:39:30.495 INFO (MainThread) [supervisor.docker.addon] Starting Docker add-on homeassistant/amd64-addon-otbr with version 2.8.0
2024-06-30 20:39:31.669 INFO (MainThread) [supervisor.api.middleware.security] /network/info access from core_openthread_border_router
2024-06-30 20:39:35.421 ERROR (MainThread) [asyncio] Task exception was never retrieved
future: <Task finished name='Task-304866' coro=<Addon.watchdog_container() done, defined at /usr/src/supervisor/supervisor/addons/addon.py:1429> exception=AddonsJobError('Rate limit exceeded, more than 10 calls in 0:30:00')>
Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/addons/addon.py", line 1443, in watchdog_container
    await self._restart_after_problem(event.state)
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 290, in wrapper
    raise on_condition(
supervisor.exceptions.AddonsJobError: Rate limit exceeded, more than 10 calls in 0:30:00
2024-06-30 20:40:13.046 INFO (SyncWorker_0) [supervisor.docker.manager] Cleaning addon_core_openthread_border_router application
2024-06-30 20:40:13.869 INFO (MainThread) [supervisor.docker.addon] Starting Docker add-on homeassistant/amd64-addon-otbr with version 2.8.0
2024-06-30 20:40:15.061 INFO (MainThread) [supervisor.api.middleware.security] /network/info access from core_openthread_border_router
2024-06-30 20:40:18.774 ERROR (MainThread) [asyncio] Task exception was never retrieved
future: <Task finished name='Task-305081' coro=<Addon.watchdog_container() done, defined at /usr/src/supervisor/supervisor/addons/addon.py:1429> exception=AddonsJobError('Rate limit exceeded, more than 10 calls in 0:30:00')>
Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/addons/addon.py", line 1443, in watchdog_container
    await self._restart_after_problem(event.state)
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 290, in wrapper
    raise on_condition(
supervisor.exceptions.AddonsJobError: Rate limit exceeded, more than 10 calls in 0:30:00
2024-06-30 20:40:33.101 ERROR (SyncWorker_6) [supervisor.docker.manager] Container addon_core_openthread_border_router is not running
2024-06-30 20:40:35.748 ERROR (SyncWorker_4) [supervisor.docker.manager] Container addon_core_openthread_border_router is not running

Anything in the add-on logs that might be useful for us?

-----------------------------------------------------------
 Add-on: OpenThread Border Router
 OpenThread Border Router add-on
-----------------------------------------------------------
 Add-on version: 2.8.0
 You are running the latest version of this add-on.
 System: Home Assistant OS 12.4  (amd64 / qemux86-64)
 Home Assistant Core: 2024.7.0b6
 Home Assistant Supervisor: 2024.06.2
-----------------------------------------------------------
 Please, share the above information when looking for help
 or support in, e.g., GitHub, forums or the Discord chat.
-----------------------------------------------------------
s6-rc: info: service banner successfully started
s6-rc: info: service universal-silabs-flasher: starting
[20:40:15] INFO: Flashing firmware is disabled
s6-rc: info: service universal-silabs-flasher successfully started
s6-rc: info: service socat-otbr-tcp successfully started
s6-rc: info: service otbr-agent: starting
[20:40:15] INFO: Starting otbr-agent...
[NOTE]-AGENT---: Running 0.3.0-41474ce-dirty
[NOTE]-AGENT---: Thread version: 1.3.0
[NOTE]-AGENT---: Thread interface: wpan0
[NOTE]-AGENT---: Radio URL: spinel+hdlc+uart:///tmp/ttyOTBR?uart-baudrate=460800&uart-flow-control
[NOTE]-AGENT---: Radio URL: trel://enp0s18
[NOTE]-ILS-----: Infra link selected: enp0s18
[INFO]-NCP-----: OpenThread log level changed to 5
57d.20:59:38.622 [C] Platform------: Init() at hdlc_interface.cpp:153: No such file or directory
[20:40:15] WARNING: otbr-agent exited with code 5 (by signal 0).
[20:40:15] INFO: OTBR firewall teardown completed.
s6-svlisten1: fatal: /run/s6-rc/servicedirs/otbr-agent failed permanently or its supervisor died
s6-rc: warning: unable to start service otbr-agent: command exited 1
s6-rc: info: service legacy-cont-init: stopping
/run/s6/basedir/scripts/rc.init: warning: s6-rc failed to properly bring all the services up! Check your logs (in /run/uncaught-logs/current if you have in-container logging) for more information.
/run/s6/basedir/scripts/rc.init: fatal: stopping the container.
s6-rc: info: service universal-silabs-flasher: stopping
s6-rc: info: service socat-otbr-tcp: stopping
s6-rc: info: service mdns: stopping
Default: mDNSResponder (Engineering Build) (Jun 25 2024 11:10:05) stopping
s6-rc: info: service universal-silabs-flasher successfully stopped
s6-rc: info: service banner: stopping
[20:40:15] INFO: mDNS ended with exit code 4 (signal 0)...
s6-rc: info: service socat-otbr-tcp successfully stopped
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service banner successfully stopped
s6-rc: info: service mdns successfully stopped
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped

Additional information

I checked my network RCP is still listening

Only thing in logs that looks suspicious is this, no idea if it matters?

57d.20:59:38.622 [C] Platform------: Init() at hdlc_interface.cpp:153: No such file or directory
darkxst commented 3 months ago

[NOTE]-AGENT---: Radio URL: spinel+hdlc+uart:///tmp/ttyOTBR?uart-baudrate=460800&uart-flow-control

I can't reproduce this here, what network RCP device are you using? If its SLZB-06 flow control should be disabled.

This is working for me:

[NOTE]-AGENT---: Running 0.3.0-41474ce-dirty
[NOTE]-AGENT---: Thread version: 1.3.0
[NOTE]-AGENT---: Thread interface: wpan0
[NOTE]-AGENT---: Radio URL: spinel+hdlc+uart:///tmp/ttyOTBR?uart-baudrate=460800

57d.20:59:38.622 [C] Platform------: Init() at hdlc_interface.cpp:153: No such file or directory

It would appear that it failed to parse the URL here, and hasnt actually tried opening the port.

scyto commented 3 months ago

what network RCP device are you using

I am using the skyconnect (blue usb key) with the dedicated RCP firmware not the multiprotocol, i needed the flow-control to get it working and found that in a different github issue as a fix to a diff issue - i can try it without it.

It would appear that it failed to parse the URL here

if thats the case can you see anything about the URL is wrong (only diff i see is the flow control)

scyto commented 3 months ago

I disabled flow control and this is the latest logs, seem to be no change.

s6-rc: info: service socat-otbr-tcp successfully started
[10:42:48] INFO: Flashing firmware is disabled
s6-rc: info: service universal-silabs-flasher successfully started
s6-rc: info: service otbr-agent: starting
[10:42:48] INFO: Starting otbr-agent...
[NOTE]-AGENT---: Running 0.3.0-41474ce-dirty
[NOTE]-AGENT---: Thread version: 1.3.0
[NOTE]-AGENT---: Thread interface: wpan0
[NOTE]-AGENT---: Radio URL: spinel+hdlc+uart:///tmp/ttyOTBR?uart-baudrate=460800
[NOTE]-AGENT---: Radio URL: trel://enp0s18
[NOTE]-ILS-----: Infra link selected: enp0s18
[INFO]-NCP-----: OpenThread log level changed to 5
58d.11:02:11.262 [C] Platform------: Init() at hdlc_interface.cpp:153: No such file or directory
[10:42:48] WARNING: otbr-agent exited with code 5 (by signal 0).
[10:42:48] INFO: OTBR firewall teardown completed.
s6-svlisten1: fatal: /run/s6-rc/servicedirs/otbr-agent failed permanently or its supervisor died
s6-rc: warning: unable to start service otbr-agent: command exited 1
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service universal-silabs-flasher: stopping
s6-rc: info: service socat-otbr-tcp: stopping
s6-rc: info: service mdns: stopping
/run/s6/basedir/scripts/rc.init: warning: s6-rc failed to properly bring all the services up! Check your logs (in /run/uncaught-logs/current if you have in-container logging) for more information.
/run/s6/basedir/scripts/rc.init: fatal: stopping the container.
s6-rc: info: service socat-otbr-tcp successfully stopped
Default: mDNSResponder (Engineering Build) (Jun 25 2024 11:10:05) stopping
[10:42:48] INFO: mDNS ended with exit code 4 (signal 0)...
s6-rc: info: service universal-silabs-flasher successfully stopped
s6-rc: info: service banner: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service mdns successfully stopped
s6-rc: info: service banner successfully stopped
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
[10:42:53] INFO: The otbr-web is disabled.
[10:42:53] INFO: Enabled socat-otbr-tcp.
s6-rc: info: service socat-otbr-tcp: starting
s6-rc: info: service mdns: starting
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service mdns successfully started
[10:42:53] INFO: Starting mDNS Responder...
Default: mDNSResponder (Engineering Build) (Jun 25 2024 11:10:05) starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service banner: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
s6-rc: info: service legacy-cont-init successfully started
[10:42:53] INFO: Starting socat TCP client for OTBR daemon...

-----------------------------------------------------------
 Add-on: OpenThread Border Router
 OpenThread Border Router add-on
-----------------------------------------------------------
 Add-on version: 2.8.0
 You are running the latest version of this add-on.
 System: Home Assistant OS 12.4  (amd64 / qemux86-64)
 Home Assistant Core: 2024.7.0b6
 Home Assistant Supervisor: 2024.06.2
-----------------------------------------------------------
 Please, share the above information when looking for help
 or support in, e.g., GitHub, forums or the Discord chat.
-----------------------------------------------------------
s6-rc: info: service banner successfully started
s6-rc: info: service universal-silabs-flasher: starting
s6-rc: info: service socat-otbr-tcp successfully started
[10:42:54] INFO: Flashing firmware is disabled
s6-rc: info: service universal-silabs-flasher successfully started
s6-rc: info: service otbr-agent: starting
[10:42:54] INFO: Starting otbr-agent...
[NOTE]-AGENT---: Running 0.3.0-41474ce-dirty
[NOTE]-AGENT---: Thread version: 1.3.0
[NOTE]-AGENT---: Thread interface: wpan0
[NOTE]-AGENT---: Radio URL: spinel+hdlc+uart:///tmp/ttyOTBR?uart-baudrate=460800
[NOTE]-AGENT---: Radio URL: trel://enp0s18
[NOTE]-ILS-----: Infra link selected: enp0s18
[INFO]-NCP-----: OpenThread log level changed to 5
58d.11:02:17.117 [C] Platform------: Init() at hdlc_interface.cpp:153: No such file or directory
[10:42:54] WARNING: otbr-agent exited with code 5 (by signal 0).
[10:42:54] INFO: OTBR firewall teardown completed.
s6-svlisten1: fatal: /run/s6-rc/servicedirs/otbr-agent failed permanently or its supervisor died
s6-rc: warning: unable to start service otbr-agent: command exited 1
/run/s6/basedir/scripts/rc.init: warning: s6-rc failed to properly bring all the services up! Check your logs (in /run/uncaught-logs/current if you have in-container logging) for more information.
/run/s6/basedir/scripts/rc.init: fatal: stopping the container.
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service universal-silabs-flasher: stopping
s6-rc: info: service socat-otbr-tcp: stopping
s6-rc: info: service mdns: stopping
s6-rc: info: service universal-silabs-flasher successfully stopped
s6-rc: info: service banner: stopping
Default: mDNSResponder (Engineering Build) (Jun 25 2024 11:10:05) stopping
[10:42:54] INFO: mDNS ended with exit code 4 (signal 0)...
s6-rc: info: service socat-otbr-tcp successfully stopped
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service mdns successfully stopped
s6-rc: info: service banner successfully stopped
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
scyto commented 3 months ago

this is my yaml for the add-on fwiw

device: /dev/ttyS3
baudrate: "460800"
flow_control: true
autoflash_firmware: false
otbr_log_level: debug
firewall: false
nat64: false
network_device: 192.168.1.176:8000
scyto commented 3 months ago

I tried uninstalling the add-on and re-installing it - no change

scyto commented 3 months ago

I deleted the ser2net container that is used for the RCP. When recreated everything worked as expected.

  1. this means the issue was definitely something to do with connecting to ser2net in the running container on the remote pi.
  2. i assume this is likely a longhaul bug in ser2net (and shows the expected weakness to this approach as per the docs for network based RCPs)
  3. this is not a homeassistant issue - though maybe good if the otbr-addon container can test the RCP is exposed in a functional way and give a meaningful error to stop others like me making bad assumption it was a regression....

note that the ss command indicated the pi running the container was happily listening on the required port, there were no errors in the ser2net containers logs, so might have been issue between ser2net and the RCP - i will dig through logs and see what i can find... if it happens again i will work backwards from the RCP and troubleshoot.