openthread / ot-br-posix

OpenThread Border Router, a Thread border router for POSIX-based platforms.
https://openthread.io/
BSD 3-Clause "New" or "Revised" License
419 stars 231 forks source link

Race Condition in otbr-web Start During Boot #2388

Open rwkeane opened 3 months ago

rwkeane commented 3 months ago

Describe the bug A clear and concise description of what the bug is.

I have OTBR currently running on an RPI device (both 3 and 4 repro this issue). I am able to successfully start otbr-agent and it runs fine (including when started automatically as part of the boot process), but otbr-web fails after boot with error [ERR ]-WEB-----: OpenThread daemon is not running.

Running systemctl stop otbr-web and then systemctl stop otbr-web to restart the service fixes the issue, and it runs without any errors.

To Reproduce Information to reproduce the behavior, including:

  1. Git commit id: commit 8e4d36bd3fb1da42630d1d0d884c71afeeb497d1
  2. IEEE 802.15.4 hardware platform: Home Assistant SkyConnect
  3. Build steps: Set up as per the code lab
  4. Network topology: No network currently configured

Expected behavior A clear and concise description of what you expected to happen.

The web service should correctly delay starting until the agent is running.

Console/log output If applicable, add console/log output to help explain your problem.

on boot:

otbr-web (via systemctl status otbr-web.service):

Jul 18 20:33:33 ha-satellite systemd[1]: Started otbr-web.service - Border Router Web.
Jul 18 20:33:33 ha-satellite otbr-web[1398]: otbr-web[1398]: [INFO]-WEB-----: Running 0.3.0-8e4d36b
Jul 18 20:33:33 ha-satellite otbr-web[1398]: otbr-web[1398]: [INFO]-WEB-----: Border router web started on wpan0
Jul 18 20:33:33 ha-satellite otbr-web[1398]: otbr-web[1398]: [ERR ]-WEB-----: OpenThread daemon is not running.
Jul 18 20:33:33 ha-satellite otbr-web[1398]: [INFO]-WEB-----: Running 0.3.0-8e4d36b
Jul 18 20:33:33 ha-satellite otbr-web[1398]: [INFO]-WEB-----: Border router web started on wpan0
Jul 18 20:33:33 ha-satellite otbr-web[1398]: [ERR ]-WEB-----: OpenThread daemon is not running.

otbr-agent (via systemctl status otbr-agent.service):

Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.151 [I] Platform------: Execute command `ipset swap otbr-ingress-deny-src-swap otbr-ingress-deny-src` = 0
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.154 [I] Platform------: Execute command `ipset swap otbr-ingress-allow-dst-swap otbr-ingress-allow-dst` = 0
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.154 [I] P-McastRtMgr--: Disable: OK
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: [INFO]-BA------: Publish meshcop service OpenThread BorderRouter #55D9._meshcop._udp.local.
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.154 [I] RouterTable---: Route table
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.157 [I] P-Netif-------: Host netif is down
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: 00:00:00.158 [I] P-Netif-------: Succeeded to process request#1
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: [INFO]-MDNS----: Successfully registered service OpenThread BorderRouter #55D9._meshcop._udp
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: [INFO]-BA------: Result of publish meshcop service OpenThread BorderRouter #55D9._meshcop._udp.local: OK
Jul 18 20:33:34 ha-satellite otbr-agent[1392]: [INFO]-BA------: Result of publish meshcop service OpenThread BorderRouter #55D9._meshcop._udp.local: OK

after restarting otbr-web with commands earlier (ignore the timestamps):

Jul 18 20:30:06 ha-satellite systemd[1]: Started otbr-web.service - Border Router Web.
Jul 18 20:30:06 ha-satellite otbr-web[5031]: otbr-web[5031]: [INFO]-WEB-----: Running 0.3.0-8e4d36b
Jul 18 20:30:06 ha-satellite otbr-web[5031]: otbr-web[5031]: [INFO]-WEB-----: Border router web started on wpan0
Jul 18 20:30:06 ha-satellite otbr-web[5031]: [INFO]-WEB-----: Running 0.3.0-8e4d36b
Jul 18 20:30:06 ha-satellite otbr-web[5031]: [INFO]-WEB-----: Border router web started on wpan0

Additional context Add any other context about the problem here.

I've got 3 OTBR devices (RPI3, RPI4, mini computer w/ home assistant) as well as an official Thread Border Router (Nest device) running on my wifi network (though no thread network as of yet). It's possible they are all fighting over the same mDNS name, and the probe -> retry -> repeat... name resolution process is taking a long time

superwhd commented 3 months ago

I guess it's probably because otbr-web didn't wait for the otbr-agent to be ready, which it should have done.

Running systemctl stop otbr-web and then systemctl stop otbr-web to restart the service fixes the issue, and it runs without any errors.

I'm curious why you needed to stop it twice (or is it a typo? πŸ˜ƒ ).

It's possible they are all fighting over the same mDNS name, and the probe -> retry -> repeat... name resolution process is taking a long time

Do you have logs supporting this? As far as I know this should rarely happen because the mDNS service instance name of a device ends with last few digits of its extended MAC address.

rwkeane commented 3 months ago

I'm curious why you needed to stop it twice (or is it a typo? πŸ˜ƒ )

That's just a typo :)

Do you have logs supporting this?

It was only my guess based on Result of publish meshcop service OpenThread BorderRouter #55D9._meshcop._udp.local: OK not occurring until after the otbr-web had already launched (and failed)

superwhd commented 3 months ago

The error [ERR ]-WEB-----: OpenThread daemon is not running should be transient and will be automatically resolved when otbr-agent is launched. I just tried launching otbr-agent after otbr-web and it's working fine.