tiny-pilot / tinypilot

Use your Raspberry Pi as a browser-based KVM.
https://tinypilotkvm.com
MIT License
2.88k stars 243 forks source link

Janus blocked from restarting if STUN server is unavailable #1648

Open jotaen4tinypilot opened 9 months ago

jotaen4tinypilot commented 9 months ago

We already discovered the (rather unfortunate) Janus startup behavior, where Janus would refuse to come up altogether if the STUN server is not available/reachable at startup time, or if the network server is not available yet.

Steps to reproduce

Systemd now refuses to restart Janus, since it’s flagged as failed. Only after a grace period of a few minutes, systemd accepts startup requests again.

So during failed state, you wouldn’t even be able to revoke the STUN config, or make any other config change to Janus.

More info

With this fix we implemented a workaround that focusses on the scenario with the network service.

There is another problem resulting from this, however: if the user specifies a STUN server that is either not present at all, or down for a longer time, then systemd will try to start Janus 20 times.

https://github.com/tiny-pilot/tinypilot/blob/0a03d2caf54aa083ef66c1606e28321f41aca781/debian-pkg/usr/share/tinypilot/janus.service#L5-L7

After 20 failures, systemd flags the Janus service as failed, so issuing systemctl restart janus or the like doesn’t have any effect.

You can see this in the logs below: at 16:15:11, it tries to start Janus for the 20th time, which fails at 16:15:17. After that, I periodically and manually try to restart the service via /usr/sbin/service janus restart (which is the command we normally use from our Python code), e.g. at 16:15:17, then at 16:16:09, then at 16:16:37, and so on. All those invocations fail right away, because the service is in failed state.

At 16:18:13, however, systemd starts to accept start requests again. Note that this is roughly 180 seconds / 3 minutes after the 20th automated (but failed) startup attempt.

The only way to bypass this blockage and recover the service is by manually issuing systemctl reset-failed janus. That immediately revokes the failed flag. This functionality isn’t available via /usr/sbin/service, however.

Either way, I’m not sure we have fully understood all the different start intervals and retry settings of systemd.

Oct 05 16:15:11 raspberrypi janus[4585]: ---------------------------------------------------
Oct 05 16:15:11 raspberrypi janus[4585]:   Starting Meetecho Janus (WebRTC Server) v1.0.1
Oct 05 16:15:11 raspberrypi janus[4585]: ---------------------------------------------------
Oct 05 16:15:11 raspberrypi janus[4585]: Checking command line arguments...
Oct 05 16:15:11 raspberrypi janus[4585]: Debug/log level is 4
Oct 05 16:15:11 raspberrypi janus[4585]: Debug/log timestamps are disabled
Oct 05 16:15:11 raspberrypi janus[4585]: Debug/log colors are disabled
Oct 05 16:15:11 raspberrypi janus[4585]: Using 192.168.2.2 as local IP...
Oct 05 16:15:11 raspberrypi janus[4585]: Token based authentication disabled
Oct 05 16:15:11 raspberrypi janus[4585]: Initializing recorder code
Oct 05 16:15:11 raspberrypi janus[4585]: Initializing ICE stuff (Full mode, ICE-TCP candidates disabled, half-trickle, IPv6 support disabled)
Oct 05 16:15:11 raspberrypi janus[4585]: STUN server to use: stun.ucsb.edu:3478
Oct 05 16:15:11 raspberrypi janus[4585]:   >> 128.111.24.49:3478 (IPv4)
Oct 05 16:15:11 raspberrypi janus[4585]: Testing STUN server: message is of 20 bytes
Oct 05 16:15:16 raspberrypi janus[4585]: [FATAL] [ice.c:janus_ice_test_stun_server:1078] No response to our STUN BINDING test
Oct 05 16:15:16 raspberrypi janus[4585]: [FATAL] [janus.c:main:4988] Invalid STUN address stun.ucsb.edu:3478
Oct 05 16:15:16 raspberrypi janus[4584]: Error launching Janus (error code 1), check the logs for more details
Oct 05 16:15:16 raspberrypi systemd[1]: janus.service: Control process exited, code=exited, status=1/FAILURE
Oct 05 16:15:16 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:15:16 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:15:17 raspberrypi systemd[1]: janus.service: Scheduled restart job, restart counter is at 20.
Oct 05 16:15:17 raspberrypi systemd[1]: Stopped Janus WebRTC gateway.
Oct 05 16:15:17 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:15:17 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:15:17 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:16:09 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:16:09 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:16:09 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:16:37 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:16:37 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:16:37 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:16:41 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:16:41 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:16:41 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:16:48 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:16:48 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:16:48 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:16:52 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:16:52 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:16:52 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:16:55 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:16:55 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:16:55 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:17:02 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:17:02 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:17:02 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:17:06 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:17:06 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:17:06 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:17:09 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:17:09 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:17:09 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:17:18 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:17:18 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:17:18 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:17:25 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:17:25 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:17:25 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:17:34 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:17:34 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:17:34 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:17:51 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:17:51 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:17:51 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:17:58 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:17:58 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:17:58 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:18:07 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Oct 05 16:18:07 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Oct 05 16:18:07 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Oct 05 16:18:13 raspberrypi systemd[1]: Starting Janus WebRTC gateway...
Oct 05 16:18:13 raspberrypi janus[4722]: Janus version: 1001 (1.0.1)
Oct 05 16:18:13 raspberrypi janus[4722]: Janus commit: not-a-git-repo
Oct 05 16:18:13 raspberrypi janus[4722]: Compiled on:  Mon May  2 06:19:49 UTC 2022
Oct 05 16:18:13 raspberrypi janus[4722]: Running Janus as a daemon
Oct 05 16:18:13 raspberrypi janus[4723]: Logger plugins folder: /usr/lib/arm-linux-gnueabihf/janus/loggers
Oct 05 16:18:13 raspberrypi janus[4723]: [WARN] Logger plugin 'libjanus_jsonlog.so' has been disabled, skipping...
Oct 05 16:18:13 raspberrypi janus[4723]: ---------------------------------------------------
Oct 05 16:18:13 raspberrypi janus[4723]:   Starting Meetecho Janus (WebRTC Server) v1.0.1
Oct 05 16:18:13 raspberrypi janus[4723]: ---------------------------------------------------
Oct 05 16:18:13 raspberrypi janus[4723]: Checking command line arguments...
Oct 05 16:18:13 raspberrypi janus[4723]: Debug/log level is 4
Oct 05 16:18:13 raspberrypi janus[4723]: Debug/log timestamps are disabled
Oct 05 16:18:13 raspberrypi janus[4723]: Debug/log colors are disabled
Oct 05 16:18:13 raspberrypi janus[4723]: Using 192.168.2.2 as local IP...
Oct 05 16:18:13 raspberrypi janus[4723]: Token based authentication disabled
Oct 05 16:18:13 raspberrypi janus[4723]: Initializing recorder code
jotaen4tinypilot commented 8 months ago

I just stumbled across this by accident: in the Janus config, there seems to be a setting ignore_unreachable_ice_server. This setting was added to Janus via https://github.com/meetecho/janus-gateway/pull/1854.

Unfortunately, that PR/fix wasn’t explicitly mentioned in the original issue, which had been closed with the statement that there is no plan to change the behaviour. The PR/fix – which was submitted only a few weeks after said comment – casually mentions the issue, but it didn’t mark the issue as “closed”/“fixed”, so it’s easy to miss. (I, for one, missed it, unfortunately.)

I haven’t tested that setting, but if it does what we want, we might even be able to abandon our service file workaround.