Open jotaen4tinypilot opened 9 months ago
I just stumbled across this by accident: in the Janus config, there seems to be a setting `ignore_unreachable_ice_server`. This setting was added to Janus via https://github.com/meetecho/janus-gateway/pull/1854.
Unfortunately, that PR/fix wasn’t explicitly mentioned in the original issue, which had been closed with the statement that there is no plan to change the behaviour. The PR/fix – which was submitted only a few weeks after said comment – casually mentions the issue, but it didn’t mark the issue as “closed”/“fixed”, so it’s easy to miss. (I, for one, missed it, unfortunately.)
I haven’t tested that setting, but if it does what we want, we might even be able to abandon our service file workaround.
We already discovered the (rather unfortunate) Janus startup behavior, where Janus refuses to come up altogether if the STUN server is not available/reachable at startup time, or if the network service is not available yet.
Steps to reproduce
- Restart Janus (`systemctl restart janus`)
- Follow the logs (`journalctl -f -u janus`) and watch systemd try to start Janus 20 times, until the service is flagged as `failed`
Systemd now refuses to restart Janus, since it’s flagged as failed. Only after a grace period of a few minutes, systemd accepts startup requests again.
So during `failed` state, you wouldn’t even be able to revoke the STUN config, or make any other config change to Janus.
More info
With this fix we implemented a workaround that focusses on the scenario with the network service.
There is another problem resulting from this, however: if the user specifies a STUN server that is either not present at all, or down for a longer time, then systemd will try to start Janus 20 times.
https://github.com/tiny-pilot/tinypilot/blob/0a03d2caf54aa083ef66c1606e28321f41aca781/debian-pkg/usr/share/tinypilot/janus.service#L5-L7
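To reason about the retry limit, here is a minimal sketch of a fixed-window start limiter in Python. It is a simplified model based on my understanding of how systemd's `StartLimitIntervalSec=` and `StartLimitBurst=` interact, not systemd's actual code. The class name `StartRateLimit` and the 300-second interval in the usage note are assumptions for illustration only; the real values are the ones in the service file linked above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartRateLimit:
    """Simplified fixed-window model of a start limit (hypothetical helper)."""

    interval_sec: float  # length of the window (assumed value, see usage note)
    burst: int           # number of start attempts allowed per window
    window_start: Optional[float] = None
    count: int = 0

    def allow(self, now: float) -> bool:
        """Return True if a start attempt at time `now` (seconds) is accepted."""
        # A new window begins on the first attempt, or once the old one expired.
        if self.window_start is None or now - self.window_start > self.interval_sec:
            self.window_start = now
            self.count = 1
            return True
        # Inside the window, attempts are accepted until the burst is used up.
        if self.count < self.burst:
            self.count += 1
            return True
        # Burst exhausted: refuse until the window has expired.
        return False
```

With `StartRateLimit(interval_sec=300, burst=20)`, the first 20 calls to `allow()` inside the window succeed, and every later call is refused until more than 300 seconds have passed since the first attempt. Under this model, the length of the blockage depends on how long the 20 attempts took, not only on the interval itself.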
After 20 failures, systemd flags the Janus service as `failed`, so issuing `systemctl restart janus` or the like doesn’t have any effect.
You can see this in the logs below: at `16:15:11`, it tries to start Janus for the 20th time, which fails at `16:15:17`. After that, I periodically and manually try to restart the service via `/usr/sbin/service janus restart` (which is the command we normally use from our Python code), e.g. at `16:15:17`, then at `16:16:09`, then at `16:16:37`, and so on. All those invocations fail right away, because the service is in `failed` state.
At `16:18:13`, however, systemd starts to accept start requests again. Note that this is roughly 180 seconds / 3 minutes after the 20th automated (but failed) startup attempt.
The only way to bypass this blockage and recover the service is by manually issuing `systemctl reset-failed janus`. That immediately revokes the `failed` flag. This functionality isn’t available via `/usr/sbin/service`, however.
Either way, I’m not sure we have fully understood all the different start intervals and retry settings of systemd.
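The recovery path described above (clear the `failed` flag, then restart) could be scripted from Python roughly as follows. This is only a sketch: the function names `run_command` and `recover_and_restart` are hypothetical and not taken from the TinyPilot code base, and it assumes the calling process has sufficient privileges to run `systemctl reset-failed`.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(args: Sequence[str]) -> int:
    """Run a command and return its exit code (hypothetical helper)."""
    return subprocess.run(list(args), check=False).returncode


def recover_and_restart(unit: str = "janus", run: Runner = run_command) -> bool:
    """Clear a `failed` flag if present, then restart the unit.

    Returns True if the restart command exited successfully.
    """
    # `systemctl is-failed` exits with 0 if the unit is in failed state.
    if run(["systemctl", "is-failed", "--quiet", unit]) == 0:
        # Revoke the failed flag so that start requests are accepted again.
        if run(["systemctl", "reset-failed", unit]) != 0:
            return False
    # Same restart command that the issue says is used from the Python code.
    return run(["/usr/sbin/service", unit, "restart"]) == 0
```

The runner is injectable so the logic can be exercised without touching systemd. Note that this only removes the blockage; Janus would still fail to start as long as the configured STUN server stays unreachable.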