Closed jotaen4tinypilot closed 1 year ago
@mtlynch would you consider this to be blocking in regards to https://github.com/tiny-pilot/tinypilot/pull/1579, or shall/can we move forward anyway? (I haven’t requested a review on that PR yet.)
On the one hand, I think we have to address this problem before releasing, on the other hand, the STUN logic is “internal” / WIP for the time being, so it’s not exposed to customers yet.
@jotaen4tinypilot - Nope, not blocking that PR, but blocking for adding support for STUN to the web UI.
Is there another way of fixing this without bringing back https://github.com/tiny-pilot/janus-debian ?
I'd been considering bringing that repo back anyway because I'm worried that adding a whole backports repo to apt-get might have unintended consequences.
> Is there another way of fixing this without bringing back https://github.com/tiny-pilot/janus-debian ?

I think that would be worth investigating. I haven’t looked into it yet, so e.g. I haven’t confirmed the `After=network-online.target` “workaround” from the Janus issue. Although that suggestion seems sensible, there might be other solutions that would be easier for us to achieve.
I made a test build of Janus with `After=network-online.target`:
https://p.tinypilotkvm.com/-Jdbait87RZ/janus_20230823134012-1_armhf.deb
https://github.com/tiny-pilot/janus-debian/pull/18
Also, with Docker Layer Caching enabled, the Debian build drops from 20m to 20s, so quite a speedup!
The systemd documentation suggests using both `After=network-online.target` and `Wants=network-online.target` in order to ensure network connectivity.
I just tried out both variants on device:
- `After=` alone didn’t do the trick for me
- `After=` and `Wants=` together seemed to work

So by adding the following code to `/lib/systemd/system/janus.service`, I could see that the Janus systemd service came up after booting.
After=network-online.target
Wants=network-online.target
One additional note: Using `network-online.target` might have unintended side-effects, since it can delay the boot procedure, or cause other glitches during startup. From reading that FAQ section, we might have to be careful about simply depending on `network-online.target` as a general setting for the Janus service. Maybe it would be safer to only use it when STUN is enabled? Otherwise, we should probably at least do some testing to ensure that this setting doesn’t affect TinyPilot in suboptimal network situations, e.g. when the DHCP service is laggy, or the network is otherwise unreliable.
Oh yeah, that seems like it could create some headaches.
> Maybe it would be safer to only use it when STUN is enabled?
We could, but having to dynamically rewrite and reload the systemd service definition every time we configure Janus seems like it's going to be a big pain.
Maybe the easier solution is to just give it more liberal restarts like we do with `load-tc358743`:
> We could, but having to dynamically rewrite and reload the systemd service definition every time we configure Janus seems like it's going to be a big pain.

Is it, though? We already rewrite the Janus config itself, which we render from our template, and we have to restart the Janus systemd service afterwards to effectuate the change. (See https://github.com/tiny-pilot/tinypilot/pull/1579.) So assuming that the Janus service file would also live within the tinypilot repo directly, it’s “just” another `render-template` invocation, right?
> Maybe the easier solution is to just give it more liberal restarts like we do with load-tc358743:
I’ve successfully tried with the following service config parameters: (Stripped irrelevant parameters for conciseness.)
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=20
After=network.target
[Service]
RestartSec=1
I.e., 20 retries in 300 seconds max, with a 1 second delay between retries.
The 1 second backoff seems crucial, otherwise the retries occur too rapidly, and it would hence give up “too early” when reaching the burst limit. We probably should consider an even higher burst limit, to provide more margin to account for the unpredictability of networking.
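To get a feel for how much time these limits actually buy, the retry window can be estimated with a bit of arithmetic. This is only a rough sketch: the function name and the third parameter (how long one failed Janus start takes before systemd schedules the next restart) are assumptions for illustration, not measured values.

```shell
#!/bin/sh
# Rough estimate of how long systemd keeps retrying before the start
# limit is reached. All names here are illustrative.
#   $1 = StartLimitBurst
#   $2 = RestartSec (whole seconds)
#   $3 = assumed duration of one failed start attempt, in seconds
retry_window_secs() {
  burst="$1"
  restart_sec="$2"
  fail_secs="$3"
  echo $(( burst * (restart_sec + fail_secs) ))
}

# Values from the config above, assuming a failed start takes about 1 second.
retry_window_secs 20 1 1   # prints 40
```

Under that assumption, the network would have to come up within roughly 40 seconds of the first attempt. If the estimate ever exceeded `StartLimitIntervalSec` (300 here), the burst limit would no longer be reached and systemd would keep retrying.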
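It looks like there are two options on the table:
- Depending on `network-online` (the `After=` and `Wants=` settings tried above)
- More liberal restarts (the `StartLimitBurst=` and `RestartSec=` settings tried above)
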
Of course we could also combine both approaches.
In any event, it looks like we have to take control of the Janus service file one way or the other. At least, no other way of solving this has occurred to me. To me, it’s still a bummer that Janus behaves that way, but of course there is no point in arguing about that, since we just have to deal with how it is.
> Is it, though? We already rewrite the Janus config itself, which we render from our template, and we have to restart the Janus systemd service afterwards to effectuate the change. (See https://github.com/tiny-pilot/tinypilot/pull/1579.) So assuming that the Janus service file would also live within the tinypilot repo directly, it’s “just” another render-template invocation, right?

Yeah, I guess "big pain" is an exaggeration. But it prevents us from using `dh_installsystemd` like we do for other services.
Let's punt this bug until 2.6.2. I want to cut 2.6.1 in the next couple of weeks, and the Janus stuff is starting a cascade of additional tasks that's going to push out the release date. We made some progress in that Janus can integrate STUN settings if the user edits their `settings.yml`, so we can offer it as a manual solution until we add in UI support.
Per our dev meeting today, we can probably adjust our systemd config without having to build our own Janus package again:
https://askubuntu.com/questions/659267/how-do-i-override-or-configure-systemd-services/659268#659268
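A drop-in override along the lines of that answer might look like the sketch below. The drop-in directory follows the standard systemd convention (`/etc/systemd/system/<unit>.d/`). The file name `override.conf`, the function name, and the choice to combine the restart limits from earlier in this thread with `network-online.target` are assumptions for illustration, not something we've settled on.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in for janus.service instead of rebuilding
# the Janus package. The optional root prefix argument exists only so the
# function can be tried out against a scratch directory.
write_janus_dropin() {
  root="${1:-}"
  dir="${root}/etc/systemd/system/janus.service.d"
  mkdir -p "${dir}"
  # Settings in a drop-in are merged with the unit file shipped by the package.
  cat > "${dir}/override.conf" <<'EOF'
[Unit]
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=300
StartLimitBurst=20

[Service]
RestartSec=1
EOF
}

# On a device (as root), writing the file would be followed by:
#   systemctl daemon-reload
#   systemctl restart janus
```

Since the drop-in is a plain file, it could probably be installed or removed depending on whether STUN is enabled, without touching the packaged service file.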
Related https://github.com/tiny-pilot/tinypilot/issues/1460.
When a STUN server is specified in the main Janus config, the Janus systemd service fails to come up right after booting the device. Janus only comes up successfully when the systemd service is started manually.
For enabling STUN, append the following block to `/etc/janus/janus.jcfg`:

You can also use any other public STUN server, such as `stun.l.google.com:19302`.

Investigation
The logs indicate that this is an issue related to Janus failing to connect to the STUN server. In the first log batch (see below), the logs say `[ERR] [ice.c:janus_ice_set_stun_server:1153] Could not resolve stun.gmx.de...`. After 5 failed attempts to start the Janus service, systemd gives up. (`janus.service: Scheduled restart job, restart counter is at 5.`)

After logging in via SSH to the device after it has booted, you can manually issue `systemctl restart janus`. As you see in the second log batch below, this will make Janus start successfully. Connecting to the STUN server now succeeds. (`>> 212.227.67.34:3478 (IPv4)`)

This issue in the Janus repo indicates that this is “simply” a network problem that occurs when we start Janus before the network is available on the device. So we might be able to fix this by adding `After=network-online.target` to the Janus service file. (Note, we don’t have our own Janus systemd file yet.)

Regardless of how we fix this, I think it’s important for us to realize that Janus will fail altogether if it’s not able to reach the desired STUN server for whatever reason. This will make the video stream fall back to MJPEG, which I think isn’t particularly graceful behaviour. Not sure whether we’d be able to mitigate this somehow, or whether we’d have to live with that.
Janus systemd logs right after device boot
Janus systemd logs after manual restart (a few moments later)