tiny-pilot / tinypilot

Use your Raspberry Pi as a browser-based KVM.
https://tinypilotkvm.com
MIT License
3.01k stars 252 forks source link

Janus+STUN fails to come up right after booting device #1578

Closed jotaen4tinypilot closed 1 year ago

jotaen4tinypilot commented 1 year ago

Related https://github.com/tiny-pilot/tinypilot/issues/1460.

When a STUN server is specified in the main Janus config, then the Janus systemd service fails to come up right after booting the device. Only when starting the Janus systemd service manually, Janus will come up successfully.

For enabling STUN, append the following block to /etc/janus/janus.jcfg:

nat: {
  stun_server = "stun.gmx.de"
  stun_port = 3478
}

You can also use any other public STUN server, such as stun.l.google.com:19302.

Investigation

The logs indicate that this is an issue related to Janus failing to connect to the STUN server. In the first log batch (see below), the logs say [ERR] [ice.c:janus_ice_set_stun_server:1153] Could not resolve stun.gmx.de.... After 5 failed attempts to start the Janus service, systemd gives up. (janus.service: Scheduled restart job, restart counter is at 5..)

After logging in via SSH to the device after it has booted, you can manually issue systemctl restart janus. As you see in the second log batch below, this will make Janus start successfully. Connecting to the STUN server now succeeds. (>> 212.227.67.34:3478 (IPv4)).

This issue in the Janus repo indicates that this is “simply” a network problem, that occurs when we start Janus before the network is available on the device. So we might be able to fix this by adding After=network-online.target to the Janus service file. (Note, we don’t have our own Janus systemd file yet.)

Regardless of how we fix this, I think it’s important for us to realize that Janus will fail altogether if it’s not able to reach the desired STUN server for whatever reason. This will make the video stream fall back to MJPEG, which I think isn’t particularly graceful behaviour. Not sure whether we’d be able to mitigate this somehow, or whether we’d have to live with that.

Janus systemd logs right after device boot

Aug 22 18:52:53 raspberrypi janus[727]: ---------------------------------------------------
Aug 22 18:52:53 raspberrypi janus[727]:   Starting Meetecho Janus (WebRTC Server) v1.0.1
Aug 22 18:52:53 raspberrypi janus[727]: ---------------------------------------------------
Aug 22 18:52:53 raspberrypi janus[727]: Checking command line arguments...
Aug 22 18:52:53 raspberrypi janus[727]: Debug/log level is 4
Aug 22 18:52:53 raspberrypi janus[727]: Debug/log timestamps are disabled
Aug 22 18:52:53 raspberrypi janus[727]: Debug/log colors are disabled
Aug 22 18:52:53 raspberrypi janus[727]: [WARN] Couldn't find any address! using 127.0.0.1 as the local IP... (which is NOT going to work out of your machi>
Aug 22 18:52:53 raspberrypi janus[727]: Using 127.0.0.1 as local IP...
Aug 22 18:52:53 raspberrypi janus[727]: Token based authentication disabled
Aug 22 18:52:53 raspberrypi janus[727]: Initializing recorder code
Aug 22 18:52:53 raspberrypi janus[727]: Initializing ICE stuff (Full mode, ICE-TCP candidates disabled, half-trickle, IPv6 support disabled)
Aug 22 18:52:53 raspberrypi janus[727]: STUN server to use: stun.gmx.de:3478
Aug 22 18:52:53 raspberrypi janus[727]: [ERR] [ice.c:janus_ice_set_stun_server:1153] Could not resolve stun.gmx.de...
Aug 22 18:52:53 raspberrypi janus[727]: [FATAL] [janus.c:main:4988] Invalid STUN address stun.gmx.de:3478
Aug 22 18:52:53 raspberrypi janus[726]: Error launching Janus (error code 1), check the logs for more details
Aug 22 18:52:53 raspberrypi systemd[1]: janus.service: Control process exited, code=exited, status=1/FAILURE
Aug 22 18:52:53 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Aug 22 18:52:53 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.
Aug 22 18:52:54 raspberrypi systemd[1]: janus.service: Scheduled restart job, restart counter is at 5.
Aug 22 18:52:54 raspberrypi systemd[1]: Stopped Janus WebRTC gateway.
Aug 22 18:52:54 raspberrypi systemd[1]: janus.service: Start request repeated too quickly.
Aug 22 18:52:54 raspberrypi systemd[1]: janus.service: Failed with result 'exit-code'.
Aug 22 18:52:54 raspberrypi systemd[1]: Failed to start Janus WebRTC gateway.

Janus systemd logs after manual restart (a few moments later)

Aug 22 18:53:07 raspberrypi janus[1006]: ---------------------------------------------------
Aug 22 18:53:07 raspberrypi janus[1006]:   Starting Meetecho Janus (WebRTC Server) v1.0.1
Aug 22 18:53:07 raspberrypi janus[1006]: ---------------------------------------------------
Aug 22 18:53:07 raspberrypi janus[1006]: Checking command line arguments...
Aug 22 18:53:07 raspberrypi janus[1006]: Debug/log level is 4
Aug 22 18:53:07 raspberrypi janus[1006]: Debug/log timestamps are disabled
Aug 22 18:53:07 raspberrypi janus[1006]: Debug/log colors are disabled
Aug 22 18:53:07 raspberrypi janus[1006]: Using 192.168.2.2 as local IP...
Aug 22 18:53:07 raspberrypi janus[1006]: Token based authentication disabled
Aug 22 18:53:07 raspberrypi janus[1006]: Initializing recorder code
Aug 22 18:53:07 raspberrypi janus[1006]: Initializing ICE stuff (Full mode, ICE-TCP candidates disabled, half-trickle, IPv6 support disabled)
Aug 22 18:53:07 raspberrypi janus[1006]: STUN server to use: stun.gmx.de:3478
Aug 22 18:53:07 raspberrypi janus[1006]:   >> 212.227.67.34:3478 (IPv4)
Aug 22 18:53:07 raspberrypi janus[1006]: Testing STUN server: message is of 20 bytes
Aug 22 18:53:07 raspberrypi janus[1006]:   >> Our public address is 89.247.174.119
Aug 22 18:53:07 raspberrypi janus[1006]: TURN REST API backend: (disabled)
Aug 22 18:53:07 raspberrypi janus[1006]: Crypto: OpenSSL >= 1.1.0
Aug 22 18:53:07 raspberrypi janus[1006]: No cert/key specified, autogenerating some...
Aug 22 18:53:07 raspberrypi janus[1006]: Fingerprint of our certificate: 01:6A:E3:E4:46:ED:3B:D3:C5:14:69:46:58:1C:37:35:B3:03:21:CE:13:86:A8:4B:55:1C:10:>
Aug 22 18:53:07 raspberrypi janus[1006]: Sessions watchdog started
Aug 22 18:53:07 raspberrypi janus[1006]: Event handlers support disabled
Aug 22 18:53:07 raspberrypi janus[1006]: Plugins folder: /usr/lib/arm-linux-gnueabihf/janus/plugins
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_textroom.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_echotest.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_sip.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_nosip.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_videocall.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_streaming.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_duktape.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_lua.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_recordplay.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_audiobridge.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_voicemail.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: Loading plugin 'libjanus_ustreamer.so'...
Aug 22 18:53:07 raspberrypi janus[1006]: Joining Janus requests handler thread
Aug 22 18:53:07 raspberrypi janus[1006]: == ustreamer/main      -- Initializing PiKVM uStreamer plugin 5.38 ...
Aug 22 18:53:07 raspberrypi janus[1006]: == ustreamer/config    -- Reading config file '/etc/janus/janus.plugin.ustreamer.jcfg' ...
Aug 22 18:53:07 raspberrypi janus[1006]: == ustreamer/audio     -- Probing PCM capture ...
Aug 22 18:53:07 raspberrypi janus[1006]: == ustreamer/audio     -- PCM capture is available
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Plugin 'libjanus_videoroom.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: == ustreamer/video     -- No active watchers, memsink disconnected
Aug 22 18:53:07 raspberrypi janus[1006]: Transport plugins folder: /usr/lib/arm-linux-gnueabihf/janus/transports
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Transport plugin 'libjanus_rabbitmq.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: Loading transport plugin 'libjanus_websockets.so'...
Aug 22 18:53:07 raspberrypi janus[1006]: libwebsockets logging: 0
Aug 22 18:53:07 raspberrypi janus[1006]: Websockets server started (port 48002)...
Aug 22 18:53:07 raspberrypi janus[1006]: JANUS WebSockets transport plugin initialized!
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Transport plugin 'libjanus_mqtt.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Transport plugin 'libjanus_http.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Transport plugin 'libjanus_pfunix.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi janus[1006]: WebSockets thread started
Aug 22 18:53:07 raspberrypi janus[1006]: [WARN] Transport plugin 'libjanus_nanomsg.so' has been disabled, skipping...
Aug 22 18:53:07 raspberrypi systemd[1]: Started Janus WebRTC gateway.
jotaen4tinypilot commented 1 year ago

@mtlynch would you consider this to be blocking in regards to https://github.com/tiny-pilot/tinypilot/pull/1579, or shall/can we move forward anyway? (I haven’t requested a review on that PR yet.)

On the one hand, I think we have to address this problem before releasing, on the other hand, the STUN logic is “internal” / WIP for the time being, so it’s not exposed to customers yet.

mtlynch commented 1 year ago

@jotaen4tinypilot - Nope, not blocking that PR, but blocking for adding support for STUN to the web UI.

mtlynch commented 1 year ago

Is there another way of fixing this without bringing back https://github.com/tiny-pilot/janus-debian ?

I'd been considering bringing that repo back anyway because I'm worried that adding a whole backports repo to apt-get might have unintended consequences.

jotaen4tinypilot commented 1 year ago

Is there another way of fixing this without bringing back https://github.com/tiny-pilot/janus-debian ?

I think that would be worth to investigate. I haven’t looked into it yet, so e.g. I haven’t confirmed the After=network-online.target “workaround” from the Janus issue. Although that suggestion seems sensible, there might be other solutions that would be easier for us to achieve.

mtlynch commented 1 year ago

I made a test build of Janus with After=network-online.target:

https://p.tinypilotkvm.com/-Jdbait87RZ/janus_20230823134012-1_armhf.deb

https://github.com/tiny-pilot/janus-debian/pull/18

Also, with Docker Layer Caching enabled, the Debian build drops from 20m to 20s, so quite a speedup!

image

jotaen4tinypilot commented 1 year ago

The systemd documentation suggests to use both After=network-online.target and Wants=network-online.target, in order to ensure network connectivity.

I just tried out both variants on device:

So by adding the following code to /lib/systemd/system/janus.service, I could see that the Janus systemd service came up after booting.

After=network-online.target
Wants=network-online.target
jotaen4tinypilot commented 1 year ago

One additional note: Using network-online.target might have unintended side-effects, since it can delay the boot procedure, or cause other glitches during startup. From reading that FAQ section, we might have to be careful to just depend on network-online.target as a general setting for the Janus service. Maybe it would be safer to only use it when STUN is enabled? At least we probably should do some testing otherwise, to ensure that this setting wouldn’t affect TinyPilot in suboptimal network situations, e.g. when DHCP service is laggy, or the network is otherwise unreliable.

mtlynch commented 1 year ago

Oh yeah, that seems like it could create some headaches.

Maybe it would be safer to only use it when STUN is enabled?

We could, but having to dynamically rewrite and reload the systemd service definition every time we configure Janus seems like it's going to be a big pain.

Maybe the easier solution is to just give it more liberal restarts like we do with load-tc358743:

https://github.com/tiny-pilot/tinypilot/blob/aa80dbdd3e58e79401008665f2199d98fa14b3d0/ansible-role-ustreamer/templates/load-tc358743-edid.systemd.j2#L4-L6

jotaen4tinypilot commented 1 year ago

We could, but having to dynamically rewrite and reload the systemd service definition every time we configure Janus seems like it's going to be a big pain.

Is it, though? We already rewrite the Janus config itself, which we render from our template, and we have to restart the Janus systemd service afterwards to effectuate the change. (See https://github.com/tiny-pilot/tinypilot/pull/1579.) So assuming that the Janus service file would also live within the tinypilot repo directly, it’s “just” another render-template invocation, right?

Maybe the easier solution is to just give it more liberal restarts like we do with load-tc358743:

I’ve successfully tried with the following service config parameters: (Stripped irrelevant parameters for conciseness.)

[Unit]
StartLimitIntervalSec=300
StartLimitBurst=20
After=network.target

[Service]
RestartSec=1

I.e., 20 retries in 300 seconds max, with a 1 second delay between retries.

The 1 second backoff seems crucial, otherwise the retries occur too rapidly, and it would hence give up “too early” when reaching the burst limit. We probably should consider an even higher burst limit, to provide more margin to account for the unpredictability of networking.

Interim summary

It looks like there are two options on the table:

Of course we could also combine both approaches.

In any event, it looks like we have to take control of the Janus service file one way or the other. At least no other option occurred to me how we could solve this. To me, it’s still a bummer that Janus behaves that way, but of course there is no point in arguing about that, since we just have to deal with how it is.

mtlynch commented 1 year ago

Is it, though? We already rewrite the Janus config itself, which we render from our template, and we have to restart the Janus systemd service afterwards to effectuate the change. (See https://github.com/tiny-pilot/tinypilot/pull/1579.) So assuming that the Janus service file would also live within the tinypilot repo directly, it’s “just” another render-template invocation, right?

Yeah, I guess "big pain" is an exaggeration. But it prevents us from using dh_installsystemd like we do for other services.

Let's punt this bug until 2.6.2. I want to cut 2.6.1 in the next couple of weeks, and the Janus stuff is starting a cascade of additional tasks that's going to push out the release date. We made some progress in that Janus can integrate STUN settings if the user edits their settings.yml, so we can offer it as a manual solution until we add in UI support.

mtlynch commented 1 year ago

Per our dev meeting today, we can probably adjust our systemd config without having to build our own Janus package again:

https://askubuntu.com/questions/659267/how-do-i-override-or-configure-systemd-services/659268#659268