matrix-org / waterfall

A cascading stream forwarding unit for scalable, distributed voice and video conferencing over Matrix
Apache License 2.0
99 stars 5 forks source link

Figure out why the grey frame after joining the call is shown for so long #131

Closed daniel-abramov closed 1 year ago

daniel-abramov commented 1 year ago

It takes on average 2 seconds to disappear, but sometimes it takes a bit more. We have to check if it's some sort of a bug in the SFU or something different.

daniel-abramov commented 1 year ago

@SimonBrandner, @robintown do you by chance know which state the grey frame corresponds to? (in EC)

AFAIK, we never display grey frames in the full-mesh mode, I would like to know when we do this in the SFU mode. Is it when the track negotiation has finished, but no RTPs arrived yet?

According to webrtcforthecurious, the ontrack is called when the first RTPs arrive, which means that if we show the video tile after getting a track, we potentially should be able to display something at the moment (unless the timestamps are somehow off?)

But according to the MDN, it's a bit different.

My current hypothesis is that the EC does get the RTPs, that's why the callback is called and we display the tile, but since the first packet is not necessarily a keyframe (see https://github.com/matrix-org/waterfall/issues/100), nothing is said until the key frame arrives. In this case, this bug seems to be an extension (or maybe even a duplicate) of https://github.com/matrix-org/waterfall/issues/100, but I would need a bit more input to get a broader perspective and to understand if it's worth addressing https://github.com/matrix-org/waterfall/issues/100 first.

daniel-abramov commented 1 year ago

UPD. After running it with OpenTelemetry and also trying to match it with the logs, there are several interesting findings I've done. Essentially, the grey frame is shown when there are no RTP packets received on a track. If we don't receive any RTP packets at the beginning of the call (i.e. us joining the call and trying to display other participants' videos or a new participant joining the call), then that's where we see a grey frame. If the frames were received on a certain simulcast layer but stopped being sent by the client after a certain time, it will be observed as a frozen frame.

Essentially it happens because although we do get informed about new tracks/transceivers, we don't know upfront when and how often the peer's browser will send RTP packets on a certain simulcast layer. The browser adjusts this behavior based on many factors and we can't assume that we always get it.

The solution is to track the state of each publisher and automatically switch between layers by the SFU. This would solve the issues caused by long grey frames for the participants and frozen participants' videos.

It's important to note that short-term grey frames (1-2 seconds) that happen right after joining the call happen for a different reason. I've noticed that each time we join, there are a couple of logs that we always get on the SFU side:

INFO[2023-03-27T23:10:51+02:00] New subscription                              conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=LTWKPUMJQI layer= subscriber="@copper-fluffy-constrictor:call.ems.host/VIHCKRDRXL" track=9950e423-c103-472f-b81e-06de083d30ec user_id="@teal-electoral-gorilla:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] New subscription                              conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=LTWKPUMJQI layer=low subscriber="@copper-fluffy-constrictor:call.ems.host/VIHCKRDRXL" track=278e9fbf-6a0d-4178-80df-b3724de6370e user_id="@teal-electoral-gorilla:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiating, sending SDP offer (3 streams)  conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=VIHCKRDRXL user_id="@copper-fluffy-constrictor:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] New subscription                              conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=LTWKPUMJQI layer= subscriber="@green-accurate-caribou:call.ems.host/QAKNDEILIZ" track=9950e423-c103-472f-b81e-06de083d30ec user_id="@teal-electoral-gorilla:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] New subscription                              conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=LTWKPUMJQI layer=low subscriber="@green-accurate-caribou:call.ems.host/QAKNDEILIZ" track=278e9fbf-6a0d-4178-80df-b3724de6370e user_id="@teal-electoral-gorilla:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiating, sending SDP offer (3 streams)  conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=QAKNDEILIZ user_id="@green-accurate-caribou:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] New subscription                              conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=LTWKPUMJQI layer= subscriber="@pink-middle-panther:call.ems.host/KBSMMCHYQX" track=9950e423-c103-472f-b81e-06de083d30ec user_id="@teal-electoral-gorilla:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] New subscription                              conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=LTWKPUMJQI layer=low subscriber="@pink-middle-panther:call.ems.host/KBSMMCHYQX" track=278e9fbf-6a0d-4178-80df-b3724de6370e user_id="@teal-electoral-gorilla:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiating, sending SDP offer (3 streams)  conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=KBSMMCHYQX user_id="@pink-middle-panther:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiation answer received                 conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=QAKNDEILIZ user_id="@green-accurate-caribou:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiating, sending SDP offer (3 streams)  conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=QAKNDEILIZ user_id="@green-accurate-caribou:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiation answer received                 conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=VIHCKRDRXL user_id="@copper-fluffy-constrictor:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiating, sending SDP offer (3 streams)  conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=VIHCKRDRXL user_id="@copper-fluffy-constrictor:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiation answer received                 conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=KBSMMCHYQX user_id="@pink-middle-panther:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiation answer received                 conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=QAKNDEILIZ user_id="@green-accurate-caribou:call.ems.host"
INFO[2023-03-27T23:10:51+02:00] Renegotiation answer received                 conf_id=1674726523924lGe9hGSoDBvHPAfL device_id=VIHCKRDRXL user_id="@copper-fluffy-constrictor:call.ems.host"

Basically, each time the subscription gets added, we add a track to the peer connection, this causes a renegotiation process to start. Until the renegotiation completes, the grey frame is seen in the Element Call. We could probably work around it by not displaying a grey frame, but instead displaying a more useful status. As for preventing the renegotiation, the only thing we could do is to allocate a set of transceivers upfront that would allow us to skip the renegotiation stage.

daniel-abramov commented 1 year ago

Interesting note, not sure that it's related, but it happened every time where the browser on the sender's side did not send RTPs on a certain layer, we've observed the following errors printed by Pion in the stdout/stderr:

pc ERROR: 2023/03/27 21:56:20 Incoming unhandled RTP ssrc(2954160688), OnTrack will not be fired. incoming SSRC failed Simulcast probing