meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.26k stars 2.48k forks source link

[1.x] Videoroom relays packets from two simulcast substreams #2899

Closed OxleyS closed 2 years ago

OxleyS commented 2 years ago

What version of Janus is this happening on? 1.x, commit ee313ac2d855dd33a9ebc402386397340c9914c1

In seemingly very particular circumstances, Janus routes packets from two simulcast SSRCs to a single subscriber. When this happens, in LOG_VERB the log is spammed with

SSRC changed, 1619259369 --> 2561054186
Computed offset for RTP timestamp: 2
SSRC changed, 2561054186 --> 1619259369
Computed offset for RTP timestamp: 539

for several seconds before calming down. During this period, the subscriber receives corrupted video.

The publisher's video sender is configured with three sendEncodings, with RIDs h,m,l in that order, but due to the video resolution being 768x432, and the bitrate constraints set on the room, Chrome seems to be silently disabling the encoding with RID l.

I briefly looked into the code - when this happens, at this point in the code, ps->vssrc contains [0, 1619259369, 2561054186]. That causes this case to be hit when relaying packets from both SSRCs, causing packets to be unconditionally relayed.

A Node project reproducing the problem can be found here. It's a modified version of the repro project for https://github.com/meetecho/janus-gateway/issues/2775, so the setup should be all the same.

lminiero commented 2 years ago

I guess the } else if(packet->ssrc[0] != 0) { check should really be a } else if(ps->simulcast) { check, exactly because you're in a scenario where index 0 is currently empty but simulcast is in use anyway and so should be processed accordingly. Can you check if that fixes it?

lminiero commented 2 years ago

If it does, I guess we should add a boolean simulcast property to janus_videoroom_rtp_relay_packet like the svc we have already, since checking ps->simulcast may be frail (according to the checks we have before, ps could be NULL, even though it shouldn't).

OxleyS commented 2 years ago

Changing the condition to ps->simulcast does indeed fix the issue, I don't get the log messages anymore. There does seem to be a separate problem in this scenario where the video starts to progressively lag behind farther and farther, but I'll investigate that more and file a separate issue if I find anything.

lminiero commented 2 years ago

There does seem to be a separate problem in this scenario where the video starts to progressively lag behind farther and farther, but I'll investigate that more and file a separate issue if I find anything.

I suspect it has something to do with the fact that there's no l substream being received, as our simulcast code works under the assumption that lower substreams are always available, and something we can fallback to. I managed to replicate the same problem by just setting active: false for l in janus.js, and video starts slowing down when you manually select the non-existing substream 0 (and apparently cannot be recovered even when you switch back to one of the layers that does exist instead).

lminiero commented 2 years ago

I think the lowest substream is not being sent in your case due to the weird resolution you chose. I see you ask for 768x432, and then use a 2.4 scale-down factor for the lowest substream: if the webcam is exactly that resolution, it should give 320x180 (which to my knowledge is the smallest resolution the browser will accept for simulcast), but if the browser ends up giving you a different resolution that's close to that one, the low substream will have a resolution the browser is not happy with, and it will disable it. As such, it's something that should first of all fixed on the client side.

That said, in my tests where I explicitly set active: false for the lowest substream, the browser always only sends me substream 1 (m), and never 2 (h), even though I don't have any constraints on the bitrate in the room or the publisher. This makes me think it could be a problem in the browser. @atoppi is checking RTCP as he thinks something's broken there too. In that case, I don't think there's anything we can do to fix that.

atoppi commented 2 years ago

After a packet capture analysis, it seems that the issue is with the RTP clock of the source (the browser in this case). In this specific scenario (disabled l layer) the RTP is not keeping the expected pace. The video is slowing down and the difference between real time and RTP time is over 1 sec.

lminiero commented 2 years ago

I think that confirms that it's a problem in how the browser sends media under those circumstances, then, not Janus, so I'll close.

lminiero commented 2 years ago

I'll push a commit shortly to address the check I mentioned in the first response, though.