Video frezes for 5-10 sec when use VP9 or VP8-simulcast

vpoddubchak commented 4 years ago

Description

When we use VP9 codec, we constantly see video freezes during session. It happens also on pure licode with minor changes (codec and bitrate)

Steps to reproduce the issue:

Deploy licode from docker image
Change default codec to vp9 (in ./licode/rtp_media_config.js)
Change default resolution to 1280x720 ( in ./licode/extras/basic_example/public/script.js)
Change default bitrate to 1.5 Mbit, MaxBitrate to 3Mbit (in ./licode/scripts/licode_defaults.js)
Start basic example on external server with real network conditions (not locally)
Connect 2 participants and wait few minutes

Describe the results you received: One of participant see video freezes on remote video

From webrtc-internals I see that bitrate drops to 0 on receiver side (as you can see packets lost not always a reason of that): VP9_receiver then it start sending PLIs and NACKs: VP9_receiver_pli Each time, after 4th PLI it recovers.

At the same time, on publisher side bandwidth is stable. Count of received PLIs increases by 2. Looks like SFU is waiting for PLI pair and then send it to publisher. Also no NACKs at all: VP9_Publisher

jcague commented 4 years ago

Based on the graphs it clearly shows that we might be messing up sequence numbers for VP9. I guess you don't see the same behavior when using VP8, right?

Another question, are you enabling simulcast or some other functionality?

vpoddubchak commented 4 years ago

With VP8 all is fine. No simulcast. In basic example I just changed resolution and bitrate (720p 1.5Mbit).

ashar02 commented 4 years ago

@jcague what is the workaround for this?

lodoyun commented 4 years ago

@jcague what is the workaround for this?

The workaround is to use VP8 :) We are focused on VP8 and that's what we use in our deployments. While VP9 is interesting and we have things in place to make it work, I wouldn't avise using it in production at this time with Licode. That said, all the reports are welcome and we will use them to know what can be failing next time we prioritize work and decide to spend some time tweaking it.

vpoddubchak commented 4 years ago

We start seeing same problem with VP8 simulcast. Reproduced on pure licode, deployed in docker. with simulcast=true option. Changed only default and max bandwidth: (768kbps and 3000kbps) Problem happens time to time. Video freezes to 10 sec on receiver side. (in the same time, other subscribers see good video): image (3)

Looks very similar to problem with VP9. Let me know if you need more information.

jcague commented 4 years ago

@vpoddubchak what version of Chrome are you using? Does it happen to you also in old versions? I know Google recently (v80) changed some internal things in Simulcast, like the number of temporal layers, and that might be causing issues in the way we handle quality layer switches.

vpoddubchak commented 4 years ago

I'm using Chrome v80. We did not see such problem in previous versions. And to prove it I did such test in Chrome v78 and Chrome v80:

Deploy licode from docker on remote server
Open 2 pages on Windows 10 Chrome (both in Chrome of one version): https://server.com/?simulcast=true
Simulate 10% incomming packets drop (to speed up testing process) by tool: http://jagt.github.io/clumsy/

Results for Chrome 78: image (4)

Results for Chrome 80: image (5)

As you can see, Chrome 78 reacts much better - no drops to 0 bitrate.

Is it possible to fix on licode side ?

vpoddubchak commented 4 years ago

jcague commented 4 years ago

yes, I think that change is affecting Licode, but we might be doing things wrong in Licode because it shouldn't make the videos freeze I think. We will work on this in the following days/weeks.

vpoddubchak commented 4 years ago

I have reproduced it on Chrome 78 on Ubuntu 18. Possible reason why it is harder to reproduce is previous bug when video bitrate is constantly ~ 150kbps, so packet lost did not affect it much. Also I investigate it a bit, this is what I know for now:

ConnectionQualityCheck detects video_fraction_lost is more than kHighVideoFractionLostThreshold (=20 * 256 / 100) and set level=ConnectionQualityLevel::HIGH_LOSSES
QualityManager receives event about level changed and set next_temporal_layer=0 and below_min_layer = true which lead to switch-ON slide-show mode
after ~10 seconds it reverts back to normal

Logs:

[erizo-310d03ec-08a1-d397-8a51-93398d3f4886] 2020-02-28 10:46:14,303  - ERROR [0x7f2545df1700] bandwidth.ConnectionQualityCheck - ================== video_fraction_lost = 84
[erizo-310d03ec-08a1-d397-8a51-93398d3f4886] 2020-02-28 10:46:14,303  - DEBUG [0x7f25455f0700] rtp.QualityManager - ==================onConnectionQualityUpdate 0
2020-02-28 10:46:14,303  - DEBUG [0x7f25455f0700] rtp.QualityManager - message: Calculate best layer, estimated_bitrate: 391110, current layer 0/2, min_requested_spatial 0
2020-02-28 10:46:14,303  - DEBUG [0x7f25455f0700] rtp.QualityManager - Bitrate for layer 0/0 81381
2020-02-28 10:46:14,303  - DEBUG [0x7f25455f0700] rtp.QualityManager - Bitrate for layer 0/1 134946
2020-02-28 10:46:14,303  - DEBUG [0x7f25455f0700] rtp.QualityManager - Bitrate for layer 0/2 134946
2020-02-28 10:46:14,303  - DEBUG [0x7f25455f0700] rtp.QualityManager - message: below_min_layer 1, freeze_fallback_active_: 0
2020-02-28 10:46:14,303  - DEBUG [0x7f25455f0700] rtp.QualityManager - message: Setting slideshow fallback, below_min_layer 1, spatial_layer 0,next_spatial_layer 0 freeze_fallback_active_: 1, min_requested_spatial_layer: 0,slideshow_below_spatial_layer_ -1
[erizo-310d03ec-08a1-d397-8a51-93398d3f4886] 2020-02-28 10:46:14,303  - DEBUG [0x7f25455f0700] rtp.QualityManager - message: Layer Switch, current_layer: 0/2, new_layer: 0/0
2020-02-28 10:46:14,303  - DEBUG [0x7f25455f0700] rtp.QualityManager - message: Is padding enabled, padding_enabled_: 0

Questions:

Is it a bug or just reaction on packet lost ?
How kHighVideoFractionLostThreshold is defined ?, can It be changed (*2) to minimize such situations ?
Is it possible that video_fraction_lost now is bigger than in previous versions of Chrome (before 78 at least) ?

jcague commented 4 years ago

That is a mechanism to better respond to high packet losses, so it's not a bug anyway. You can even disable it but we don't recommend it. Instead, you can use the event we generate in the client side to let the user know he has some connectivity issues.

jcague commented 4 years ago

As I said before this is a feature in Licode to better send traffic through lossy networks, so I close this issue.

lynckia / licode

Video frezes for 5-10 sec when use VP9 or VP8-simulcast #1537