lynckia / licode

Open Source Communication Provider based on WebRTC and Cloud technologies
http://lynckia.com/licode
MIT License
3.1k stars 1.01k forks

Video problem in chrome v 72 #1364

Closed ashar02 closed 5 years ago

ashar02 commented 5 years ago

Briefly describe the problem you are having in a few paragraphs. After the release of Chrome 72, video appears blurry and pixelated.

Steps to reproduce the issue: I joined the room on https://chotis2.dit.upm.es with two participants. After 15-30 seconds, the remote video becomes blurry and pixelated. I even installed my own Licode server from code; it happens with the latest code as well.

Describe the results you received: On investigation using webrtc-internals, I found that there is big packet loss on the publisher end, which is causing this issue.

[screenshot: webrtc-internals graphs]

Describe the results you expected: Since I have sufficient bandwidth available, it should not see this many packet losses.

Additional information you deem important (e.g. issue happens only occasionally): I somehow managed to install Chrome 68-70, and under the same conditions there is no packet loss, so video bitrate and quality are good on those versions.

Licode commit/release where the issue is happening: I used the 'try room' on the Licode site, as well as commit d92f97308b819de38dd7fefbbf939820c7d675d0 for my own installation.

Additional environment details (Local, AWS, Docker, etc.):

ashar02 commented 5 years ago

@jcague can you please look at this issue? It happens every time with just two participants.

jcague commented 5 years ago

Does it happen to you when there is only 1 participant?

Anyway, I tried to reproduce this issue with Chrome v72 and chotis2, but everything worked fine for me. I'm using macOS and tried with 1 and 2 participants, so it's going to be difficult to know the reason for such packet losses. What is your RTT? (You can see it in another graph inside webrtc-internals.)

[screenshot]

ashar02 commented 5 years ago

[screenshot: webrtc-internals graphs]

Here is the full graph. Over several tests, I found that sometimes the problem appears within 30 seconds and sometimes within 1 to 2 minutes. You can see the whole graphs. I tested with two participants on macOS and Windows. Anything else I can share with you to help narrow it down?

ashar02 commented 5 years ago

erizo-80c01c67-5b68-11c5-c1f7-a30db202a263.log is the log file from the Licode server installed on my machine, which shows the same issue.

ashar02 commented 5 years ago

@jcague will the log file attached above help you diagnose this?

jcague commented 5 years ago

could you repeat the test with just one participant and look at what happens to the graphs?

ashar02 commented 5 years ago

okay, let me test and share the results with you shortly.

ashar02 commented 5 years ago

I tested it for many minutes. If I am alone in the room (no participants except me), there is no packet loss and the video bitrate is around 300k. As soon as another participant joins the room, packet loss starts after some time. Then I removed the recently joined participant from the room, and you can see in the graph that the bitrate goes back to 300k and packet loss returns to zero.

[screenshot: webrtc-internals graphs]

I tested this scenario from different regions too (Dubai, Singapore, Pakistan) and the results are the same. If I use Electron with Chrome 68, all publishing goes perfectly. I even tried the notify-break-recover scheme so that subscribers do not affect the publisher, but there are still the same packet drops. Any guess/hint to investigate further?

ashar02 commented 5 years ago

I used this command to get further logs: open /Applications/Google\ Chrome.app --args --enable-logging --v=1 --vmodule=libjingle/source/talk/=3. Here are the logs: chrome_debug.log

You can see that as soon as the subscriber joins, packet drops start with the message 'transient error net::ERR_ADDRESS_UNREACHABLE. Dropping the packet.'

Also [26659:26639:0221/124121.219679:INFO:remote_bitrate_estimator_abs_send_time.cc(165)] Probe failed, sent at 692093 bps, received at 572307 bps. Mean send delta: 10.75 ms, mean recv delta: 13 ms, num probes: 4

[26723:37131:0221/131211.194388:VERBOSE1:video_stream_encoder.cc(984)] OnBitrateUpdated, bitrate 155854 packet loss 221 rtt 621

ashar02 commented 5 years ago

please guide @jcague

Piasy commented 5 years ago

I also tested Chrome M72 on macOS, but I can't reproduce the problem; I only see it in the Android and iOS native clients.

ashar02 commented 5 years ago

@paericksen You mean that Licode is making different bandwidth-related decisions for mobile vs. web?

Piasy commented 5 years ago

I'm not sure, but I can't reproduce it in Chrome M72 with the same server.

Oh no, I reproduced it after 3 minutes!!!

[screenshot: webrtc-internals graphs]

ashar02 commented 5 years ago

@jcague please guide on this?

jcague commented 5 years ago

@ashar02 can you please try using nicer in your local environment? You should enable it in the licode_config.js file: config.erizo.useNicer = true; // default value: false
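
For anyone following along, the change would look something like this in licode_config.js (a sketch; only the useNicer line comes from this thread, the surrounding structure is illustrative):

```javascript
// licode_config.js (fragment, illustrative)
var config = config || {};
config.erizo = config.erizo || {};

// Switch erizo's ICE implementation to nicer, as suggested above.
// Default value is false.
config.erizo.useNicer = true;
```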

ashar02 commented 5 years ago

sure!!

ashar02 commented 5 years ago

Tried using config.erizo.useNicer = true; and restarted the server too. Same results.

ashar02 commented 5 years ago

@jcague any other hint to try?

ashar02 commented 5 years ago

@lodoyun any guess from your end?

Piasy commented 5 years ago

Guess we have to RTFSC ourselves :(

ashar02 commented 5 years ago

@Piasy yes, we have to do it ourselves. In the same scenario Jitsi is working perfectly...

Piasy commented 5 years ago

@ashar02 I tested Jitsi and the Janus gateway; they perform very badly under weak network conditions (about 300 ms ping RTT and 3% packet loss).

jcague commented 5 years ago

@Piasy @ashar02 this is currently on our radar, but as I told you, I can't reproduce it with macOS and Chrome v72, and our users don't experience the low bitrate, probably because we use simulcast. We're currently working on fixing resubscriptions with single peer connections, which is affecting clients in our production environment, but we'll work on this issue ASAP. Anyway, you're more than welcome to investigate in the meantime. All the info, like the logs you gathered, will help us find the cause.

jcague commented 5 years ago

@Piasy @ashar02 could you please test with the PR I referenced above? I don't know yet if it's the best solution, but I wanted to try it before going on.

Piasy commented 5 years ago

Sure I'll try it later today.

ashar02 commented 5 years ago

@jcague let me check it shortly...

ashar02 commented 5 years ago

@jcague on different quick runs with Chrome 72 it is running perfectly with this PR. Great job!! Tonight I will do detailed testing on this.

ashar02 commented 5 years ago

@jcague Previously, I did my testing on a LAN. I checked with up to 6 participants and everything went perfectly. Now I am testing over the Internet using the same Chrome 72. Up to two participants everything goes well, but on adding a 3rd participant to the room, the remote video frame rate drops and finally gets stuck on every end.

screen shot 2019-03-05 at 8 01 56 pm

You can see that the PLI and NACK counts are increasing to bad numbers, which is causing the video to get stuck. The good thing is that packet loss is zero now. Can you suggest further steps?

Piasy commented 5 years ago

@jcague I did some tests on the Android and iOS native clients too, both using revision 26131. The problem still exists, or is even worse: when two peers join the same room, the video gets stuck shortly after, but the audio is good.

Below is the server log; we can see that the measured bitrate drops quickly.

licode.log

yayuntian commented 5 years ago

@lodoyun Can you raise the priority of resolving this? It is frequently reproduced in our environment (Android & web).

jcague commented 5 years ago

@ashar02 I don't think that's the same issue, but it might be caused by the logic I added. That's why this is a PoC: it might break the bandwidth estimation algorithm if there are packet losses on some subscriber.

ashar02 commented 5 years ago

@jcague I think I should share the statistics of the subscriber instead of the publisher, since that's where the video actually freezes while the audio keeps coming perfectly. Let me investigate further on the subscriber end and let you know today!! One point to note: this video freeze did not happen at the subscriber end before applying the above PR. It got blurry, but it never froze previously.

yayuntian commented 5 years ago

@ashar02 We tried discarding the RTCP RR packets that erizo itself generates and only using the RTCP RR packets from the peer client, which solves the problem for one-to-one calls. It's just a small modification; you can verify it in your environment:

      if (nacks_enabled_ && generator_it->second->nack_generator != nullptr) {
        generator_it->second->nack_generator->addNackPacketToRr(rtcp_packet);
      }
-      ctx->fireWrite(std::move(rtcp_packet));
+ //      ctx->fireWrite(std::move(rtcp_packet));

looking forward to your verification

ashar02 commented 5 years ago

In my environment, after yesterday's PR, publishing seems to run well now. What I am observing currently is that at the subscriber end huge packet loss is causing the video to get stuck. I am using the notify-break-recover scheme, so the publisher runs independently of the subscribers. My bandwidth is sufficient, so the subscriber's video stream should not show such huge packet loss.

@yayuntian let me try above change and let you know.

ashar02 commented 5 years ago

@yayuntian No effect from the above change. I think we have to look at the huge packet loss at the subscriber end. Is the packet loss on the subscriber stream derived from the RTCP SRs?

jcague commented 5 years ago

the solution commented by @yayuntian works for one-to-one but not for one-to-many scenarios. Also, notice that my PR is a PoC: it aims to fix the issue commented by @ashar02 and shown in the first webrtc-internals page, but it might affect bandwidth estimation. A better solution needs to be implemented to work with Chrome's BWE properly, and we need to design it first. Anyway, your tests confirmed to us that the cause of this issue is a change in the sender BWE in Chrome 72: it now depends on packet losses reported in RRs. It seems we need to refactor that part of the code and/or check whether we are forwarding RRs properly. Another short-term alternative for one-to-many streams is to enable simulcast, which is working well (we're currently using it in our environment).
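
One possible shape for the RR aggregation mentioned above, sketched in JavaScript with hypothetical field names (this is only an illustration of the idea, not the design Licode ended up shipping):

```javascript
// Illustrative sketch: collapse the latest RR from each subscriber into a
// single report for the publisher, so Chrome's sender-side BWE sees one
// consistent loss signal instead of interleaved, jumping values.
function aggregateReceiverReports(reports) {
  if (reports.length === 0) return null;
  return {
    // Worst-case view: the publisher should adapt to the weakest subscriber
    fractionLost: Math.max(...reports.map((r) => r.fractionLost)),
    // Taking the max keeps consecutive aggregated RRs from ever reporting
    // a decreasing cumulative loss count
    cumulativeLost: Math.max(...reports.map((r) => r.cumulativeLost)),
    // Highest extended sequence number seen by any subscriber
    extendedHighestSeq: Math.max(...reports.map((r) => r.extendedHighestSeq)),
    // Smallest jitter among subscribers
    jitter: Math.min(...reports.map((r) => r.jitter)),
  };
}
```

Whether to report the worst subscriber or some blended value is exactly the kind of design decision hinted at above; this sketch just shows the worst-case variant.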

ashar02 commented 5 years ago

@jcague got your points. Just tell me one thing: is packet loss at the subscriber end calculated based on the RTCP SRs from Licode?

jcague commented 5 years ago

Licode does not generate its own SRs; it just forwards the publisher's SRs. You're right that the subscriber end needs the SR to calculate packet loss.
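
For reference, RFC 3550 (appendix A.3) describes how a receiver turns its sequence-number accounting into the cumulative-lost and fraction-lost fields that end up in an RR; a minimal sketch of that calculation, with hypothetical names (this is not Licode's actual code):

```javascript
// Sketch of RFC 3550 A.3 loss accounting for one reporting interval.
function computeRrLossFields(stats) {
  // expected = highest extended seq received - first seq + 1
  const expected = stats.extendedHighestSeq - stats.baseSeq + 1;
  const cumulativeLost = expected - stats.packetsReceived;

  // Per-interval values since the previous RR
  const expectedInterval = expected - stats.expectedPrior;
  const receivedInterval = stats.packetsReceived - stats.receivedPrior;
  const lostInterval = expectedInterval - receivedInterval;

  // fraction lost is an 8-bit fixed-point value: (lost / expected) * 256
  const fractionLost = expectedInterval <= 0 || lostInterval <= 0
    ? 0
    : Math.floor((lostInterval << 8) / expectedInterval);

  return { cumulativeLost, fractionLost };
}
```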

yayuntian commented 5 years ago

@jcague With continued testing, we found that even with the RR packets removed there is still a problem; it does not solve the one-on-one case, and the bitrate still decreases.

The Android log shows abnormal REMB values:

03-12 20:23:57.789 14967-3502/com.ysten.education V/libjingle: (rtcp_receiver.cc:1345): Incoming REMB: 116414
03-12 20:23:57.790 14967-3502/com.ysten.education V/libjingle: (rtcp_receiver.cc:1345): Incoming REMB: 116414
03-12 20:23:57.798 14967-3591/com.ysten.education V/libjingle: (video_sender.cc:277): Drop Frame target bitrate 34777 loss rate 
03-12 20:23:57.812 14967-3573/com.ysten.education V/libjingle: (vie_encoder.cc:247): OnBitrateUpdated, bitrate 34777 packet loss 0 rtt 97
03-12 20:23:57.849 14967-3502/com.ysten.education I/libjingle: (rtp_stream_receiver.cc:307): Packet received on SSRC: 2278783019 with payload type: 107, timestamp: 1692360443, sequence number: 37938, arrival time: 1552393437848, abs send time: 3810222
03-12 20:23:57.870 14967-3591/com.ysten.education V/libjingle: (video_sender.cc:277): Drop Frame target bitrate 34777 loss rate 
03-12 20:23:57.915 14967-3502/com.ysten.education V/libjingle: (rtcp_receiver.cc:1345): Incoming REMB: 33422
03-12 20:23:57.915 14967-3502/com.ysten.education V/libjingle: (rtcp_receiver.cc:1345): Incoming REMB: 33422
03-12 20:23:57.926 14967-3591/com.ysten.education V/libjingle: (video_sender.cc:277): Drop Frame target bitrate 34777 loss rate
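
As background on the REMB values in this log: the REMB payload carries the bitrate as a 6-bit exponent and an 18-bit mantissa (bitrate = mantissa * 2^exp, per draft-alvestrand-rmcat-remb), so large values round slightly on the wire. A small illustrative sketch of that encoding (not Licode's code):

```javascript
// Encode a bitrate into REMB's 6-bit exponent / 18-bit mantissa form
// and decode it back; values needing more than 18 mantissa bits lose
// low-order precision.
function encodeRembBitrate(bps) {
  let exp = 0;
  let mantissa = bps;
  while (mantissa >= (1 << 18)) {  // mantissa must fit in 18 bits
    mantissa >>= 1;
    exp += 1;
  }
  return { exp, mantissa };
}

function decodeRembBitrate({ exp, mantissa }) {
  return mantissa * Math.pow(2, exp);
}
```

So the 116414 and 33422 figures above are genuine receiver estimates, both small enough to encode exactly.
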

ashar02 commented 5 years ago

@jcague We all think that Licode is the best server in the world; its beauty is its performance in low-bandwidth situations. But for the past few months this bug (without simulcast enabled) has been putting the users of this repo in a hard situation. So my suggestion would be to prioritize two things:

  1. Fix this issue with simulcast disabled. Your PoC was a good direction.
  2. Make VP9 functional so that bandwidth can be saved. It's the demand of the time.

Hope you will pay attention to these as a priority as well.

jcague commented 5 years ago

I completely agree on the prioritization, but there are other things that are even more important. We're still working on fixing many cases where SDP negotiation fails due to race conditions, for instance. But we'll work on this afterwards since, as you say, this bug may affect many people.

steven8274 commented 5 years ago

We met this problem too! If we use an old version of Chrome, 67 for example, video quality is not poor.

vpoddubchak commented 5 years ago

We are also seeing this problem on our servers.

Our understanding of the problem:

  1. Google changed the BWE in v72, as was mentioned above. Details here: https://groups.google.com/forum/#!topic/discuss-webrtc/GqPl8rKpt7Q
  2. Licode needs to be updated to handle these changes.
  3. PoC #1374 did not help (we tried it). Please correct me if I'm wrong.

@jcague You mentioned that the issue is not reproduced with simulcast enabled. Can you explain why? We still see this problem with simulcast enabled, and it unfortunately makes Licode absolutely unusable with the current version of Chrome (v74 for now).

jcague commented 5 years ago

this problem should not affect simulcast, at least with the change I tested in the PoC you mention, because in that case Licode does not forward RRs from subscribers to the publisher. We're designing different solutions for this problem, including updating Google's BWE codebase in Licode and aggregating data from RRs to avoid having deltas between them.

steven8274 commented 5 years ago

I think that's a bug in WebRTC; https://groups.google.com/forum/#!topic/discuss-webrtc/GqPl8rKpt7Q is a post by me. Temporarily, I wrote an adapter for the RTCP forwarder and the RTCP RR generator to make sure the lost-packet count does not decrease, and to use the source SSRC as the sender SSRC.
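
A minimal sketch of the kind of adapter described above, with hypothetical names (the real one would sit between Licode's RTCP forwarder and its RR generator):

```javascript
// Sketch: clamp the cumulative-lost field so successive RRs for a stream
// never report a decreasing loss count, and stamp the report with the
// source SSRC as the sender SSRC, as described above.
class RrAdapter {
  constructor() {
    this.maxLostBySsrc = new Map();
  }

  adapt(rr) {
    const prev = this.maxLostBySsrc.get(rr.sourceSsrc) || 0;
    const cumulativeLost = Math.max(prev, rr.cumulativeLost);
    this.maxLostBySsrc.set(rr.sourceSsrc, cumulativeLost);
    return { ...rr, cumulativeLost, senderSsrc: rr.sourceSsrc };
  }
}
```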

vpoddubchak commented 5 years ago

@jcague, I think the problem is on the subscriber side, and this is why simulcast also does not work. Here is a use case to prove it: a video conference with 3 participants (A, B, C), simulcast enabled with 3 layers.

  1. At the beginning all participants see each other very well (480p30).
  2. After 3 minutes participant B sees degradation of the video from A: 480p30, then 240p15, and finally 240p0, a freeze. At the same time participant B sees good video from C (480p30).
  3. Participant C still has good quality from A and B (480p30).
  4. The longer the test, the more problems...

So we have situations where some subscribers see frozen video while others see good quality from the same publisher, which proves the problem is between Licode and the subscriber. There is also a snapshot of receiver data from ackuaria: as you can see, the estimated bandwidth started degrading, which caused a switch to a lower simulcast layer.
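
The downgrade sequence described above (480p30 to 240p15 to 240p0) matches a simple bandwidth-driven layer selection rule; a sketch with made-up layer names and bitrates (illustrative only, not Licode's actual logic):

```javascript
// Pick the highest simulcast layer whose bitrate fits the estimate.
// Layer bitrates are invented examples for 240p/480p/720p-style layers.
const LAYERS = [
  { name: '240p15', bitrate: 150000 },
  { name: '480p30', bitrate: 500000 },
  { name: '720p30', bitrate: 1200000 },
];

function pickLayer(estimatedBps, layers = LAYERS) {
  let chosen = null;  // null: even the lowest layer does not fit, i.e. freeze
  for (const layer of layers) {
    if (layer.bitrate <= estimatedBps) chosen = layer;
  }
  return chosen;
}
```

As the estimate decays, pickLayer walks down the ladder and eventually returns null, which corresponds to the frozen video reported above.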

[screenshot: ackuaria receiver data]

@steven8274 Does your solution work? Can you share more details?

jcague commented 5 years ago

I think the issue @vpoddubchak mentions is a different bug, which might be solved by updating the WebRTC library; that's another task we have in the backlog. @steven8274 thanks for the info, I was thinking more of aggregating data, but I'll try what you suggested.

steven8274 commented 5 years ago

@vpoddubchak I haven't used simulcast so far, so I don't know the details of simulcast situations.

darkterrorooo commented 5 years ago

I am experiencing a similar problem; it may indeed be a WebRTC bug. I analyzed the cause in a video session between Chrome on macOS and Chrome on Windows:

if the camera on the Windows side is jittery, Chrome on the Mac feeds back a small REMB and keeps it small for a long time, so the video rate on the Windows side stays very low and pixelated.

After some other tests I found that the problem lies in Chrome on macOS (version 74.0.3729.131).

My solution is to forbid the REMB feedback in Licode.

this is the log of licode:

max_bit 1000000 tar_bit 1000000, remain_bit 1320464 bitrate 1000000 remb 1000000
[erizo] remb 393004 ssrc 1

**// 393004 is smaller than the value set; sometimes it is less than 100000 (100 kbps)**

max_bit 1000000 tar_bit 1000000, remain_bit 393004 bitrate 393004 remb 393004
[erizo] remb 1320464 ssrc 1
max_bit 1000000 tar_bit 1000000, remain_bit 1320464 bitrate 1000000 remb 1000000
[erizo] remb 395242 ssrc 1
max_bit 1000000 tar_bit 1000000, remain_bit 395242 bitrate 395242 remb 395242
[erizo] remb 1320464 ssrc 1
max_bit 1000000 tar_bit 1000000, remain_bit 1320464 bitrate 1000000 remb 1000000
[erizo] remb 397548 ssrc 1
max_bit 1000000 tar_bit 1000000, remain_bit 397548 bitrate 397548 remb 397548
lodoyun commented 5 years ago

I'm testing a new solution here: https://github.com/lynckia/licode/pull/1413. Please give it a try and let me know how it goes.