meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0

Videoroom A/V Sync Issues #1653

Closed kylefoley closed 4 years ago

kylefoley commented 5 years ago

We're seeing audio/video out of sync - note the googCurrentDelayMS in the graph, which seems to be growing. In this case, the audio was ahead of the video by 1s. The audio googCurrentDelayMS was about 50ms.

I can attach a WebRTC dump if necessary. Any help in debugging it would be appreciated. Thanks!

- Janus v0.7.1
- Video codec: H264
- Audio codec: Opus
- Browsers: Chrome v74
- Using CoTURN
- macOS on one side, Win10 on the other

[webrtc-internals screenshot attached]

lminiero commented 5 years ago

Possibly a duplicate of #1220 and #1463.

lminiero commented 5 years ago

As a side note, you're not saying which plugin this happens on.

kylefoley commented 5 years ago

It's in the title, using the videoroom plugin

kylefoley commented 5 years ago

This example was using simulcast, but we are able to duplicate it without simulcast as well

kylefoley commented 5 years ago

Here is the non-simulcast example. I was in a conference for <20 mins and this started happening. I have the dump for this as well if you'd like to see it.

[Screenshots attached: Screen Shot 2019-06-04 at 10 20 07 AM, Screen Shot 2019-06-04 at 10 20 14 AM]

kylefoley commented 5 years ago

Sorry for the image spam, just trying to provide you guys with as much info as possible on this

[Screenshot attached: Screen Shot 2019-06-04 at 10 57 05 AM]
atoppi commented 5 years ago

Thanks @KyleFoleyImajion, the viewer webrtc-internals screenshots are valuable input, but I'd also like to observe the related publisher stream. Is the publisher a browser? Would you mind sharing some specs about it, and its internals (if any)? Just by looking at the viewer stats and focusing on the non-simulcast case, I can deduce the following:

  1. no packets have been lost (0 NACKs, 0 loss);
  2. the received framerate is not stable at all and the interframe delay is way too high for a 30fps video: possible (but not exhaustive) explanations are load stress on the server or on the publisher itself. Have you observed any CPU spikes on your systems?

Finally I'd like to know if you have observed any special event in conjunction with the delay stat spike (also in Janus logs).

lminiero commented 5 years ago

2. possible (but not exhaustive) explanations are load stress on the server or on the publisher itself

Or the TURN server, if the publisher is going through there.

kylefoley commented 5 years ago

Thanks for the response!

- The publisher was a Win10 machine running Chrome v74.
- The subscriber was on macOS running Chrome v74 as well.
- I had two publishers in that conference and they were also both subscribed to each other.
- User A <-> User B, both publishing video.

Both users were going to our Janus server hosted on AWS, bitrate capped at 512000 (set when calling publish, not set to the room). Our CPU for Janus was steady around 7%, no spikes from what we saw. I'll repro this and get you the Janus logs at some point today or tomorrow. We did not check the TURN CPU for spikes, but I will once I repro this.

We run CoTURN, but it appears these two connections were just using STUN.

We didn't see/trigger any events when this happened; we can't quite figure out what is triggering it.

Nothing special with our webrtc setup browser side, just a standard non-simulcast connection

atoppi commented 5 years ago

How can you tell that PeerConnections are not using TURN from those screenshots? How was the streaming experience overall? Can you confirm the framerate issues?

The delay spike seems to match the Interframe Delay spike. We need to understand who is slowing down the frame relaying. Janus? Publisher? CoTURN? Network?

UPDATE: Also check that your AWS instances have proper network capabilities.

lminiero commented 5 years ago

As a side note, make sure you're enabling transport_wide_cc_ext for your rooms if you're not doing it already: https://github.com/meetecho/janus-gateway/blob/master/conf/janus.plugin.videoroom.jcfg.sample#L29
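For reference, a minimal sketch of a VideoRoom room definition with that option turned on (the room name and other values here are only illustrative; see the linked sample config for the full list of options):

```
room-1234: {
	description = "Demo room"
	publishers = 6
	bitrate = 512000
	# Negotiate the transport-wide CC RTP extension with peers
	transport_wide_cc_ext = true
}
```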

kylefoley commented 5 years ago

I still have the WebRTC Chrome dump from that call, so I just went back and looked at which candidate it was using for the call. The streaming experience was pretty good, nothing stood out as 'abnormal' or weird looking. We do have a framerate counter visible while in the call, so I can confirm that it does jump around quite a bit, but I don't visually notice it.

I ran this test with both computers on my home network connecting to AWS, I have a 1 gig up/down connection. This has happened on other networks as well, but I can't confirm if they were using STUN or TURN. Also our AWS instance is open to all network traffic

I did not have transport_wide_cc_ext enabled, I'm doing that right now and will see if I can repro that issue.

Thanks for your help!

atoppi commented 5 years ago

@KyleFoleyImajion should the issue persist, we might need an unencrypted traffic capture on Janus. I suggest doing this by setting the flag no_webrtc_encryption in the Janus config, and then launching Canary Chrome (both pub and sub) with the flag --disable-webrtc-encryption. Capture with tcpdump as usual and try to avoid other noisy network traffic.
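As a rough sketch of that capture setup (the interface, browser binary name and file names below are placeholders and depend on your platform):

```sh
# In the Janus config, disable media encryption (testing only!):
#   no_webrtc_encryption = true

# Launch Chrome Canary with encryption disabled, on both publisher and subscriber:
chrome --disable-webrtc-encryption

# On the Janus host, capture UDP traffic while reproducing the issue:
sudo tcpdump -i any -n udp -w janus-unencrypted.pcap
```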

kylefoley commented 5 years ago

Just a quick update: enabling transport_wide_cc_ext causes some users' framerates to plummet if they wave their hand around in front of their camera. Their normal framerate is 15fps; waving their hand causes it to drop to ~2fps. We still get the A/V sync issues with transport_wide_cc_ext as well.

lminiero commented 5 years ago

Sounds like those devices have problems of their own. Transport Wide CC is used by the browser to automatically adapt the bitrate when problems are detected (e.g., lost packets, high CPU usage, etc.), so if you get 2 fps, there are problems there.

lminiero commented 5 years ago

A couple things you can try as well:

  1. RTP forward audio and video to, e.g., gstreamer, to render the incoming media using something different from the browser, and check if the delay is there too (a rough sketch follows below);
  2. record audio and video to our .mjr format, postprocess audio and video and mux them together, again to see if the delay is perceived there as well.

Those can help figure out if the issue is indeed at the source or not (e.g., skewed stream).
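For reference, a rough sketch of both options (the ports, payload type and file names below are placeholders, and the second option assumes Opus audio and VP8 video recordings):

```sh
# 1) Render an RTP-forwarded Opus stream with GStreamer
#    (assuming the VideoRoom rtp_forward sends audio to UDP port 5002, payload type 111)
gst-launch-1.0 udpsrc port=5002 \
    caps="application/x-rtp,media=audio,encoding-name=OPUS,clock-rate=48000,payload=111" ! \
    rtpjitterbuffer ! rtpopusdepay ! opusdec ! autoaudiosink

# 2) Post-process the .mjr recordings and mux them, to check the delay at the source
janus-pp-rec publisher-audio.mjr audio.opus
janus-pp-rec publisher-video.mjr video.webm
ffmpeg -i audio.opus -i video.webm -c copy muxed.webm
```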

kylefoley commented 5 years ago

Alright, so I waited until I could duplicate the A/V sync issue and then started recording once I noticed the drift. The recording was perfectly in sync, but the live video was out of sync pretty badly. We're still going through this, just wanted to update you. This time I was using Firefox, so I wasn't able to get a webrtc-internals dump.

atoppi commented 5 years ago

Thanks for the feedback, so issues on the publisher's end are very unlikely. Now let's focus on the viewer's side: try collecting a viewer pcap dump through Janus, or use the aforementioned flags to disable WebRTC encryption.

kylefoley commented 4 years ago

Apologies for this taking so long!

Here's what I did to test/reproduce the AV delay

Janus setup:
- Version 0.7.2
- TWCC enabled
- WebRTC encryption disabled

'Bot' setup:
- 4 peers on a Windows machine running Canary v78 with --disable-webrtc-encryption

My setup:
- Running Canary on macOS with --disable-webrtc-encryption
- Captured a tcpdump and webrtc-internals dump from this machine

You can see the delay climbing in this picture

[Screenshot attached: Screen Shot 2019-07-30 at 3 35 22 PM]

The only program I had running on my Mac during the tcpdump capture was Canary, so hopefully there is minimal other network traffic.

Hopefully this sheds some light on what is really going on. Thank you!

webrtc_internals_dump.txt

janus.pcap

ibc commented 4 years ago

Is the publisher sending video simulcast?

kylefoley commented 4 years ago

No simulcast at all

ibc commented 4 years ago

@lminiero @atoppi, the way libwebrtc syncs received audio and video belonging to the same source is by looking at the msid value in their corresponding m= sections. That is, the remote SDP that the viewer consumes must generate audio and video remote tracks that belong to the same MediaStream. I assume you already do that, but just in case.

@KyleFoleyImajion can you confirm that the remote SDP is like that? I mean a=msid lines, or a=ssrc lines with an msid attribute in them (depending on whether Unified-Plan or Plan-B is used).
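For reference, this is roughly what that looks like in a Unified-Plan SDP: both m= sections carry a=msid lines whose first token (the MediaStream id) is the same value, which is what libwebrtc uses to pair the tracks for A/V sync. The identifiers below are made up for illustration:

```
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=mid:0
a=msid:janus-stream janus-audio-track
...
m=video 9 UDP/TLS/RTP/SAVPF 96
a=mid:1
a=msid:janus-stream janus-video-track
```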

kylefoley commented 4 years ago

Seems like the SSRC IDs are the same.

[screenshot attached]

ibc commented 4 years ago

so that's not the issue

atoppi commented 4 years ago

Hi @KyleFoleyImajion, I've requested permission to download the pcap dump.

Focusing on the webrtc dumps:

lminiero commented 4 years ago

Any update on this? I don't see answers to Alessandro's questions.

rvo337 commented 4 years ago

Hello janusers :wave: I have an issue when I set the audio/video codecs on room creation (usually I use audiocodec = g722 and videocodec = vp8).

The lip-sync issue reproduces in the first seconds after connecting: googTargetDelayMs increases every second until it reaches 800 and stays there, and googCurrentDelayMs increases along with it. For the rest of the call the audio/video is not synced.

The problem goes away when I don't set the audio/video codecs on room creation, but in my case I can't use the default codecs (opus/vp8).

Janus commit: 85fe39ab1bb1f70a522cd475174979456b1e8d5a
Compiled on: Thu Nov 7 19:53:47 MSK 2019
Video codec: VP8
Audio codec: G722

Using CoTURN (does not affect the issue). macOS Chrome (78.0.3904.87) on one side, iPhone 6s Safari on the other.

[screenshot attached]
janus_lip_sync.txt — webrtc-internals dump

Always available to help solve the problem :hugs:

rvo337 commented 4 years ago

Hi :wave: I checked all available audio/video codec combinations — the issue reproduces when I use the g722 or isac16 audio codecs

atoppi commented 4 years ago

Hi @rvo337, can you try the same audio-video combinations with different machines / OS ?

rvo337 commented 4 years ago

Hi @atoppi,

The problem was diagnosed in our application on different devices (we implemented the client part ourselves on iOS/Android/Electron, without using janus.js). Then, to make sure that the problem is not in the devices or the client-side implementation, I ran the following experiments:

  1. The same devices were connected peer-to-peer via a TURN server — the lip-sync problem did not reproduce.
  2. The same devices were connected via Janus Gateway (videoroom plugin with the default configuration files) and tested using the videoroom plugin demo page (janus.js configured with the address of our server where Janus Gateway is deployed). When the «audiocodec» option has its default value (the «audiocodec» line removed or commented out, defaulting to «opus»), the lip-sync issue does not reproduce. But when I change only that one option to «audiocodec = g722 (or isac16)», I get the lip-sync issue.

I have tried again with another device on the other side.

Environment:
Janus commit: f426cef7e7f52e0bc2ff0ced8c8f4edbc007f517
Compiled on: Fri Nov 8 13:39:36 MSK 2019

macOS Chrome (78.0.3904.87) on one side; a Lenovo laptop (AMD A4-5000 APU with Radeon(TM) HD Graphics, 1.50 GHz, 4 GB RAM, Windows 10 64-bit) running Chrome 78.0.3904.87 on the other.

First attempt with «audiocodec = g722» — the issue is reproduced. [screenshot attached]
webrtc-internals dump from macOS — janus_g722.txt

Second attempt with the default configuration files («audiocodec» commented out, default value opus) — the issue is not reproduced. [screenshot attached]
webrtc-internals dump from macOS — janus_default.txt

lminiero commented 4 years ago

Janus doesn't care about the codec: we don't transcode, we just relay packets. As such, if the codec is the root of the problem, it's a browser/client issue, not ours.

lminiero commented 4 years ago

Unless it has something to do with sampling rates? g722, like g711, uses an 8000 sampling rate: no idea what isac negotiates, but I can't really say I care, since nobody uses it. Maybe we're doing something wrong when updating RTP headers, where we may be expecting 48000 for Opus. We should update the sampling rate depending on what's being negotiated, but maybe we're only doing it for g711 and not g722. I'll look into it.
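To make the clock-rate point concrete, here is a minimal C sketch (illustrative only, not the actual Janus code) of the kind of codec-to-clock-rate mapping this implies; note that G.722, like G.711, advertises an RTP clock rate of 8000 even though it actually samples at 16 kHz:

```c
/* Illustrative sketch only, not the actual Janus code: derive the RTP clock
 * rate from the negotiated audio codec instead of assuming Opus' 48000, so
 * that RTCP-related timestamp math advances at the right pace. */
#include <stdint.h>
#include <strings.h>

static uint32_t audio_clock_rate(const char *codec) {
	if(codec == NULL || !strcasecmp(codec, "opus"))
		return 48000;
	/* G.722's RTP clock rate is 8000 (like G.711), despite 16 kHz sampling */
	if(!strcasecmp(codec, "g722") || !strcasecmp(codec, "pcmu") || !strcasecmp(codec, "pcma"))
		return 8000;
	if(!strcasecmp(codec, "isac16"))
		return 16000;
	if(!strcasecmp(codec, "isac32"))
		return 32000;
	return 48000;	/* fallback */
}

/* RTP timestamp units elapsed over a wall-clock interval of `ms` milliseconds */
static uint32_t rtp_ts_elapsed(uint32_t clock_rate, uint32_t ms) {
	return (uint32_t)(((uint64_t)clock_rate * ms) / 1000);
}
```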

lminiero commented 4 years ago

c36e017cd580ca7d9994f1185e9254ad722312e9 should fix the issue for G.722. isac uses a dynamic payload type, so it's harder to handle and needs a more complex fix that can wait.

rvo337 commented 4 years ago

I have tried two video calls and my issue did not reproduce 🎉 Today and tomorrow I will make more test calls and report the final result 🤗

rvo337 commented 4 years ago

Hello :wave: After testing on different devices, the problem with the increasing target & min playout delay at the start of a video call did not reproduce.

But I have a new issue on the iOS platform (developed using the GoogleWebRTC library — https://cocoapods.org/pods/GoogleWebRTC) in this case:

  1. Alice starts a video room with Bob (audio only)
  2. Bob enables video — the lip-sync issue does not reproduce
  3. Bob disables video
  4. Bob enables video again — at this moment the lip-sync issue reproduces and the delay rapidly increases. If steps 3 and 4 are repeated, the delay increases further.

[screenshot attached] webrtc-internals dump — webrtc_dump.txt

What could be causing the problem? In the past, we switched from the opus audio codec to g722 because the problem did not occur with g722 when toggling video.


Update (10.11, 22:00):

Hi 👋 I tested the iOS issue on a peer-to-peer connection between two iPhone devices (using this simple example that uses the GoogleWebRTC library — https://github.com/stasel/WebRTC-iOS) and reproduced the problem, but after 5-10 seconds the lip-sync issue resolved itself. I'm ready to provide any dumps or logs you need.

rvo337 commented 4 years ago

Hello :wave:

I made the following changes:

  1. In the function janus_ice_outgoing_rtcp_handle (https://github.com/meetecho/janus-gateway/blob/master/ice.c#L3568), I deleted all calls to the janus_ice_relay_rtcp_internal function;
  2. In the function janus_rtcp_filter, I allowed RTCP_SR, RTCP_RR and RTCP_SDES packets to be relayed (I removed the line https://github.com/meetecho/janus-gateway/blob/master/rtcp.c#L580). These changes work fine for me and the lip-sync issue is resolved on all platforms. The delay also decreased from 300 to 100.

As far as I understand, the changes I made are quite drastic. I am going to use them for my own case (I only work with the videoroom plugin). What problems in Janus's behaviour could these modifications cause? As I understand it, my changes allow clients to pass RTCP packets (SR, RR) through Janus?

Thank you :hugs:

lminiero commented 4 years ago

Your changes would never be accepted in the Janus core. We filter out SR/RR originated by plugins on purpose, since we terminate RTCP ourselves. If you believe your approach does a better job, then please contribute fixes to the RTCP code in the core based on it instead.

rvo337 commented 4 years ago

I do not want my changes to be accepted. I will use these changes only for my own purposes. I wanted to ask you if there are any subtleties in the rtcp implementation that I may not know about, and what might cause problems in the future. :hugs:

chunqiulfq commented 4 years ago

Hello everyone, I use Janus as an SFU server. The plugin I use is one I developed myself: it does not modify the RTP data in any way, it just does simple stream forwarding. I also see an A/V sync issue.

I used the admin API to record the pcap file when it was out of sync:

I analyzed the error1.pcap from the pull side. Since it was recorded directly in Janus using the admin API, I think the RTP timestamps and the RTP timestamps in RTCP should increase monotonically and in sync.

For audio this is indeed the case: [screenshot attached]

However, the RTP timestamp in the video RTCP lags the timestamp in the RTP packets by more than 2300 ms (the video clock rate is 90000), which is consistent with the desync I observed on the pull side: [screenshot attached]

At this time, in the chrome://webrtc-internals page, the audio jitterBufferDelay/jitterBufferEmittedCount_in_ms is less than 100 ms, but the video jitterBufferDelay/jitterBufferEmittedCount_in_ms is over 2300 ms (no screenshot this time; refer to the previous one).

This shows that the RTP timestamp in the video RTCP is wrong on the pull side.

So I looked at send1.pcap on the push side, and confirmed that, for the video Janus receives from the publisher, the RTP timestamps and the RTP timestamps in RTCP increase monotonically and in sync: [screenshot attached]

After the video stream passes through Janus, the RTP and RTCP timestamps are no longer synchronized, and my plugin does not make any changes to the RTP timestamps, so I think Janus has a bug in handling RTP timestamps.

lminiero commented 4 years ago

Sorry but I'm not going to look at data coming from older instances. Please replicate with master.

atoppi commented 4 years ago

Hi @chunqiulfq, while trying to duplicate the issue with master, could you please enable the captures from the beginning?

chunqiulfq commented 4 years ago

Of course, I will try to capture.

chunqiulfq commented 4 years ago

Hi, luckily I was able to reproduce the issue today. The problem is similar to what I described earlier.

I started capturing when I started pushing:

This time, I also got the webrtc dump file on the pull side:

On the push side, the RTP timestamps and the RTP timestamps in RTCP increase monotonically and in sync.

On the pull side, the RTP timestamps and the RTP timestamps in RTCP are out of sync.

At this point, I took screenshots of jitterBufferDelay/jitterBufferEmittedCount_in_ms in chrome://webrtc-internals.

Audio jitterBufferDelay/jitterBufferEmittedCount_in_ms: [screenshot attached]

Video jitterBufferDelay/jitterBufferEmittedCount_in_ms: [screenshot attached]

lminiero commented 4 years ago

Thanks, I'll have a look and let you know if I find anything weird in there.

lminiero commented 4 years ago

From a first look at the .pcap files, I already found something very weird, which I'm sharing right away so that you guys can have a look too.

Looking at the two .pcap files, the payload types are 111=audio/100=video for the publish side, and 111=audio/102=video for the viewer side: nothing wrong in the payload type change, here. What's weird is that, if we just focus on audio for both dumps, on the viewer side we can see we send one with seq=1, but then 176 packets are NOT sent, since the next packet in the list has seq=178. While there may be different reasons for that (which I haven't investigated yet), the weird thing is that the timestamp change between the two makes no sense, as timestamp goes backwards, instead of increasing as it should. This simple table shows what's happening, displaying the sequence number and timestamp values for audio in both .pcaps:

| Pub. seq | Pub. ts | Sub. seq | Sub. ts |
|---|---|---|---|
| seq=7358 | ts=3906136602 | seq=1 | ts=281880 |
| seq=7359 | ts=3906137562 | (missing) | (missing) |
| ... | ... | ... | ... |
| seq=7535 | ts=3906306522 | seq=178 | ts=169920 |

As you can see, the next audio packet we send has a timestamp of 169920, which makes no sense: since we're jumping from sequence number 1 to 178, accounting for a timestamp step of 960 (for Opus), the new timestamp should have an increment of 169920, that is equal to 177*960, and incidentally exactly the difference between 3906306522 and 3906136602 (the timestamps of the 178th and 1st packets on the publisher side). The fact that the new timestamp for audio is now exactly 169920, instead of being 281880+169920 as it should, makes me think we have some weird edge case in janus_rtp_header_update (or janus_rtp_switching_context) that causes all this.

To make everything weirder, for video this does not happen, since all packets are there (no drops as for audio), and the timestamps seem to increase as expected, but guess what the first timestamp on the viewer side is? 281880, exactly as for audio! It may be a coincidence, or it may be an indication of something we should investigate more.

I'll keep on digging, but I'd ask you guys to please also have a look at the code to see if you can spot anything weird: fresh eyes always help, especially when the tired ones have looked at it all over without spotting anything :grin:

lminiero commented 4 years ago

The plugin used is developed by myself.

Wait, I missed this part... sorry, but I can't look at data not generated by our own plugins, since I have no idea what your plugin does and how it works. The broken behaviour I detected may very well be caused by your plugin. Please try replicating the issue with one of the stock plugins instead (VideoRoom if you're doing an SFU).

lminiero commented 4 years ago

As a side note, I wrote a simple .pcap parser that traverses all RTP packets, and then, to simulate a VideoRoom scenario, calls janus_rtp_header_update twice on each packet (because we do it once after we receive the packet and before passing it to the plugin, and once in the VideoRoom plugin itself before passing it back to the core). The output seems correct there:

Parsing /home/lminiero/send_0.7.6.pcap...
pt=111, seq=7358, ts=3906136602
  -- [audio-1] pt=111, seq=1, ts=0
  -- [audio-2] pt=111, seq=1, ts=0
pt=100, seq=6402, ts=2112989410
  -- [video-1] pt=100, seq=1, ts=0
  -- [video-2] pt=100, seq=1, ts=0
pt=100, seq=6403, ts=2112989410
  -- [video-1] pt=100, seq=2, ts=0
  -- [video-2] pt=100, seq=2, ts=0
pt=100, seq=6404, ts=2112989410
  -- [video-1] pt=100, seq=3, ts=0
  -- [video-2] pt=100, seq=3, ts=0
pt=111, seq=7359, ts=3906137562
  -- [audio-1] pt=111, seq=2, ts=960
  -- [audio-2] pt=111, seq=2, ts=960
[..]

So, unless the custom plugin is doing something wrong (again, no way to know), it may be some timing or initialization issue. The first timestamp for a new RTP switching context should always be 0 (exactly as we should always have sequence number 1 as a start), which isn't happening in the outgoing .pcap (see the previous post, where it's 281880 for both audio and video instead).
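To illustrate the rebasing described here, a minimal sketch (hypothetical names, not the actual janus_rtp_switching_context code) of how a switching context is expected to rebase an outgoing stream so that the subscriber sees seq=1 / ts=0 at the start:

```c
/* Hypothetical sketch of the seq/ts rebasing technique described above;
 * the real janus_rtp_header_update also handles SSRC changes, wraparound
 * bookkeeping and more. */
#include <stdbool.h>
#include <stdint.h>

typedef struct switching_ctx {
	bool have_base;
	uint16_t base_seq;	/* first sequence number seen from the publisher */
	uint32_t base_ts;	/* first RTP timestamp seen from the publisher */
} switching_ctx;

static void rewrite_header(switching_ctx *ctx, uint16_t *seq, uint32_t *ts) {
	if(!ctx->have_base) {
		ctx->have_base = true;
		ctx->base_seq = *seq;
		ctx->base_ts = *ts;
	}
	/* Unsigned arithmetic makes the subtraction robust to wraparound */
	*seq = (uint16_t)(*seq - ctx->base_seq + 1);	/* first packet -> seq=1 */
	*ts = *ts - ctx->base_ts;			/* first packet -> ts=0 */
}
```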

lminiero commented 4 years ago

Unrelated PS: after we fixed the G.722 desync a few days ago, the commit https://github.com/meetecho/janus-gateway/commit/3c8e9452718b1b7dc2cbd270a1f5ec46201c8b3f fixes it for isac16 and isac32 as well: we've made the code more flexible, and now we rely on the payloadtype/clockrate mapping to know dynamically what clock rate RTCP should refer to.

lminiero commented 4 years ago

@rvo337 you may want to check if https://github.com/meetecho/janus-gateway/commit/c0397cdba2787c4d64db91343d621d50d8092bfd fixes your desync when removing/adding video streams dynamically. I tested this briefly in the VideoRoom and it seems to perform better for me. The patch basically stops sending video SR (sender reports) if video has been disabled after an SDP update, while before it would keep on sending them even though there was no more outgoing video; SR are restored when video is added back. Of course the same applies to audio renegotiation.

In case it still doesn't work as expected, please no more screenshots, as they're useless to me. Provide a .pcap dump of the correct publisher and subscriber PeerConnections instead.

laurensramandt commented 4 years ago

I have been experiencing the exact same symptoms, and it seems to happen more slowly when the client forces sdpSemantics to plan-b instead of unified-plan.