jitsi / jitsi-videobridge

Jitsi Videobridge is a WebRTC-compatible video router or SFU that lets you build highly scalable video conferencing infrastructure (e.g., up to hundreds of conferences per server).
https://jitsi.org/jitsi-videobridge
Apache License 2.0

Video Freezing in Chrome #156

Open stongo opened 8 years ago

stongo commented 8 years ago

Using newer versions of the bridge, we have been experiencing random freezing of the video channel. It doesn't occur for every participant. I've included a webrtc-internals graph showing it happening just before 7:55pm. There's nothing abnormal in the bridge logs or our client's logs (we're not using Jitsi Meet). We ended up rolling back to version 564, where this does not happen. [image: talky-jvb-video-freeze] https://cloud.githubusercontent.com/assets/1449748/13441660/683e8f46-dfc6-11e5-823e-4d6514f96681.png

jitsi-developers commented 8 years ago

Just before 7:55, NACKs and PLIs increase and the received frame rate drops to 0, but I notice some peculiar behavior starting at 7:54. You don't seem to be using simulcast. Are you using RTCP termination? Could you please enable FINE logging on the bridge and logging on the client, and share the log files with us?

dev mailing list dev@jitsi.org Unsubscribe instructions and other list options: http://lists.jitsi.org/mailman/listinfo/dev

stongo commented 8 years ago

We aren't using simulcast, it's true.

I'm glad you brought up RTCP termination. The documentation is outdated, and we aren't sure how to make it work.

We'd like to enable what used to be org.jitsi.impl.neomedia.rtcp.termination.strategies.HighestQualityRTCPTerminationStrategy, but setting it according to the docs fails.

Maybe setting it correctly might fix the issue?

jitsi-developers commented 8 years ago

HQRTS no longer exists. I would suggest either enabling BasicRTCPTerminationStrategy or disabling RTCP termination completely (just don't set anything). The correct way to set BRTS is:

org.jitsi.videobridge.rtcp.strategy=org.jitsi.impl.neomedia.rtcp.termination.strategies.BasicRTCPTerminationStrategy
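Because a stale strategy name is easy to miss, a config file can be sanity-checked before deploying. Below is a minimal sketch that uses only java.util.Properties. It reads the org.jitsi.videobridge.rtcp.strategy key quoted above and flags the HighestQualityRTCPTerminationStrategy class, which this thread says no longer exists. The RtcpStrategyCheck class and its methods are hypothetical, and the bridge's own config loading may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class RtcpStrategyCheck {
    // Key and class names as quoted in this thread.
    public static final String STRATEGY_KEY = "org.jitsi.videobridge.rtcp.strategy";
    public static final String PKG = "org.jitsi.impl.neomedia.rtcp.termination.strategies.";
    public static final String BASIC = PKG + "BasicRTCPTerminationStrategy";
    public static final String REMOVED_HQ = PKG + "HighestQualityRTCPTerminationStrategy";

    /** Returns the configured strategy class, or null when RTCP termination is left unset. */
    public static String configuredStrategy(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(STRATEGY_KEY);
        // Properties.load keeps trailing whitespace in values, so trim it here.
        return (value == null || value.trim().isEmpty()) ? null : value.trim();
    }

    /** True if the file still points at the strategy the thread says was removed. */
    public static boolean usesRemovedStrategy(String propertiesText) {
        return REMOVED_HQ.equals(configuredStrategy(propertiesText));
    }

    public static void main(String[] args) {
        String conf = STRATEGY_KEY + "=" + BASIC + "\n";
        System.out.println(configuredStrategy(conf));
        System.out.println(usesRemovedStrategy(STRATEGY_KEY + "=" + REMOVED_HQ));
    }
}
```

A null result corresponds to the "just don't set anything" option, which leaves RTCP termination disabled.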


stongo commented 8 years ago

We tried BasicRTCPTerminationStrategy, but it was throwing an index-out-of-bounds exception. One change I did make that seems to help was removing org.jitsi.impl.neomedia.transform.srtp.SRTPCryptoContext.checkReplay=false. In our staging environment the issue seems to have gone away, but staging isn't always reliable for reproducing the bug. I'll plan a production deploy tomorrow that turns off RTCP termination and removes the checkReplay setting, and I'll update this issue again. Thanks for the help!
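Judging by its name (an assumption, since the thread doesn't say exactly what it controls), the checkReplay property toggles SRTP replay protection. RFC 3711 describes replay protection as a sliding-window replay list over packet indices, with a window of at least 64 packets. Below is a minimal sketch of such a window; the ReplayWindow class name is hypothetical. Real SRTP derives a 48-bit packet index from the rollover counter and the 16-bit sequence number, and this sketch assumes that step has already been done.

```java
/** Hypothetical sliding-window replay check in the style RFC 3711 describes. */
public class ReplayWindow {
    private static final int WINDOW_SIZE = 64;
    private long highestIndex = -1; // highest packet index accepted so far
    private long bitmap = 0;        // bit i set => index (highestIndex - i) was seen

    /** Returns true and records the index if the packet is new; false if replayed or too old. */
    public boolean checkAndUpdate(long index) {
        if (highestIndex < 0) { // first packet ever seen
            highestIndex = index;
            bitmap = 1L;
            return true;
        }
        if (index > highestIndex) {
            // Slide the window forward; a jump of 64 or more clears all history.
            long shift = index - highestIndex;
            bitmap = (shift >= WINDOW_SIZE) ? 1L : (bitmap << shift) | 1L;
            highestIndex = index;
            return true;
        }
        long delta = highestIndex - index;
        if (delta >= WINDOW_SIZE) {
            return false; // older than the window: cannot prove it's new, so reject
        }
        long mask = 1L << delta;
        if ((bitmap & mask) != 0) {
            return false; // already received: a replay
        }
        bitmap |= mask; // late but within the window: accept once
        return true;
    }
}
```

A packet that arrives late but still inside the window is accepted once. This is how a receiver can accept a late retransmission of a packet it never got while still rejecting true duplicates.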

stongo commented 8 years ago

Video is still freezing. It's pretty easy to reproduce right now on https://talky.io with 3+ callers, as I haven't rolled back yet. I also have a webrtc-internals dump if that would help.

jitsi-developers commented 8 years ago

Hi Marcus, in order to help us understand the situation we need logs from the bridge, the sip-communicator.properties file you use to configure the bridge and screenshots from the webrtc-internals page (because it's much quicker than having to graph the raw data).

Best, George


jitsi-developers commented 8 years ago

P.S. Before you collect any logs from the bridge, it would be helpful to set the global log level to FINE by editing the logging.properties file.
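Editing the file is the normal route, but the same settings can also be applied from code, which can help in a test harness. Below is a minimal sketch. It assumes the bridge logs through java.util.logging, which the logging.properties file and the FINE level suggest. The .level key sets the root logger. The console handler needs its own level because it defaults to INFO and would otherwise still drop FINE records. FineLogging is a hypothetical class name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class FineLogging {
    // Same keys you would put in a logging.properties file.
    static final String CONFIG = ".level=FINE\n"
        + "handlers=java.util.logging.ConsoleHandler\n"
        + "java.util.logging.ConsoleHandler.level=FINE\n";

    /** Replaces the current java.util.logging configuration with CONFIG. */
    public static void applyFineLogging() {
        try {
            LogManager.getLogManager().readConfiguration(
                new ByteArrayInputStream(CONFIG.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        applyFineLogging();
        Logger log = Logger.getLogger("org.jitsi.videobridge");
        System.out.println(Logger.getLogger("").getLevel()); // root level is now FINE
        log.fine("FINE records now reach the console handler");
    }
}
```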


stongo commented 8 years ago

George, thanks for the help! I think the graph above should suffice then. Here's the sip-communicator.properties:

org.jitsi.videobridge.TCP_HARVESTER_MAPPED_PORT=443
org.jitsi.videobridge.TCP_HARVESTER_PORT=4443
org.jitsi.videobridge.STATISTICS_TRANSPORT=pubsub
org.jitsi.videobridge.ENABLE_STATISTICS=true
org.jitsi.videobridge.STATISTICS_INTERVAL=15000
org.jitsi.videobridge.PUBSUB_SERVICE=pubsub.foo.bar
org.jitsi.videobridge.PUBSUB_NODE=videobridge
org.jitsi.videobridge.SINGLE_PORT_HARVESTER_PORT=-1
org.ice4j.ice.harvest.ALLOWED_INTERFACES=bond0

Here's a sampling of the logs https://ghostbin.com/paste/adfg9

fippo commented 8 years ago

george: http://fippo.github.io/webrtc-dump-importer/ gives you nice graphs in a matter of seconds. Zoomable even.

damencho commented 8 years ago

Hey @fippo, is there a way to make those dumps from JS code? I'm asking whether it's possible to capture the dumps during Selenium testing. Thanks.

fippo commented 8 years ago

@damencho do you know your own code? :-p traceablepeerconnection was built exactly for this. I suppose you could also open webrtc-internals in an extra tab in Selenium, but I've never tried it.

jitsi-developers commented 8 years ago

Thanks @fippo!

@Marcus The log snapshot that you shared is filled with XMPP ping timeouts and retransmission requests from the clients. There could be something wrong with our NACK termination implementation (which is enabled by default in recent versions of the bridge), or it could be a problem with the network.

You can add the following two lines to the sip-communicator.properties file to disable NACK termination:

org.jitsi.service.neomedia.VideoMediaStream.REQUEST_RETRANSMISSIONS=false
org.jitsi.videobridge.DISABLE_NACK_TERMINATION=true

Try that and let us know how it goes.

Best, George
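Roughly, NACK termination means the bridge answers receivers' retransmission requests from its own cache of recently forwarded packets, rather than relaying every NACK to the original sender. This summary is our reading of the term and is not taken from the bridge's code. The following is a hypothetical, heavily simplified sketch of that idea. NackTerminator and its methods are invented names, and the sketch ignores 16-bit sequence-number wraparound and RTCP packet formats.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: answer NACKs from a bounded cache of forwarded packets. */
public class NackTerminator {
    private final int capacity;
    // Insertion-ordered map used as a FIFO cache: sequence number -> packet bytes.
    private final LinkedHashMap<Integer, byte[]> cache;

    public NackTerminator(int capacity) {
        this.capacity = capacity;
        this.cache = new LinkedHashMap<Integer, byte[]>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                return size() > NackTerminator.this.capacity; // evict the oldest packet
            }
        };
    }

    /** Called for every packet the bridge forwards to receivers. */
    public void onPacketForwarded(int seq, byte[] packet) {
        cache.put(seq, packet);
    }

    /**
     * Handles a receiver's NACK. Cached packets are appended to retransmitNow;
     * the returned list holds sequence numbers the cache no longer has, which
     * would have to be requested from the sender instead.
     */
    public List<Integer> onNack(List<Integer> lostSeqs, List<byte[]> retransmitNow) {
        List<Integer> askSender = new ArrayList<>();
        for (int seq : lostSeqs) {
            byte[] packet = cache.get(seq);
            if (packet != null) {
                retransmitNow.add(packet);
            } else {
                askSender.add(seq);
            }
        }
        return askSender;
    }
}
```

The cache size trades memory for coverage. A cache that is too small on a lossy, high-delay link would push most NACKs back to the sender anyway.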


bgrozev commented 8 years ago

Here's a sampling of the logs https://ghostbin.com/paste/adfg9

2016-03-03 11:46:34.065 WARNING: [90782] org.jitsi.videobridge.transform.RtxTransformer.warn() Cannot find SSRC for RTX, retransmitting plain.

This could well indicate a problem with packet retransmissions, which could explain the freeze. We don't run into it, because we haven't yet enabled RTX.

One reason for the bridge not finding the SSRC could be that it wasn't signaled to it. Can you include more of the logs? Specifically the RECV/SENT lines. Also make sure you are using a recent bridge version (which includes Lance's fix).
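For context, the warning refers to RTX (RFC 4588). With RTX, a retransmission is sent on a separate SSRC and payload type, and the original sequence number (OSN) is prepended to the original payload. The pairing of primary and RTX SSRCs is typically signaled with an a=ssrc-group:FID line. Below is a hypothetical sketch of that lookup, including the "retransmitting plain" fallback the warning describes when no RTX SSRC is known. RtxSketch and its methods are invented for illustration and are not the bridge's RtxTransformer.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of RFC 4588 RTX encapsulation with a plain fallback. */
public class RtxSketch {
    /** Minimal stand-in for an RTP packet: a few header fields plus payload. */
    public static final class Packet {
        public final long ssrc;
        public final int seq;
        public final int payloadType;
        public final byte[] payload;

        public Packet(long ssrc, int seq, int payloadType, byte[] payload) {
            this.ssrc = ssrc;
            this.seq = seq;
            this.payloadType = payloadType;
            this.payload = payload;
        }
    }

    private final Map<Long, Long> rtxSsrcByPrimary = new HashMap<>(); // from signaling
    private final Map<Long, Integer> rtxSeqBySsrc = new HashMap<>();  // RTX has its own seq space
    private final int rtxPayloadType;

    public RtxSketch(int rtxPayloadType) {
        this.rtxPayloadType = rtxPayloadType;
    }

    /** Records a primary/RTX SSRC pair, e.g. learned from a=ssrc-group:FID. */
    public void signalRtxSsrc(long primarySsrc, long rtxSsrc) {
        rtxSsrcByPrimary.put(primarySsrc, rtxSsrc);
    }

    public Packet retransmit(Packet original) {
        Long rtxSsrc = rtxSsrcByPrimary.get(original.ssrc);
        if (rtxSsrc == null) {
            // No RTX SSRC was signaled: resend the original packet unchanged.
            return original;
        }
        // RTX payload = 2-byte OSN (network byte order) + original payload.
        ByteBuffer buf = ByteBuffer.allocate(2 + original.payload.length);
        buf.putShort((short) original.seq);
        buf.put(original.payload);
        int seq = rtxSeqBySsrc.merge(rtxSsrc, 1, Integer::sum) & 0xFFFF;
        return new Packet(rtxSsrc, seq, rtxPayloadType, buf.array());
    }
}
```

If the FID pairing never reaches the component doing the retransmission, every retransmission takes the fallback path. This matches the possibility raised above that the SSRC was not signaled.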

bgrozev commented 8 years ago

Just found a little bug, preparing a fix. You may want to delay your testing a bit.

stongo commented 8 years ago

Okay, great. I'll test on our staging site as soon as you let me know. Just so you know, we had been using one of the latest versions with Lance's fix too.

bgrozev commented 8 years ago

Videobridge 672 includes the fix.

stongo commented 8 years ago

We deployed 672 both with and without the suggested NACK settings, and unfortunately it doesn't work at all now:

org.jitsi.impl.osgi.framework.launch.FrameworkImpl.startLevelChanged() Error changing start level
org.osgi.framework.BundleException: BundleActivator.start
    at org.jitsi.impl.osgi.framework.BundleImpl.start(BundleImpl.java:313)
    at org.jitsi.impl.osgi.framework.launch.FrameworkImpl.startLevelChanged(FrameworkImpl.java:460)
    at org.jitsi.impl.osgi.framework.startlevel.FrameworkStartLevelImpl$Command.run(FrameworkStartLevelImpl.java:126)
    at org.jitsi.impl.osgi.framework.AsyncExecutor.runInThread(AsyncExecutor.java:111)
    at org.jitsi.impl.osgi.framework.AsyncExecutor.access$000(AsyncExecutor.java:17)
    at org.jitsi.impl.osgi.framework.AsyncExecutor$1.run(AsyncExecutor.java:220)
Caused by: java.lang.NoClassDefFoundError: net/java/sip/communicator/impl/protocol/jabber/extensions/colibri/HealthCheckIQ
    at org.jitsi.videobridge.VideobridgeBundleActivator.start(VideobridgeBundleActivator.java:59)
    at org.jitsi.impl.osgi.framework.BundleImpl.start(BundleImpl.java:293)
    ... 5 more
Caused by: java.lang.ClassNotFoundException: net.java.sip.communicator.impl.protocol.jabber.extensions.colibri.HealthCheckIQ
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more
2016-03-03 23:00:00.766 SEVERE: [17] org.jitsi.videobridge.stats.PubSubStatsTransport.publishStatistics().282 Failed to publish to PubSub node: videobridge - it does not exist yet

Going to roll back one version and test with the NACK settings as well.

stongo commented 8 years ago

670 with suggested NACK settings passes staging tests. Will deploy to production tomorrow morning and report back

bgrozev commented 8 years ago

Not sure what the problem with 672 is, possibly just the package was not properly built. In any case, if you get a chance to test this on 672+ without disabling NACK termination, please let us know.

stongo commented 8 years ago

Still freezing in production on 670 with NACK changes. Will give a release > 672 a try

jitsi-developers commented 8 years ago

Hi Marcus, did you have any better luck with jvb > 672? I've had a discussion with Boris and please note that it is NOT a good idea to disable NACK termination as I suggested initially, so if you still have problems, please remove the two NACK termination related configuration options from sip-communicator.properties file and try again.


stongo commented 8 years ago

Still freezing in 681

stongo commented 8 years ago

Disabling RTX seems like a promising fix. I wasn't able to reproduce the freezing on staging. I'll confirm for sure with a production deploy on Monday.

stongo commented 8 years ago

We've been running 681 in production with RTX disabled for most of the week. Our feedback form didn't receive a single report of freezing, and our Friday update conference also went well. It seems to be fixed!

bgrozev commented 8 years ago

Thanks for the feedback, @stongo! We should be looking into enabling RTX in jitsi-meet in the next couple of weeks, will let you know if we find any issues.

davidertel commented 8 years ago

@stongo how did you disable RTX as you mentioned above?

bgrozev commented 8 years ago

An update on this: we've been working on RTX in the last couple of weeks. We fixed multiple issues, and as far as we know current videobridge versions work correctly with RTX. So, I think this is ready for testing.

We are not yet enabling it in jitsi-meet, because we are running into some problems managing SDP when muting/unmuting (these are jitsi-meet specific issues).

stongo commented 8 years ago

@bgrozev awesome, I'll give it a try on staging again and let you know

@davidertel are you using Jitsi Meet or something else?

jitsi-developers commented 8 years ago

We've been looking at retransmission in general (not only out-of-band RTX, but in-band retransmission via NACK as well) and have noticed that, when we limit the bandwidth on clients, Chrome seems to do a poor job of obeying the detected bandwidth. Because of this, when loss occurs (due to Chrome sending more bits than it should), lots of NACKs start up, and Chrome can refuse to retransmit the lost packets because it detects that retransmitting would make it send too much data. We thought this might be H.264-specific, but we were able to reproduce it with VP8 as well. Judging from the bweforvideo graphs, Chrome seems to detect the available bandwidth correctly but regularly sends more. The result is lots of periods of frozen video on the receiver.

Just a heads-up on something we've seen. We want to gather some more data and file a bug against Chrome.


xdumaine commented 8 years ago

@stongo We (Dave and I and company) are using a web client with jingle.js via a focus controller a la talky (but it's our own node.js focus controller).

bgrozev commented 8 years ago

You need to remove the "a=rtpmap:XXX RTX/90000" lines from the SDP you pass to your clients.
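Below is a minimal Java sketch of that SDP munging. It goes slightly beyond the advice above, and that extension is our assumption. Besides the a=rtpmap lines, it also drops the matching a=fmtp (apt=...) and a=rtcp-fb lines and removes the RTX payload type from the m= line, so that the remaining SDP stays self-consistent. It works section by section, since payload types are scoped to an m= section. It does not touch any a=ssrc-group:FID or a=ssrc lines that may reference RTX SSRCs. SdpRtxStripper is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SdpRtxStripper {
    private static final Pattern RTX_RTPMAP =
        Pattern.compile("a=rtpmap:(\\d+) rtx/90000", Pattern.CASE_INSENSITIVE);

    public static String stripRtx(String sdp) {
        String[] lines = sdp.split("\r?\n");
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < lines.length) {
            // A section runs from the session start or an m= line up to the next m= line.
            int end = start + 1;
            while (end < lines.length && !lines[end].startsWith("m=")) end++;
            Set<String> rtxPts = new HashSet<>();
            for (int j = start; j < end; j++) {
                Matcher m = RTX_RTPMAP.matcher(lines[j]);
                if (m.matches()) rtxPts.add(m.group(1));
            }
            for (int j = start; j < end; j++) {
                String line = lines[j];
                if (isAttributeFor(line, rtxPts)) continue; // drop RTX-specific attributes
                out.add(line.startsWith("m=") ? removePayloadTypes(line, rtxPts) : line);
            }
            start = end;
        }
        return String.join("\r\n", out) + "\r\n";
    }

    private static boolean isAttributeFor(String line, Set<String> pts) {
        for (String prefix : new String[] {"a=rtpmap:", "a=fmtp:", "a=rtcp-fb:"}) {
            if (line.startsWith(prefix)) {
                String rest = line.substring(prefix.length());
                int space = rest.indexOf(' ');
                return pts.contains(space < 0 ? rest : rest.substring(0, space));
            }
        }
        return false;
    }

    private static String removePayloadTypes(String mLine, Set<String> pts) {
        // m=<media> <port> <proto> <fmt> ...: formats start at the fourth token.
        String[] tokens = mLine.split(" ");
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i < 3 || !pts.contains(tokens[i])) kept.add(tokens[i]);
        }
        return String.join(" ", kept);
    }
}
```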

xdumaine commented 8 years ago

I've found that more often than not, filing the bug early leads to quicker results. The Chrome team is helpful in identifying workarounds and fixes. We can nudge them for feedback. If you have a dump showing that behavior, let's get it filed with all the info we have. I'll try to get one as well.

fippo commented 8 years ago

No Chrome bug here...

xdumaine commented 8 years ago

We're getting freezing even without RTX in the SDP:

type: offer, sdp: v=0
o=- 1460724079835 1460724079848 IN IP4 0.0.0.0
s=-
t=0 0
a=group:BUNDLE video audio data
m=video 1 UDP/TLS/RTP/SAVPF 100 116 117
c=IN IP4 0.0.0.0
a=rtcp:1 IN IP4 0.0.0.0
a=ice-ufrag:apt0u1agcv17b5
a=ice-pwd:1tpk3nc33btjm76p0h6mmk0pt7
a=fingerprint:sha-1 86:1A:F4:87:D2:95:59:62:38:6C:54:E0:A1:0C:C1:AA:08:B4:DF:0F
a=setup:actpass
a=sendrecv
a=mid:video
a=rtcp-mux
a=rtpmap:100 VP8/90000
a=rtcp-fb:100 ccm fir
a=rtcp-fb:100 nack
a=rtcp-fb:100 nack pli
a=rtcp-fb:100 goog-remb
a=rtpmap:116 red/90000
a=rtpmap:117 ulpfec/90000
a=extmap:2 urn:ietf:params:rtp-hdrext:toffset
a=extmap:3 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=candidate:1 1 SSLTCP 2130706431 172.18.27.111 4443 typ host generation 0
a=candidate:3 1 UDP 2130706431 172.18.27.111 10000 typ host generation 0
a=candidate:2 1 SSLTCP 1694498815 54.165.101.244 4443 typ srflx raddr 172.18.27.111 rport 4443 generation 0
a=candidate:4 1 UDP 1677724415 54.165.101.244 10000 typ srflx raddr 172.18.27.111 rport 10000 generation 0
m=audio 1 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8
c=IN IP4 0.0.0.0
a=rtcp:1 IN IP4 0.0.0.0
a=ice-ufrag:apt0u1agcv17b5
a=ice-pwd:1tpk3nc33btjm76p0h6mmk0pt7
a=fingerprint:sha-1 86:1A:F4:87:D2:95:59:62:38:6C:54:E0:A1:0C:C1:AA:08:B4:DF:0F
a=setup:actpass
a=sendrecv
a=mid:audio
a=rtcp-mux
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10
a=rtpmap:103 ISAC/16000
a=rtpmap:104 ISAC/32000
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:3 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=candidate:1 1 SSLTCP 2130706431 172.18.27.111 4443 typ host generation 0
a=candidate:3 1 UDP 2130706431 172.18.27.111 10000 typ host generation 0
a=candidate:2 1 SSLTCP 1694498815 54.165.101.244 4443 typ srflx raddr 172.18.27.111 rport 4443 generation 0
a=candidate:4 1 UDP 1677724415 54.165.101.244 10000 typ srflx raddr 172.18.27.111 rport 10000 generation 0
m=application 1 DTLS/SCTP 5000
c=IN IP4 0.0.0.0
a=ice-ufrag:apt0u1agcv17b5
a=ice-pwd:1tpk3nc33btjm76p0h6mmk0pt7
a=fingerprint:sha-1 86:1A:F4:87:D2:95:59:62:38:6C:54:E0:A1:0C:C1:AA:08:B4:DF:0F
a=setup:actpass
a=sctpmap:5000 webrtc-datachannel 1024
a=mid:data
a=candidate:1 1 SSLTCP 2130706431 172.18.27.111 4443 typ host generation 0
a=candidate:3 1 UDP 2130706431 172.18.27.111 10000 typ host generation 0
a=candidate:2 1 SSLTCP 1694498815 54.165.101.244 4443 typ srflx raddr 172.18.27.111 rport 4443 generation 0
a=candidate:4 1 UDP 1677724415 54.165.101.244 10000 typ srflx raddr 172.18.27.111 rport 10000 generation 0

brianh5 commented 8 years ago

We just reproduced the freezes on AppRTC by limiting the uplink bandwidth of one sender to 1.5 Mbps. We see it with both H.264 and VP8. Chrome detects the available send bandwidth correctly, but with RTX it regularly exceeds it, which causes more loss and freezes (Chrome will also refuse to send RTX if its bitrate is too high, so I think there's a vicious cycle here). We just got some screenshots from webrtc-internals and are going to file something today.

brianh5 commented 8 years ago

Filed against chrome here https://bugs.chromium.org/p/webrtc/issues/detail?id=5797

bradrlaw commented 8 years ago

Where do we stand on this issue? The linked Chrome issue appears to have been closed without anything being resolved. We are running into this issue constantly, which makes Jitsi unusable for any production use. This happens with our own installs, regardless of patch level, as well as with the demo at http://meet.jit.si.

One symptom is extremely high packet loss once an endpoint has less than 1.5 Mbps available. The video intermittently freezes for 5 to 15 seconds or more.

joelbrewer commented 7 years ago

Any update on this? We are considering a switch to jitsi-videobridge -- however, random freezing on Chrome could be a non-starter.

bbaldino commented 7 years ago

Regarding my previous comment about the Chrome issue (posted as "brianh5" above): we found that the way we were simulating low-bandwidth links was inaccurate. The network simulator on Mac, for example, simulates a lower-bandwidth link by dropping packets but doesn't add any delay, and Chrome relies heavily on delay to lower its bandwidth estimate. Once we simulated things properly, we no longer saw the issue, so we told them they could close that bug.

A couple of months ago I also found a bug in the bridge's port of the WebRTC bandwidth estimation logic: it failed to take delay into account (https://github.com/jitsi/libjitsi/issues/212). Fixing it resulted in much better performance on high-delay links, which are common among poor links that also have low bandwidth. Other than those two scenarios, I wasn't aware of any other freezing issues with Chrome.
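To illustrate why delay matters here, below is a simplified, hypothetical detector in the spirit of WebRTC's delay-based estimation (Google Congestion Control). For consecutive packet groups, it compares how far apart they arrived with how far apart they were sent. The send times could come from the abs-send-time header extension that appears in the SDP above. A growing difference means queues are building. Several simplifications are our assumptions: an exponential moving average stands in for GCC's Kalman or trendline filter, the threshold is fixed rather than adaptive, and DelayOveruseDetector is an invented name, not libjitsi's code.

```java
/** Hypothetical, simplified delay-based overuse detector (GCC-inspired). */
public class DelayOveruseDetector {
    public enum Signal { NORMAL, OVERUSE, UNDERUSE }

    private final double thresholdMs;
    private final double alpha;      // EWMA smoothing factor in (0, 1]
    private double estimateMs = 0.0; // smoothed inter-group delay variation
    private boolean hasPrevious = false;
    private double prevSendMs;
    private double prevArrivalMs;

    public DelayOveruseDetector(double thresholdMs, double alpha) {
        this.thresholdMs = thresholdMs;
        this.alpha = alpha;
    }

    /** sendMs: sender timestamp of the group; arrivalMs: receiver clock at arrival. */
    public Signal onPacketGroup(double sendMs, double arrivalMs) {
        if (hasPrevious) {
            // d = (arrival gap) - (send gap): positive when one-way delay is growing.
            double d = (arrivalMs - prevArrivalMs) - (sendMs - prevSendMs);
            estimateMs = alpha * d + (1 - alpha) * estimateMs;
        }
        prevSendMs = sendMs;
        prevArrivalMs = arrivalMs;
        hasPrevious = true;
        if (estimateMs > thresholdMs) return Signal.OVERUSE;
        if (estimateMs < -thresholdMs) return Signal.UNDERUSE;
        return Signal.NORMAL;
    }
}
```

A link simulator that only drops packets leaves the arrival gaps of the surviving packets equal to their send gaps. In that case d stays near zero, and a delay-based detector like this never signals overuse. This is consistent with the simulation problem described above.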