w3c / webrtc-extensions

A repository for experimental additions to the WebRTC API
https://w3c.github.io/webrtc-extensions/

Add jitterBufferMaximumDelay attribute to RTCRtpReceiver #199

Open eldarrello opened 7 months ago

eldarrello commented 7 months ago

Background

There is currently no way to force the jitter buffer to operate at a specific delay. The existing attribute jitterBufferTarget enables setting a minimum delay, but the upper bound is not controllable from the application layer, meaning that the jitter buffer can increase the delay according to network conditions. There are use cases where having full control over the jitter buffer delay is essential. One example is a scenario where several jitter buffers need to be synchronised, and maintaining good synchronisation is much more important than potentially losing some packets due to too little buffering.

Proposed API

partial interface RTCRtpReceiver {
  attribute DOMHighResTimeStamp? jitterBufferMaximumDelay;
};

Details

The jitterBufferMaximumDelay attribute would allow the application to control the maximum amount of buffering in the jitter buffer.
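
A minimal usage sketch of how this could look from the application side, assuming the attribute ships as proposed and, like jitterBufferTarget, takes milliseconds (pc is an already-established RTCPeerConnection):

// pc is an already-established RTCPeerConnection; jitterBufferMaximumDelay is
// the attribute proposed in this issue and does not exist in browsers today.
for (const receiver of pc.getReceivers()) {
  if (receiver.track.kind !== 'audio') continue;
  receiver.jitterBufferTarget = 80;         // existing lower bound: buffer at least ~80 ms
  receiver.jitterBufferMaximumDelay = 120;  // proposed upper bound: never buffer more than ~120 ms
}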

alvestrand commented 7 months ago

Well - jitterBufferTarget is a target, it's not an "add this much delay" API. What's the use case where you need the exact control, and is it listed in webrtc-nv-use-cases?

eldarrello commented 7 months ago

Under low-latency streaming there is N38: "The application must be able to control the jitter buffer and rendering delay. This requirement is addressed by jitterBufferTarget, defined in [WebRTC-Extensions] Section 6." In my opinion it is only half addressed, as there is control over the minimum jitter buffer delay but nothing to limit the maximum delay. The use case I am interested in improving is when multiple participants join a meeting from their laptops in the same room; the remote audio playback then has to be synchronised to avoid echo from that room. The current AEC in WebRTC is not able to cancel out the playback signal from other laptops if it arrives earlier than the reference playback signal. If the playbacks are synchronised to within +/- 10 ms, the AEC can handle this pretty well.

alvestrand commented 7 months ago

So if the problem is that you have two laptops A and B in a room, and A receives the signal at T+10 while B receives it at T+30, fixing the jitter buffer of both to min=max=20 ms will still leave a 20 ms difference between the playouts. If you want to synchronize playout, it seems to me that you need the capture timestamp, plus synchronized clocks between the PCs, plus an agreement between the PCs on the offset between the capture timestamp and the playout time.

I don't think just controlling the jitter buffer is going to give you that.

eldarrello commented 7 months ago

There are of course more details on how to achieve the sync. Controlling the amount of buffering in the jitter buffer is just one piece of the puzzle.

eldarrello commented 6 months ago

Given the discussion in https://webrtc-review.googlesource.com/c/src/+/339901, it looks like a separate attribute maximumJitterBufferDelay makes more sense than adding another mode. @alvestrand, do you agree?

alvestrand commented 6 months ago

Seems more sensible to me. I assume that if packets don't come in within the max jitter buffer delay, the jitter buffer will synthesize samples.

jakobivarsson commented 6 months ago

I assume that if packets don't come in within the max jitter buffer delay, the jitter buffer will synthesize samples.

Contrary to popular belief, NetEq currently doesn't do this. It would be quite a substantial change and I'm not sure it is ideal from a quality perspective.

I'm fine with having a maximum target delay however (allowing the playout delay to be higher temporarily).

Well - jitterBufferTarget is a target, it's not an "add this much delay" API.

In practice it is implemented as a minimum target jitter buffer delay.

jakobivarsson commented 6 months ago

Also, for syncing streams, I'm not sure why a maximum delay is needed. For example, if you look at the audio video sync code in WebRTC, delay is only ever "added" (through setting the minimum delay) to the stream which is ahead.
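
For readers unfamiliar with that code, a simplified sketch of the idea (illustrative names only, not the actual WebRTC internals): extra minimum delay is applied only to the stream that is ahead, so it waits for the lagging one.

// Simplified illustration, not the actual WebRTC sync code: the stream with the
// smaller playout delay is ahead, and only that stream gets extra minimum delay.
function syncPair(audioPlayoutDelayMs, videoPlayoutDelayMs) {
  const diffMs = audioPlayoutDelayMs - videoPlayoutDelayMs;
  if (diffMs > 0) {
    return { extraVideoMinDelayMs: diffMs, extraAudioMinDelayMs: 0 };  // video is ahead, hold it back
  }
  return { extraVideoMinDelayMs: 0, extraAudioMinDelayMs: -diffMs };   // audio is ahead, hold it back
}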

eldarrello commented 6 months ago

Contrary to popular belief, NetEq currently doesn't do this.

NetEq would give out silence or comfort noise when the buffer runs empty, right? Or what kind of synthesising is meant here?

Also, for syncing streams, I'm not sure why a maximum delay is needed

I haven't looked at how the audio/video sync is implemented, but isn't it that delay is only ever added to video when audio is lagging? And delay for video can be applied quickly and precisely. If NetEq respected the jitterBufferTarget delay more precisely, it would almost work for syncing multiple audio streams. Being able to force a maximum delay helps to avoid the cases where NetEq leaves the delay above jitterBufferTarget.

jakobivarsson commented 6 months ago

NetEq would give out silence or comfort noise when the buffer runs empty, right? Or what kind of synthesising is meant here?

Sure, what I meant was that packets arriving later than the maximum delay are not discarded in favor of synthesising samples.

but isn't it that the delay is ever added only for video when audio is lagging

No, both audio and video can get delayed to sync.

Being able to force maximum delay helps to avoid the cases where NetEq leaves the delay above jitterBufferTarget.

If one stream is behind, why not just increase the delay on the other streams?

eldarrello commented 6 months ago

packets are not discarded

But that should be an easy change, and it can be done when the user sets a maximum delay, indicating that losing some delayed audio is ok. In the scenario of syncing multiple audio playbacks it is also not desirable to play out old audio, which may already have been heard from other playbacks.

if one stream is behind, why not just increase the delay on the other streams?

This has been tried and works to some extent if you can predict the reason why the actual playout delay is higher than the target. Otherwise there would be a loop adjusting the delay towards infinity. Imagine A finds that its delay is bigger than B's by 40 ms and tells B to increase its delay by 40 ms. Then B, instead of reaching the target precisely, also leaves its playout delay at +40 ms, and now A needs to adjust again. With some complex logic it may work in the direction of increasing delay, but once jitterBufferTarget is set the delay never starts to come down again when network conditions improve. Getting the delay back to lower values is even more complex.

jakobivarsson commented 6 months ago

Hm... if B is behind A by 40 ms and B has a minimum target of 40 ms, then you would first remove that minimum target before adding delay to A? It doesn't make sense to simultaneously delay both streams if you want to achieve the lowest possible latency.

eldarrello commented 6 months ago

Removing the delay constraint isn't really an option. With just A and B the example is of course very simplistic. There can be N streams to synchronise, and once the constraint is removed the delay can change too quickly in an unknown direction, which makes it really hard to maintain sync. Also, figuring out which buffers require a constraint and which do not at a given moment doesn't sound like something that could work reliably. My vision of how it can work is that all buffers are forced to a specific delay (setting min and max) at the same moment, according to the worst jitterBufferMinimumDelay monitored from the WebRTC stats. Very simple.
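
A sketch of the control loop I have in mind, assuming the proposed attribute exists; estimateCurrentDelayMs() stands in for the stats arithmetic (for example, deltas of jitterBufferDelay over jitterBufferEmittedCount from the inbound-rtp stats) and is not a real API:

// Force every audio receiver to the same delay, derived from the worst buffer.
// jitterBufferMaximumDelay is the proposed attribute; estimateCurrentDelayMs()
// is a placeholder for reading the inbound-rtp jitter buffer stats.
async function resyncReceivers(receivers) {
  const delaysMs = await Promise.all(receivers.map(r => estimateCurrentDelayMs(r)));
  const worstMs = Math.max(...delaysMs);          // the slowest buffer dictates the target
  for (const receiver of receivers) {
    receiver.jitterBufferTarget = worstMs;        // existing: do not play out earlier than this
    receiver.jitterBufferMaximumDelay = worstMs;  // proposed: do not buffer more than this
  }
}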

jakobivarsson commented 6 months ago

once the constraint is removed the delay can change too quickly in unknown direction

Remove it iteratively in steps until the right delay is achieved?

Could you explain in more detail why removing a delay constraint is not possible?

Really hard to maintain sync in such conditions.

In the example above the streams are already out of sync?

In my vision how it can work is that all buffers are forced to specific delay (setting min and max) at the same moment according to worse jitterBufferMinimumDelay monitored from webrtc stats.

Network latency / jitter and audio device latency will be different for different devices and can change over time, so the delay needs to be adjusted continuously and independently (as in different streams will have different delay settings to achieve sync) anyway?

eldarrello commented 6 months ago

Could you explain in more detail why removing a delay constraint is not possible? In the example above the streams are already out of sync?

I would need to experiment with it again to be more specific, but it is not a new idea and it was ruled out in the early stages of feature development. Basically, if you let one or more jitter buffers change their delay freely, it first takes time to measure the playout delay, then to communicate that to the other jitter buffers, and then more time until the target is reached. During that time the playout is out of sync and the AEC fails to cancel out the echo. The feature with synchronised playbacks is publicly available in a product and it is possible to play with it if there is interest. I am pretty sure that the results achievable using only jitterBufferTarget with the current NetEq behaviour have hit their limits, and that is why there is an effort to find ways to improve it.

One more example: let's say only one receiver gets a network glitch for 1 sec, which can easily happen in WiFi networks, while the other receivers keep operating normally. For the one having the glitch, the buffering delay jumps to 1 sec. It wouldn't make sense to try to adapt the rest of the receivers to the same 1 sec delay. This is where the jitterBufferMaximumDelay attribute comes to the rescue.

Network latency / jitter and audio device latency will be different for different devices and can change over time, so the delay needs to be adjusted continuously and independently (as in different streams will have different delay settings to achieve sync) anyway?

Yes. Correct.

eldarrello commented 6 months ago

I would need to write a proper explainer eventually, but for now I can reiterate some key points. The problem: synchronisation of audio playback streams on different computers. The existing jitterBufferTarget doesn't provide enough control, as the actual playout delay can stay at a much higher value than the requested target; a second problem is the sharp delay jump after an underrun, which ignores the target completely. If we take a 1 sec delay jump as an example, which is a reality in busy WiFi networks, then given the 100 ms per second time-stretching rate allowed in NetEq, it would take ~10 sec to adapt in the worst case, and 5 sec in the ideal case where the 1 sec buffer also starts to accelerate at the maximum rate after the glitch. Leaving the playouts out of sync for such long periods is not feasible. As far as I understood, the proposal to relax the 100 ms/sec stretching limitation was also a no-go. A suboptimal workaround is to mute the lagging playback until its playout delay recovers, but a much better solution would be to avoid big delay jumps in the first place, with the help of setting a max delay for the jitter buffer.
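
For scale, as I read that 100 ms/s limit: closing a 1000 ms gap from one side alone takes 1000 / 100 = 10 s, and if the lagging buffer accelerates at 100 ms/s while the others add delay at the same rate, the gap closes at 200 ms/s, i.e. 1000 / 200 = 5 s.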

jakobivarsson commented 6 months ago

Yes, a proper explainer would be good.

It sounds like you want to add a separate operating mode for NetEq, which I'm hesitant to add (for this potentially narrow use case) unless there is a lot of interest.

It also sounds like a potential footgun for developers where setting a low delay would result in terrible quality.

I could imagine other use cases for a maximum delay API that would like NetEq to operate with as low a delay as possible, or with an upper bound, but not throw away packets in the process, which would not fit your description.

eldarrello commented 6 months ago

It sounds like you want to add a separate operating mode for NetEq, which I'm hesitant to add (for this potentially narrow use case) unless there is a lot of interest.

Yes, from my perspective there is no preference between a separate operating mode, a maxDelay attribute or something else, as long as it allows forcing the jitter buffer to operate at a specific delay as precisely as possible. That is why I have proposed different options, to find out what suits best. First there was the option of changing the behaviour of the existing jitterBufferTarget to follow the target more closely. That indeed could break the expectations of users who are after the quality of not losing a word. Then there is mode vs. maxDelay. If a mode is added, it could also be called something like 'low latency mode', which would disable delay jumps and would not keep the delay higher after an underrun. There could be an option for users to prefer latency vs. quality (the risk of losing some words). If I am not mistaken, even the whole audio pipeline in WebRTC is built with low latency in mind, and eliminating the risk of underruns comes second.

It also sounds like a potential footgun for developers where setting a low delay would result in terrible quality.

If that is about changing the behaviour of the current jitterBufferTarget, then I agree. But if there is a separate max delay attribute, then I don't see how the potential impact can be misunderstood. There is no need to play with the max delay if you don't understand the consequences. Even today developers can set the maximum number of packets the jitter buffer can hold to a low value. Would the possibility to do it dynamically during a call be more dangerous? If there is good documentation in place it should be fine, I think.

eldarrello commented 6 months ago

narrow use case) unless there is a lot of interest.

Do there have to be many different web apps interested, or would a single one with millions of potential users count too? It is possible to add a usage counter.

alvestrand commented 5 months ago

Yes, an explainer would help.

The use case, as I understand it, is multiple speakers playing out the same source in the same (physical) room, with different network delays, and the desire to keep them in sync well enough to allow echo cancellation to cancel all the speakers at once.

One alternative design would be to timestamp the audio packets and have a local variable "timestamp to playout offset" that would hard-lock the playout of a given timestamp to the local system clock (choice of this value, and adjustment for things like clock drift, are left as an exercise for the reader); samples that arrived too late to make their scheduled playout time would be replaced with silence.
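
A rough sketch of that scheme, purely for illustration (every name below is a placeholder, and clock synchronisation and drift handling are omitted):

// Each endpoint plays a frame captured at time T at exactly T + PLAYOUT_OFFSET_MS
// on a shared, synchronised clock; frames that miss their slot become silence.
const PLAYOUT_OFFSET_MS = 250;  // agreed across endpoints, larger than the worst expected path delay

function handleDecodedFrame(captureTimestampMs, samples) {
  const playoutTimeMs = captureTimestampMs + PLAYOUT_OFFSET_MS;
  const remainingMs = playoutTimeMs - synchronizedClockMs();   // placeholder shared clock
  if (remainingMs <= 0) {
    emitSilence(samples.length);             // arrived too late for its scheduled slot
  } else {
    schedulePlayout(playoutTimeMs, samples); // hard-locked to the shared clock
  }
}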

eldarrello commented 4 months ago

One alternative design

Is that how Google Meet is doing it? https://workspaceupdates.googleblog.com/2024/05/google-meet-adaptive-audio.html?m=1 But at least it should be easier to explain the use case now since Google is doing the same thing.