Simulcast/SVC capabilities and error handling

aboba commented 6 years ago

In today's WebRTC WG Virtual Interim, questions were raised about how the user agent behaves when asked for simulcast or SVC configurations beyond its capabilities. For example, if an application asks for 5 simulcast streams when for that codec only 3 are supported. Questions:

a. In such a case, does sender.send() reject the promise or does it utilize the maximum number of supported simulcast streams?

b. If the latter, how does the application determine how many simulcast layers are sent?

ibc commented 6 years ago

The app will need to signal the exact and effective simulcast/SVC information to the remote. May be the RTCRtpSender need a way (method, property) to return its effective RTP parameters.

However, that wouldn't be very nice because the browser will start sending RTP before the app can signal the sending RTP params to the remote.

I think the problem here is the overall design of the RTCRtpSender interface. The send() method doesn't fit well with current and future needs (such as the one described in this issue).

Alternative:

sender.send(params)
    .then((effectiveParams) =>
    {
        return mySignaling.send(effectiveParams);
    })
    .then(() =>
    {
        // Start sending RTP now.
        sender.start();
    });

murillo128 commented 6 years ago

Not sure if it is really an issue, as a browse may decide to stop sending any simulcast/svc layer to accommodate to the available bitrate. So the SFU already has to support receiving (dynamically) a different layer structure as the one signaled/set.

ibc commented 6 years ago

Not sure if it is really an issue, as a browse may decide to stop sending any simulcast/svc layer to accommodate to the available bitrate.

That's a different issue. The browser is supposed to know, at least, what it is able to send even if later it disables a layer. For example, when using simulcast the remote needs to know the RTP params of each simulcast stream (SSRC, etc). The receiver (may be an SFU) need that info before packets arrive. However, the browser may decide to not send the high layer due to CPU usage or whatever at any time (even at the beginning) and may decide to enable or re-enable it later.

This is, the browser should never send RTP stuff that was not previously signaled to the remote.

murillo128 commented 6 years ago

So, you have proven my point.

If the browser supports less layers than what is set in the parameters it will only send a subset of those parameters, but not a different set.

If you send those full set of parameters (ssrsc,mids) to the sfu, it will have all the info available to start receiving the subset of simulcast streams available.

ibc commented 6 years ago

Well, for me it's a bit ridiculous that the app has no way to know the effective parameters that the browser is sending or is capable of sending. Theoretically the app can check the browser RTCRtpCapabilities and be aware of the max numbers of SVC layers it can send. However, those capabilities express nothing regarding how many simulcast layers can be sent.

IMHO it's not cool at all that the app says "hey RtpSender, send all of this" and then the RtpSender can just send a small subset of that without having the app a way to know it.

murillo128 commented 6 years ago

I agree that it is ridiculous, but it is what it is.

Just pointing out that it is not an issue from an SFU point of view.

ibc commented 6 years ago

@aboba said:

if an application asks for 5 simulcast streams when for that codec only 3 are supported.

May be RTCRtpCapabilities should include a maxSimulcastStreams field and, if the app calls send() with more than that, reject it.

@murillo128 said:

I agree that it is ridiculous, but it is what it is.

AFAIK the ORTC spec is not written in stone :)

aboba commented 6 years ago

@ibc @murillo128 @pthatcherg A somewhat related issue has been brought up in WebRTC 1.0: https://github.com/w3c/webrtc-pc/issues/1872

At the May 22 Virtual Interim, Harald proposed (for sendEncodings):

Set # of encodings to max # that can be sent
If max # changes over time, reduce (assumption: never increase)
Enable only the first one by default, more if user asks
Ignore requests for more layers than available

To make this work in ORTC, we would need to have sender.getParameters() so that the number of encodings could be retrieved. Might be worth covering this issue in a presentation at the June 19-20 WEBRTC WG Face-to-Face.

See slide 18 of the deck: https://docs.google.com/presentation/d/1PDZPb-SAfRDD54xe_j8TgaqaV6dBpHmVuw7H6Cg4Ro8/edit#slide=id.g37c88537b9_0_56

ibc commented 6 years ago

When would be sender.getParameters() available to provide effective data?

aboba commented 6 years ago

@ibc For WebRTC 1.0, the proposal is for sender.getParameters() to return the maximum number of encodings after the promise of addTransceiver() is resolved.

aboba commented 6 years ago

@ibc The browser can send fewer simulcast and SVC layers than the maximum capabilities of the codec. There is a proposal to allow the application to determine from statistics how many layers are being sent and also to obtain metrics on the layers. Perhaps the simplest way to resolve this issue for ORTC would be to add a maxSimulcasts attribute, and then during parameters validation, check whether the maximum simulcast, temporal and spatial capabilities are exceeded.

aboba commented 6 years ago

@ibc Your suggestion of maxSimulcastStreams is probably the simplest approach. I will work up a PR using that idea.

ibc commented 6 years ago

:)

robin-raymond commented 6 years ago

Here's my thoughts:

The maximum simulcast/SVC available can be hardware limited, CPU limited, network limited (and possibly some other strange limitations on a target platform
The target platform may have hard upper bounded capabilities (e.g. code may be hardware based with limited simultaneous usages)
Other limits can and do change, e.g. hardware acceleration may have been too busy to support 5 streams it's theoretically capable of doing, so the streams got limited to 3; but then later a hardware resource is freed and now it's capable of doing the upper bound of 5. Same is true for CPU and bandwidth
Hardware upper bound limits might also be based on scale of resources used. For example, can support 2 streams at 4k, 4 streams at 1080p, 16 streams at 480p.
- Expressing the hard upper bound in capabilities may be useful, but is unlikely to be fully expressive enough to indicate all the possible combinations and rules that might apply to every scenario.
There can be other parameters expressed beyond just streams that are unable to be discovered at this time (e.g. a single stream way too big to encode given the attached CPU) with no capabilities discovery possible for the limitations.

This leads me to ask:

What would a programmer expect if they asked for 5 streams but got 3 because the current resources were unable to deliver, and then changed back to 5 as resources became available?
What would a programmer expect if they asked for 5 streams but the system has a hard upper bound in capabilities of 3 streams so it's never possible to deliver 5?
What would a programmer expect if they asked for 5 streams but the system has a hard upper bound not expressed in capabilities because the capabilities are too complex to express with a simple "this meany simulcast/SVC streams" value?
How does whatever we do behave in other situations that have similar capabilities mismatch (e.g. ask for a codec that doesn't exist), or capability limits that cannot be expressed (stream too big to encode)?

In situations like asking for a codec that doesn't exist, I'm fine with an exception being thrown. It was very clear that the programmer asked for something that just doesn't exist. For capabilities that we can't express (like streams too big to encode), it's always a best effort, so the engine produces what it can do.

So if we define a hard upper limit for the number of simulcast/SVC in capabilities and it's violated, I do expect a rejection. But I suspect that most cases a hard limit won't be the issue and soft limits will be much more likely and those end up being "best effort". I'm not sure it's worth adding this hard limit definition because of it.

We still have an overall issue of how does a programmer know what's going on "right now". That changes over time as resources change. I think this is a different issue though.

ibc commented 6 years ago

Good points @robin-raymond. Options are:

Capabilities say maxSimulcastStreams: 5 for a specific video codec, but current software limits them them to just 3 streams. The app calls rtpSender.send() with 5 streams (encodings):
1. send() is rejected with a specific error telling that there are "too much encodings". This is hard to react to from the app point of view.
2. send() resolves, but its effective rtpParameters.encodings array just contains 3 entries.

This is like asking for something and then checking the effective applied values, which goes against the ORTC nature in which the app is supposed to know everything it can do before attempting to do it.

So, here a proposal: dynamic RTP capabilities:

RTCRtpSender.getCapabilities() returns a different result depending on when it's called so, if called when CPU is very busy (or other circumstances than decrease the maximum number of simulcast layers) it returns a different set of capabilities (per codec).
However, such a "dynamic" getCapabilities() should guarantee that, the capabilities should be valid within a sort period of time (between getCapabilities() and send() are called).

robin-raymond commented 6 years ago

@ibc I think this is actually a bit more of a generalized issue and I don't think your recommendation can realistically be implemented under a lot of conditions (sadly). I'll explain why I think this to be the case...

We already have maxTemporalLayers and maxSpatialLayers in capabilities on codecs. But at this point a rejection does not happen if the programmer exceeds these capabilities and we do not describe the checks needed were the engine should reject. (Note: Maybe the spec should describe rejection scenarios if exceeded, and that's a mistake, but it's a whole lot more complicated that this issue).

The trouble is with these settings, just like maxSimulcastStreams is they are just not expressive enough to describe all the real world hard legal limits that exist. For example, maxTemporalLayers might be 2 and maxSpatialLayers might be 2 but they can't be used together. Or VP9 codec might support both, but in only certain configurations and in certain scales supported in certain combinations. Worse, these limits might change based on hardware which may not be easy to discover by an engine until time of acquisition running on top of an OS (or might be allowed by the OS right now but based on changing factors [e.g. even crazy things like battery level] might not be allowed later in time). These are all real hard limits but asking for a complex capabilities to describe all this stuff is a bottomless pit. I fear in trying to define them, we will never get it right and they will never be expressive enough. If we get super fancy, they will be way too complicated to even utilize or implement or not possible to implement on certain platforms.

This is where the constraints idea for media isn't horrible. The programmer can ask for a bunch of stuff, never truly knowing if it's possible, and they get accepted or rejected, and the programmer can ease constraints if they want and retry. Media has a bit of advantage that ranges are supported (but if we supported ranged values too, then that gets really complicated fast with things like the layer dependency trees possible for codec spatial/temporal layering).

...and this is just addressing hard limits that do not change. It gets way more complicated for fluctuating limitations or limitations that are not pre-known until acquisition time and may even change over time.

Diving into shared hard limits, like max encoding instances of a codecs across the entire OS we run into other issues. In your example, you say (paraphrased) to have capabilities return the current available values based on capacity and guarantee them within a reasonable time frame to allow the programmer to call send(...) or receive(...).

The trouble is how long to guarantee? So long as the current JS has not entered re-entered the event loop? What if multiple promises are being awaited before send()/receive() is called? Does that force the programmer to structure their code in such a way that getCapabilities() is called only before send()/receive() is called? If that is unreasonable (and I think it is) then what is the alternative? Is it time based? If so, how long? If this is a shared OS resource limitation, do we ask for the OS to allocate the resources on the off chance the programmer may ask for these resources during this window of time and then deallocate the resource? That's going to be expensive and non-realistic for many OSes where allocation is tied directly to acquisition of the resources with no in-between steps.

Basically, I think guaranteeing a resource is not really all that doable in a lot of programming + OS scenarios. More likely it would end up being likely the getCapabilities() would return "3" but then some other resource may improbably grab on of the resources lowering to "2" just resource before send()/receive() is called. Thus a guarantee isn't a true guarantee but more of a probability to succeed.

So I'm just not convinced it's the right approach to put in a few basic hard limits, when more likely the soft limits are the real issue, or the realistic limits are not capable of being easily expressed in capabilities. I'm not sure the current expressible values cover the real use cases a programmer likely cares about properly, e.g. "can you do this feature, or not?". Does the engine support simulcast (or not)? Does a codec support SVC (or not), and if so, what kind? temporal? spatial? Maybe the a boolean true/false support indicator is more valuable than the hard limit number value defined now (given the real hard limit is no where near expressive enough and soft limits are more likely the real issue even when a hard limit could be adequately described with a simple value).

Or maybe a better approach would be to reject any known limit being exceeded (i.e. if the engine knows it can't be done but has no way to express it) and never fail for soft limits (which are only best efforts only). Maybe under this condition a few basic hard limits could be expressed to help like maxSimulcastStreams, if they demonstrate valuable use cases and good hints of what the programmer may hit for common brick walls and can be expressed properly.

But my issue with rejection for all hard limits will be that most likely the programmer is going to ask for stuff that is close-to-right but not exactly-right and they are going to get rejected a lot instead of getting something akin to what they wanted. To fix this likely scenario, a programmer would need ranged values like media constraints (which again are very complicated with layers), or they will have to implement a lot of fall back re-attempts.

If we reject under any limit being exceeded, I can see the programmer doing stuff like this: ask for 2 spatial + temporal. But the engine rejects because layering and/or scales used is just not possible in this combination. The programmer backs off, retries with temporal or spatial but not both. Rejection happens because scale is not exactly possible (but close to possible). Programmer then scales back and doesn't ask for spatial or temporal because both were rejected, so now they ask for a vanilla stream with no features. Best effort would have gotten them a close to asking but using rejections caused the programmer to get no special features at all (because guessing what's wrong didn't yield a proper path to success).

My bottom line:

actual hard limits are very difficult to properly express in any real consumable way by a programmer
rejection upon exceeding a non-expressed hard limit will likely end up causing a LOT of rejection where the programmer gives up and falls back prematurely to vanilla streams instead of getting any advanced features close to what was asked.
best effort might not be exact enough to what the programmer really needed and succeeded prematurely
The programmer has no good way to discover what values really did get accepted (or could possibly get accepted)
Even if values were possible express in capabilities and accept "right now", current resource constraints may not allow the same values to succeed moments layer (although given a bit more time might succeed once again)

A possible way to fix would be to reject the send()/receive() but then give back a possible tweaked settings in the rejection which might be accepted if retried. The programmer has the option to retry the recommendation, or do their own back off instead. Given soft limits fluctuate, any recommended replacement values by an engine would (out of necessity) not fail but be guaranteed with a best effort as resources allow. If we need to know what's actually being allowed "right now" after a success, we'd likely need a totally different mechanism.

ibc commented 6 years ago

The programmer can ask for a bunch of stuff, never truly knowing if it's possible, and they get accepted or rejected, and the programmer can ease constraints if they want and retry.

Who is the "programmer"? The guy who is playing a video poker game at home? That approach won't work. No app is gonna handle unspecified rejections and "try again" with new and more relaxed constraints. How much "relaxed"? How many retry attempts?

You basically argue the same at the bottom of your comment :)

The trouble is how long to guarantee? So long as the current JS has not entered re-entered the event loop? What if multiple promises are being awaited before send()/receive() is called?

Right. That's why I did not specify for how long it should be guaranteed.

A possible way to fix would be to reject the send()/receive() but then give back a possible tweaked settings in the rejection which might be accepted if retried. The programmer has the option to retry the recommendation, or do their own back off instead.

I think we must focus in a different topic:

Currently ORTC is designed in a way that the app can check browser capabilities and then build a complete set of RTP parameters for sending or receiving. And since the app knows those capabilities, ORTC requires that parameters given to send() fully conform to those capabilities.

But: This model is broken when it comes to complex parameters that are also affected by dynamic conditions (such as CPU usage and so on). This is perfectly exposed in your comment above.

Perhaps we should discuss whether the current ORTC model based on send(EXACTLY_THESE_RTP_PARAMETES) is feasible or not. And I think it's not.

Something always in my mind is this:

const sender = new RTCRtpSender(videoTrack);

sender.send(desiredRtpParameters)
  .then((effectiveRtpParameters) => {
    mySigalingStuff.notify(effectiveRtpParameters);
  })
  .catch((error) => {
    console.error("fatal error: ", error);
  });

This way:

desiredRtpParameters do not need to include fields such as ssrc or similar (they are chosen by the browser and given later in effectiveRtpParameters).
desiredRtpParameters may request 2 spatial layers plus 3 temporal layers but, if the codec does not support such a mix, the browser could select a best effort approach and describe it in effectiveRtpParameters.
This is NOT about dynamic/temporal limitations (busy CPU, etc) since the browser MUST be ready for moving from high CPU usage to low CPU usage and vice-versa.
- This is, effectiveRtpParameters should always indicate the best feasible case, and then the browser may choose whether to dynamically relax what it sends or not.

Another approach (which in fact is very similar) is providing some kind of RTP parameters factory:

const sender = new RTCRtpSender(videoTrack);

RtpSender.generateParameters(videoTrack, desiredRtpParameters)
  .then((effectiveRtpParameters) => {
    return sender.send(effectiveRtpParameters);
  })
  .then(() =>
    mySigalingStuff.notify(sender.rtpParameters); // sender.rtpParameters match effectiveRtpParameters
  })
  .catch((error) => {
    console.error("fatal error: ", error);
  });

ibc commented 6 years ago

To summarize:

Current ORTC design

Check capabilities.
Build a set of RtpParameters that MUST be valid (otherwise send() will fail).
Call send() with those RtpParameters and signal them to the remote peer.

My proposal

Check capabilities (still required to know which codecs are supported, etc).
Build a set of RtpParameters that MAY be valid/feasible.
Let the RtpSender or a factory API to produce really effective RtpParameters.
Apply those effective RtpParameters to the sender and signal them to the remote peer.

robin-raymond commented 6 years ago

@ibc I'm actually liking the "options" enum kind of thing isn't bad. Could even be a property of the rtp parameters. RtpParameters { SendReceiveOptions options } where possible "strict", "relaxed" or whatever. I have no idea what to name it yet, but the idea isn't bad (to me). Or as an additional parameter to send(...) or received(...)

I'm going to map out two approaches:

1) A promise could return the values actually applied (as possible -- on a receiver, not all encodings will get filled in straight away for things like muxid being used where no SSRC is specified and no stream will have arrived yet). Or if a rejection occurs it could indicate why (human readable) and a set of parameters that could work as an alternative (i.e. the "relaxed" values).

2) A method of Promise<RtpParameters) generateEffectiveParameters(params) is added.

In the send()/receive() promise case, it's a bit more atomic that if relaxed option is allowed then all resources available could be allocated and applied in a single atomic step with less probability of failure and retry. But when using a "strict" option with a send()/receive() rejection promise occurs or by exposing a generateParameters(params) then it's less atomic. The resources that might have been able to acquired at the time of massaging the parameters end up no longer available available later. This is why I prefer an option to rtp parameters as I would cause the returned parameters in either case automatically set the options to "relaxed" so the next request will succeed (where the programmer could change this value back once again to "strict" if they absolutely must).

Note: We currently do not have a "get" method to obtain actually applied parameters in the sender or receiver. We do not need this in this model yet either if the promise return alternative good values.

For 1:

Pseudo code if we did a send/receive promise with usable successful or rejection parameters:

parameters.options = strict;
sender.send(track, parameters)
  .then((appliedParams) =>
    myassert(error.parameters.options === strict); // new effective parameters given are now strict
    mySignaller.notify(appliedParameters);
  })
  .catch(error) => {
    console.log(error.reason); // nice to know why it failed in our log
    myassert(error.parameters.options === relaxed); // new parameters given are relaxed by default
    sender.send(error.parameters)
      .then((appliedParams) =>
        mySignaller.notify(appliedParams);
      })
      .catch(unlikelyToFailError) {
        console.log(unlikelyToFailError.reason);
        throw unlikelyToFailError;  // treat this as fatal in example
      });
  });

There's one [fatal?] flaw to the signalling the applied parameters after calling send(...). There's no way for the receiver to setup prior to send(...) method being called. Which suggests that we might absolutely need to create a generateEffectiveParameters(...) method which MUST be guaranteed to succeed if used in a send(...)/receive(...) method on best effort basis. That allows for signalling to happen prior to the send(...) being called.

EDIT: I don't think this is necessarily a fatal flaw. The receiver could set up it's settings first before ever calling send(), then the sender (and should anyway). The sender must conform to the receiver's capabilities and thus signaling a second time what the sender does isn't truly needed.

For 2:

Perhaps this is better:

parameters.options = strict;
sender.generateEffectiveParameters(parameters)
  .then((usableParams) =>
    mySignaller.notify(usableParams);
    mySignaller.onreceivednotify = myfunction() {sender.send(usableParams);}
  })
  .catch(error) => {
    console.log(error); // this is fatal
    throw error; // cause fatal non-caught error in example
  });

Whatever we do, once the settings are applied the media engine still needs to be able to dynamically change streams based on CPU, bandwidth, or whatever. Because non hard-limits still happen so no matter how strict we make the parameters, available resources change over time.

robin-raymond commented 6 years ago

Actually, either 1 or 2 can work. If it's based on capabilities exchange. The receiver first gets it's capabilities, creates parameters, calls receive(...), sends those params to the sender. The sender takes the receiver's settings, does an intersection with its own capabilities, then calls send(...). The send parameters don't really need to be signaled back to the receiver. So a promise (or promise rejection) that carries RtpParameters on send() or receive() can absolutely work and my [fatal?] concern above I don't believe is valid.

aboba commented 5 years ago

Some parts of this issue (simulcast trail drop) have been handled in WebRTC 1.0. So I am closing this issue and refiling more specific issues.

w3c / ortc

Simulcast/SVC capabilities and error handling #837

Current ORTC design

My proposal