w3c / webrtc-extensions

A repository for experimental additions to the WebRTC API
https://w3c.github.io/webrtc-extensions/

Add RTCRtpEncodedSource and explainer #198

Closed guidou closed 4 months ago

guidou commented 7 months ago

Add an explainer for the upcoming RTCRtpSenderEncodedSource extension proposal



guidou commented 6 months ago

For easier reading: https://guidou.github.io/webrtc-extensions/#encoded-source-for-rtc-rtp-sender

aboba commented 6 months ago

It looks like this PR inadvertently removes Section 14 Event Summary. I fixed this.

guidou commented 6 months ago

> It looks like this PR inadvertently removes Section 14 Event Summary. I fixed this.

Thanks!

youennf commented 6 months ago

My main thought is that this API requires waiting for all packets of a frame, which creates some latency. This solution does not allow doing what an SFU does with the same level of performance. It might be a good enough compromise. If we already consider supporting packet forwarding, maybe that is what we should do instead.

API-wise, transferring seems harder than using what we have done for RTCRtpScriptTransform. We should consider the pros and cons of both approaches. For instance, the current proposal allows transferring to cross-agent-cluster dedicated workers, which is not allowed with RTCRtpScriptTransform. I see that more as an issue than a feature. I also think that an API where you create the source where you use it (instead of creating it in a worker and then transferring) is slightly easier to use. From an implementation point of view, it would also be simpler (at least in Safari) compared to supporting transferability. From a spec point of view, it would also be simpler since we would not have to deal with when you can transfer and when you cannot.

It is not clear why RTCRtpSenderEncodedSource is not providing a WritableStream like done for RTCRtpScriptTransform.

guidou commented 6 months ago

> My main thought is that this API requires waiting for all packets of a frame, which creates some latency. This solution does not allow doing what an SFU does with the same level of performance. It might be a good enough compromise. If we already consider supporting packet forwarding, maybe that is what we should do instead.

This is a good compromise for the use case of glitch-free forwarding with multiple input peer connections with failover. In this case, frames provide a convenient abstraction for doing failover quickly (no need for timeouts; just forward a frame from the first peer connection that provides it). It is also ergonomic for some other SFU-like operations where the outcome is frame-based (e.g., dropping frames that don't satisfy a certain property).

There is also an effort going on for a packet-level API, although we have not heard yet about developers interested in that API for the use case of glitch-free forwarding using multiple input peer connections. There are other use cases driving the design of that API.
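The failover rule described above can be sketched in plain JavaScript. This is purely illustrative (the frame shape and function name are hypothetical, not part of the proposal): frames arrive interleaved from several input peer connections, and the first copy of each frame wins.

```javascript
// Hypothetical sketch of frame-based failover (fan-in): forward the
// first copy of each frame to arrive and drop later duplicates.
// The { timestamp, source } frame shape is illustrative, not the spec's.
function selectFramesToForward(arrivals) {
  const seen = new Set();
  const forwarded = [];
  for (const frame of arrivals) {
    if (seen.has(frame.timestamp)) continue; // a slower input delivered it again
    seen.add(frame.timestamp);
    forwarded.push(frame); // first peer connection to provide the frame wins
  }
  return forwarded;
}
```

Note that no timeout is involved: as soon as any input produces a complete frame, it can be forwarded, and later copies of the same frame are simply ignored.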

> API-wise, transferring seems harder than using what we have done for RTCRtpScriptTransform. We should consider the pros and cons of both approaches. For instance, the current proposal allows transferring to cross-agent-cluster dedicated workers, which is not allowed with RTCRtpScriptTransform. I see that more as an issue than a feature. I also think that an API where you create the source where you use it (instead of creating it in a worker and then transferring) is slightly easier to use.

We actually only need to transfer to workers within the same agent cluster, so maybe a clarification is needed in the text of the proposed spec. However, I do agree with you that it is better to be able to create the source where you use it.

The main reasons the proposal exposes RTCRtpSenderEncodedSource only on DedicatedWorker are:

> From an implementation point of view, it would also be simpler (at least in Safari) compared to supporting transferability. From a spec point of view, it would also be simpler since we would not have to deal with when you can transfer and when you cannot.

I agree with you on this point. The idea was to deviate as little as possible from https://github.com/w3c/webrtc-encoded-transform/issues/211#issuecomment-1777458119 in order to make it easier to achieve consensus, but if you agree about exposing RTCRtpSenderEncodedSource on Window, then we can simplify this by removing transferability.

> It is not clear why RTCRtpSenderEncodedSource is not providing a WritableStream like done for RTCRtpScriptTransform.

I agree that a WritableStream would provide more flexibility here. Again, the reason was to minimize deviations from https://github.com/w3c/webrtc-encoded-transform/issues/211#issuecomment-1777458119.

If I understand correctly, your concerns can be summarized as follows:

We have already presented arguments in favor of the frame-based API, and in previous discussions we concluded that while latency may be a small disadvantage in some cases, the frame-based approach has clear advantages, particularly for the scenario of forwarding with glitch-free failover over multiple input peer connections.

I think the concerns about the spec text can be addressed as follows:

WDYT?

youennf commented 6 months ago

I would summarise my concern as:

- The potential redundancy and the extra latency a frame-based API has compared with a potential packet-based API

My other non-blocking comment is API shape (keeping the feature set as it is, meaning let's keep it to dedicated worker in this version):

- Evaluate transfer-based vs. script-transform-like API

As I commented on https://github.com/w3c/webrtc-encoded-transform/issues/211#issuecomment-1844972816, there are two different API proposals. The first one is well described in the explainer. The second is a script-transform-like API, something like:

```webidl
[Exposed=Window]
interface RTCRtpSenderEncodedSource {
    constructor(Worker worker, optional any options, optional sequence<object> transfer);
};

[Exposed=DedicatedWorker]
interface RTCRtpSenderEncodedSourceController {
    attribute WritableStream writable; // Or an enqueue method.
    // Needs a congestion API, an error API and an enqueue API.
};

partial interface DedicatedWorkerGlobalScope {
    attribute EventHandler onrtcencodedsource;
};
```

From a user perspective, there is probably no difference. There are probably some differences in terms of future-proofness, ease of use, ease of spec/dev...

jan-ivar commented 6 months ago

How does this stack up against just optimizing the obvious/suboptimal API? E.g.

```javascript
// Let relayPc be the PC used to relay frames to the next peer.
const [sender] = relayPc.getSenders();
// Let recvPc be the receiving PC.
recvPc.ontrack = async ({receiver}) => await sender.replaceTrack(receiver.track);
sender.transform = new RTCRtpScriptTransform(worker, {});
```

Browsers could detect when the input is a receiver.track and forward received RTCEncodedVideoFrames directly on the sender transform's readable (skipping decode and re-encode) where JS can modify metadata.
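Under that model, the worker-side transform could be a simple pass-through that only touches metadata. The sketch below is hypothetical: it uses a bare TransformStream as a stand-in for the transformer's readable/writable pair, which in a real worker would be wired up inside an `onrtctransform` handler.

```javascript
// Illustrative pass-through transform: encoded frames flow through
// unchanged, but JS gets a chance to inspect or rewrite metadata.
// A plain TransformStream stands in for the rtctransform streams here.
const passthrough = new TransformStream({
  transform(frame, controller) {
    // e.g. frame metadata could be adjusted here before forwarding
    controller.enqueue(frame);
  },
});
```

The point of the optimization is that, with decode and re-encode skipped, this transform is the only JS on the forwarding path.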

guidou commented 6 months ago

> I would summarise my concern as:
>
> - The potential redundancy and the extra latency a frame-based API has compared with a potential packet-based API

My position about this is that a frame-based API and a packet-based API offer different tradeoffs and therefore support different use cases. For use cases where the decision is based on frame properties, the extra latency of the frame-based API is not a problem since you have to wait for the whole frame to make the decision. For the glitch-free failover case (a.k.a. fan-in) it is also good because it provides an ergonomic mechanism that increases reliability without introducing timeouts (other than the latency of waiting for a frame). Also, an important practical consideration is that we have production-tested building blocks that make it easy to deploy a frame-based API, whereas a packet-based API is still in early stages of discussion with challenges that may still be unknown.

> My other non-blocking comment is API shape (keeping the feature set as it is, meaning let's keep it to dedicated worker in this version):
>
> - Evaluate transfer-based vs. script-transform-like API

> As I commented on w3c/webrtc-encoded-transform#211 (comment), there are two different API proposals. The first one is well described in the explainer. The second is a script-transform-like API, something like:
>
> ```webidl
> [Exposed=Window]
> interface RTCRtpSenderEncodedSource {
>     constructor(Worker worker, optional any options, optional sequence<object> transfer);
> };
>
> [Exposed=DedicatedWorker]
> interface RTCRtpSenderEncodedSourceController {
>     attribute WritableStream writable; // Or an enqueue method.
>     // Needs a congestion API, an error API and an enqueue API.
> };
>
> partial interface DedicatedWorkerGlobalScope {
>     attribute EventHandler onrtcencodedsource;
> };
> ```
>
> From a user perspective, there is probably no difference. There are probably some differences in terms of future-proofness, ease of use, ease of spec/dev...

Something I like about this alternative shape is the consistency with webrtc-encoded-transform, which will give us the opportunity to evolve both APIs in a similar manner. For example, we can define the congestion API in the same manner for both. It also addresses your concern about transferability. I'm fine with proceeding with this shape if you and @jan-ivar are also OK with it.
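For illustration, usage of that shape might look roughly as follows. Every name here comes from the quoted WebIDL sketch; none of it is implemented anywhere, so treat it purely as a strawman:

```javascript
// Window side (hypothetical API, mirroring RTCRtpScriptTransform):
const worker = new Worker('encoded-source-worker.js');
const source = new RTCRtpSenderEncodedSource(worker, { role: 'relay' });
// (how `source` attaches to an RTCRtpSender is still to be defined)

// Worker side (hypothetical event and controller):
onrtcencodedsource = ({ controller }) => {
  const writer = controller.writable.getWriter();
  // enqueue encoded frames here, honoring the to-be-defined
  // congestion and error signals on the controller
};
```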

jan-ivar commented 6 months ago

I understand we got here incrementally, but I don't see why JS is needed when we can add first-class support for fanout. Is the source of an RTCRtpEncodedSource always an RTCRtpReceiver? If so, why not skip the JS with this WebIDL:

```webidl
Promise<undefined> replaceTrack((MediaStreamTrack or RTCRtpReceiver)? withTrackOrReceiver);
```

Apps could then declare encoded-level forwarding explicitly:

```javascript
recvPc.ontrack = async ({receiver}) => await relayPc.getSenders()[0].replaceTrack(receiver);
```

The app can fail-over using replaceTrack to a different RTCRtpReceiver.
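For example, a fail-over under this proposal might look like the sketch below. This is hypothetical: the RTCRtpReceiver overload of replaceTrack does not exist today, and `primaryPc`, `relayPc`, and `backupReceiver` are illustrative names.

```javascript
// Hypothetical fail-over using the proposed replaceTrack overload:
// when the primary input fails, switch the relay's sender to the
// backup input's receiver, still without decoding or re-encoding.
primaryPc.oniceconnectionstatechange = async () => {
  if (primaryPc.iceConnectionState === 'failed') {
    await relayPc.getSenders()[0].replaceTrack(backupReceiver);
  }
};
```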

youennf commented 6 months ago

> I understand we got here incrementally, but I don't see why JS is needed when we can add first-class support for fanout.

AIUI, the desire is for the web page to act a little bit like an SFU. I do not think the UA will be able to implement everything that a web page could do, for instance:

youennf commented 6 months ago

> I'm fine with proceeding with this shape if you and @jan-ivar are also OK with it.

This API shape does indeed seem to have some benefits over the transfer-based API; I would tend to proceed with this one.

jan-ivar commented 6 months ago

> AIUI, the desire is for the web page to act a little bit like an SFU.

Is the goal still to repurpose RTCRtpScriptTransform to run an SFU in JS?

For example, previous choices there, like RTCEncodedVideoFrame being mutable and its serialization steps not copying the ArrayBuffer, were tailored to the simple frame-modification use cases of encrypt/decrypt and adding metadata.

How do we reconcile that SFU use cases seem better served by immutable RTCEncodedVideoFrames (per the tee problem)? If this isn't a perfect fit, doesn't that suggest it's not the right surface?

https://github.com/w3c/webrtc-encoded-transform/issues/134 discusses creating one from another, but it's not clear how one would go from immutable to mutable without a copy.
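The tee problem mentioned above is easy to demonstrate with plain web streams (runnable in Node 18+; the enqueued object merely stands in for a mutable encoded frame): `ReadableStream.tee()` does not copy chunks, so both branches observe mutations made through either one.

```javascript
// Demonstrates why mutable frames are awkward for fan-out: tee() hands
// the *same* chunk object to both branches, so mutating a frame on one
// branch is visible on the other. Immutable frames avoid this hazard.
const source = new ReadableStream({
  start(controller) {
    controller.enqueue({ data: new Uint8Array([1, 2, 3]) }); // stand-in frame
    controller.close();
  },
});
const [branchA, branchB] = source.tee();

const demo = (async () => {
  const frameA = (await branchA.getReader().read()).value;
  const frameB = (await branchB.getReader().read()).value;
  frameA.data[0] = 99;   // mutate via one branch...
  return frameB.data[0]; // ...the other branch observes the same mutation (99)
})();
```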

dontcallmedom-bot commented 5 months ago

This issue was discussed in the WebRTC Working Group meeting on 26 March 2024 (RTCRtpEncodedSource).

alvestrand commented 5 months ago

We can argue that mutable RTCEncodedVideoFrame was a design mistake at the time. However, the prospect of dealing with two variants of RTCEncodedVideoFrame, one mutable and one not, does not exactly fill me with joy. We should ensure that the implementation of RTCEncodedVideoFrame is performant for the non-mutating case, and users will get the point when they need the performance.