Open alvestrand opened 2 years ago
Another attribute that would be very nice to have is some kind of id that can be matched to unencoded frames processed upstream using the insertable streams api. It seems one reasonable candidate for an id (if we don't want to add a new one) is VideoFrame.timestamp (https://developer.mozilla.org/en-US/docs/Web/API/VideoFrame). It would be nice if this value was propagated through the encoder and attached to the RTCEncodedVideoFrame or RTCEncodedVideoFrameMetadata.
Example use: Updating csrc list (part of RTCEncodedVideoFrameMetadata) to match any mixing/compositing done when processing this particular frame upstream to the encoding.
Hmm, maybe I was confused; I thought we only had RTP timestamp in RTCEncodedVideoFrame, but it also has this timestamp field: https://w3c.github.io/webrtc-encoded-transform/#dom-rtcencodedvideoframe-timestamp
It's poorly documented, but if it's specified to equal the VideoFrame.timestamp of the corresponding unencoded frame, it's should be good enough, I think.
have you seen https://chromium-review.googlesource.com/c/chromium/src/+/3654171 which clarified things a bit? (it is a mess...)
Confusing indeed. Seems this was discussed previously on https://github.com/w3c/webrtc-encoded-transform/pull/116.
Exposing capture time should be reasonably easy on the send side (prototype cl to do that for video: https://webrtc-review.googlesource.com/c/src/+/265403).
I don't really understand the issues in getting that kind of timestamp on the receive side. If it can't easily be set to useful value on receive side, making it optional would be an improvement for use cases I have in mind.
we can't change the definition of timestamp now :-( :+1: for exposing an optional absoluteCaptureTimestamp (present if the abs-capture-time header extension is used?) on RTCEncodedVideoFrameMetadata though (it looks like some code surrounding your CL already does something similar?)
So what if we add two new attributes to RTCEncodedVideoFrameMetadata:
rtpTimestamp (required, representing the uint32_t RTP timestamp)
timestamp (optional, intended to match VideoFrame.timestamp. It's clear to me how that should work on the send side. Less clear on the receive side, could perhaps be conditionally populated depending on presence of absolute capture time extension).
Does that make sense? We could then consider deprecating the current timestamp attribute which isn't in the "metadata".
Less clear on the receive side, could perhaps be conditionally populated depending on presence of absolute capture time extension
I'd go for captureTime -- similar to how it works in requestVideoFrameCallback - @drkron can explain the details. Downside of this is that it is best effort and requires SR or rtrr for clock offset estimation so it won't be available on every frame.
I created a pull request, see https://github.com/w3c/webrtc-encoded-transform/pull/137. It seems I may have to create some w3c account, to get it on track for the process?
indeed @nisse-work although in practice the high-level bit will be your employment situation and its impact on IPR. Please get in touch at dom@w3.org if you want more details.
or talk to @alvestrand ;-)
Back to setMetadata. I can make a pull request to add that, but I need some guidance about how it should work.
dictionary RTCEncodedVideoFrameMetadata
listen in the spec, is that a plain javascript dictionary to the application, so that, e.g., it could be constructed as {"width" : 100}
? It seems clear that this should be the argument type for setMetadata, for consistency with the getMetadata method.Implementing setMetadata requires going deep into the webrtc stack and deal with a number of objects living in different threads. Might be doable synchronously (as getMetadata is), but it would be prudent to try to sketch an implementation before committing to a sync interface.
To sketch this API, we need to identify many use cases, especially if we go with setMetadata. Going with a dedicated csrc/ssrc mutator might be easier to sell if the specific use case is convincing.
As of sync vs. async, if we have to hop to threads, we usually go async. But I think we should first look at the actual model. It seems to me the natural model is that the metadata is only used after the frame is enqueued: at the time of enqueueing the frame, we clone the metadata and this gets used down the pipe. This seems to call for a sync API. The question of bad metadata is something to take into account. We should also learn from WebCodec decoders and metadata approach.
Implementing setMetadata requires going deep into the webrtc stack and deal with a number of objects living in different threads. Might be doable synchronously (as getMetadata is), but it would be prudent to try to sketch an implementation before committing to a sync interface.
I had a quick look at the webrtc implementation of getMetadata, and it appears to return a reference to internal state, and it appeared to make the assumption that it's immutable (and then I suppose chromium makes a copy when packaging it for js access). Making it mutable and properly synchronized looked a bit tricky at first look.
So maybe we shouldn't require synchronous operation.
- I'd suggest that the setMetadata fully replaces the current metadata. So get/modify/set would be the normal pattern. Sending in a dictionary with only the changes makes it hard to describe that you want to delete an attribute.
The current attributes look non-optional to me, so that they can be changed but not deleted. But I guess it's not unlikely that optional attributes will be added later, so good to take that use case into account.
Synchronous = SetMetadata() returns immediately, having affected the metadata Asynchronous = SetMetadata() (probably) posts a task that performs the modification, and returns a promise that will resolve when the modification is done. I don't think exposing the attributes as individually mutable makes sense, and having one mutator for each piece of modifiable metadata seems an invitation to interface cruft.
I also imagine that the only time SetMetadata makes sense is after a frame has been created or received, and before enqueueing it to its destination; modifying it at any other time would be a positive invitation to a race condition.
webcodecs (https://w3c.github.io/webcodecs/#encodedvideochunk-interface) seems to have ignored frame-attached metadata. Instead, it chose to emit EncodedVideoChunkMetadata on its callback interface, alongside the video frame itself: https://w3c.github.io/webcodecs/#encoded-video-chunk-metadata
One of the pieces of their metadata is the "VideoDecoderConfig".
Regarding sync or async: I think it is fine to have success/failure indication be asynchronous. But it would be nice is setMetadata followed by enqueuing of the frame guarantees that the modification takes effect before the frame goes out on the wire, without the javascript having to explicitly wait for setMetadata completion.
Would be nice if we could regard an encodedFrame as a pure data object, which would make it easy to say that operations that modify it are synchronoous.
I came up with another use-case. In the context of a RED encoder I'd like to modify the payload type. Currently it requires me to play a shell game with SDP and switching between opus and red payload types but iff I could modify the payload type that would not be necessary.
I agree that a sync setMetadata call here makes most sense, as one is only mutating local state which is later passed down into the peerconnection and handed off across threads, as Youenn described above. I put together a proof of concept implementation of such an sync API in Chrome & WebRTC, just to make sure we're not missing anything and it seems to work fine. Given this, are there any remaining concerns around making the method synchronous?
The RED example does seem like a good concrete instance of a class of usecases where the encoded transform drastically changes the payload such that the existing metadata no longer matches it. I've not been able to find any good way for apps to handle this without such a setMetadata() method.
In the context of the use case of "take an incoming frame and send it over another PC", we've found the need for setting:
I worry this direction will interfere with https://github.com/w3c/webrtc-encoded-transform/issues/141.
EncodedVideoChunk and EncodedAudioChunk are immutable by design, simplifying serialization and avoids TOCTOU issues.
Instead of making metadata of an existing RTCEncodedVideoFrame mutable, EncodedVideoChunk has a constructor that makes it easy and efficient to copy/transfer-and-modify. Perhaps we could follow a similar pattern here? E.g.
const newFrame = new RTCEncodedVideoFrame({
type: frame.type,
data: frame.data,
metadata: frame.getMetadata(),
transfer: [frame.data]
});
I worry this direction will interfere with #141.
EncodedVideoChunk and EncodedAudioChunk are immutable by design, simplifying serialization and avoids TOCTOU issues.
Instead of making metadata of an existing RTCEncodedVideoFrame mutable, EncodedVideoChunk has a constructor that makes it easy and efficient to copy/transfer-and-modify. Perhaps we could follow a similar pattern here? E.g.
const newFrame = new RTCEncodedVideoFrame({ type: frame.type, data: frame.data, metadata: frame.getMetadata(), transfer: [frame.data] });
We agreed on the constructor approach in the December meeting and sent a PR. Can you take a look? https://github.com/w3c/webrtc-encoded-transform/pull/223
In some applications there is the need to modify the metadata, but the spec only allows a GetMetadata operation. This is inconvenient if the transform operation means that the metadata really should change.
A PutMetadata function would make it simple.