guidou opened 4 months ago
cc @Djuffin, @padenot, @youennf, @aboba
Does this PR imply any behavior in the WebCodecs API? For example, on encoding, is there an expectation that VideoFrame.captureTime is copied to EncodedVideoChunk.captureTime? Or on decoding, is EncodedVideoChunk.receiveTime or EncodedVideoChunk.rtpMetadata to be copied to VideoFrame.receiveTime or VideoFrame.rtpMetadata?
If there are no changes in behavior (e.g. if the attributes don't affect the encode or decode process or some other aspect of WebCodecs) then the attributes could be defined in another specification where behavior is affected (e.g. mediacapture-transform?), and added to the VideoFrame Metadata Registry.
This PR as currently written does not imply any behavior in the WebCodecs API, although I would expect the things you mentioned (e.g., forwarding them to/from EncodedVideoChunk) as potentially useful.
The idea for this PR is to provide information to applications so that they can do similar things to what they can do with requestVideoFrameCallback (e.g., better A/V sync and delay measurements). This doesn't require any other behavior changes in WebCodecs (at least for applications using mediacapture-transform + WebRTC).
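For illustration, here is a minimal sketch (not part of the PR) of the kind of delay measurement an application could do with mediacapture-transform, assuming the proposed captureTime and receiveTime fields are exposed through VideoFrame.metadata() for a remote WebRTC track; the variable and function names are placeholders:

```ts
// Sketch: per-frame delay measurement on frames read from a remote WebRTC
// track via MediaStreamTrackProcessor, assuming the proposed captureTime /
// receiveTime metadata fields are present (they are not guaranteed today).
async function measureDelays(remoteVideoTrack: MediaStreamTrack) {
  const processor = new MediaStreamTrackProcessor({ track: remoteVideoTrack });
  const reader = processor.readable.getReader();
  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done || !frame) break;
    const meta = frame.metadata() as {
      captureTime?: DOMHighResTimeStamp;
      receiveTime?: DOMHighResTimeStamp;
    };
    if (meta.captureTime !== undefined && meta.receiveTime !== undefined) {
      // The raw difference mixes network delay with the sender/receiver clock
      // offset; separating the two is discussed further down in this thread.
      console.log(`frame ${frame.timestamp}: receive - capture = ${meta.receiveTime - meta.captureTime} ms`);
    }
    frame.close();
  }
}
```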
I think we can specify forwarding to EncodedVideoChunk in a separate PR since this one has value on its own without specifying further changes to WebCodecs.
I used to be skeptical about these timestamps since they are not passed through the encoding-decoding cycle, but since we already have entries in the VideoFrame Metadata Registry that don't do that, I think it's okay now.
And RTC software like Teams, Meet and Facetime can really use it for A/V sync and latency estimation, even if they have to pass this information via separate channels. So LGTM
I agree that this metadata is useful. The question is whether behavior is well specified, so that interop is possible. For example, there is the question of where the metadata originates: does the MediaStreamTrackProcessor method provide VideoFrame.captureTimestamp if the MST is obtained from a local capture? Does the MediaStreamTrackProcessor method provide VideoFrame.receiveTimestamp and VideoFrame.rtpMetadata if the MST is obtained remotely via WebRTC-PC?
I thought for all metadata entries the answer to these questions is MAY.
@Djuffin MAY might be ok for these metadata fields. However, is alignment of VideoFrame.timestamp and EncodedVideoChunk.timestamp optional for WebCodecs implementations?
I thought for all metadata entries the answer to these questions is MAY.
I agree from a WebCodecs POV, but it is not sufficient from an interop point of view. Probably each spec defining a MST video source should describe which metadata it generates, just like each spec defines which constraints are supported by a given source. Putting the definition at the source ensures the same metadata is exposed via MSTP or via the VideoFrame constructor (from a video element).
That would mean mediacapture-main and webrtc-pc here. As for mediacapture-transform's VideoTrackGenerator, nothing seems needed, though we could add a note stating that metadata is preserved.
FWIW, the requestVideoFrameCallback spec, where these fields are originally defined, says that captureTime applies to local cameras and remote frames (WebRTC), receiveTime to WebRTC frames, and rtpTimestamp to WebRTC frames. But I agree with @youennf that having each MST source spec indicate the metadata it generates is the best way to organize that.
In any case, we need to have entries for these fields in the VideoFrameMetadata registry.
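For comparison, this is roughly how the same fields surface today through requestVideoFrameCallback (captureTime, receiveTime and rtpTimestamp are optional members of the callback metadata in that spec); the video element lookup below is just an example:

```ts
// Read capture/receive/RTP timing for frames presented by a <video> element
// (e.g. one playing a remote WebRTC stream), using requestVideoFrameCallback.
const video = document.querySelector('video') as HTMLVideoElement;

function onFrame(_now: DOMHighResTimeStamp, metadata: VideoFrameCallbackMetadata) {
  if (metadata.captureTime !== undefined && metadata.receiveTime !== undefined) {
    console.log('receive - capture:', metadata.receiveTime - metadata.captureTime, 'ms');
  }
  if (metadata.rtpTimestamp !== undefined) {
    console.log('rtpTimestamp:', metadata.rtpTimestamp);
  }
  video.requestVideoFrameCallback(onFrame);
}

video.requestVideoFrameCallback(onFrame);
```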
Media WG meets today, please add agenda label if you'd like to discuss.
@Djuffin MAY might be ok for these metadata fields. However, is alignment of VideoFrame.timestamp and EncodedVideoChunk.timestamp optional for WebCodecs implementations?
They're mandatory.
Is there some kind of deep connection here that I miss?
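One practical connection (a sketch, not something this PR specifies): because a VideoFrame's timestamp is carried through to the EncodedVideoChunk produced for it, an application can use the timestamp as a key to reassociate captureTime with the encoder output and forward it over its own channel. The captureTime metadata field and the sendChunk helper below are assumptions/placeholders:

```ts
// Sketch: carry per-frame captureTime across a VideoEncoder by keying on
// timestamp, relying on VideoFrame.timestamp matching EncodedVideoChunk.timestamp.
declare function sendChunk(chunk: EncodedVideoChunk, captureTime?: DOMHighResTimeStamp): void; // placeholder

const captureTimes = new Map<number, DOMHighResTimeStamp>();

const encoder = new VideoEncoder({
  output: (chunk) => {
    const captureTime = captureTimes.get(chunk.timestamp);
    captureTimes.delete(chunk.timestamp);
    // EncodedVideoChunk has no captureTime attribute today, so the value is
    // forwarded out-of-band (e.g. an RTP header extension or a data channel).
    sendChunk(chunk, captureTime);
  },
  error: (e) => console.error(e),
});
encoder.configure({ codec: 'vp8', width: 640, height: 480 });

function encodeFrame(frame: VideoFrame) {
  const meta = frame.metadata() as { captureTime?: DOMHighResTimeStamp };
  if (meta.captureTime !== undefined) captureTimes.set(frame.timestamp, meta.captureTime);
  encoder.encode(frame);
  frame.close();
}
```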
Minutes from 9 July 2024 Media WG meeting. @aboba summarised the conclusion in https://github.com/w3c/webcodecs/pull/813#pullrequestreview-2169568794.
Summary of WG discussion: HTMLVideoElement.requestVideoFrameCallback is not the best spec to reference here, because it doesn't describe how and when these timestamps are set. Corresponding changes need to be made in the MediaStreamTrackProcessor and Media Capture and Streams specs. Something along the lines of: "MediaStreamTrackProcessor sets capture timestamps for VideoFrames coming from camera..."
Later this PR should reference these specs.
"And RTC software like Teams, Mean and Facetime can really use it for A/V sync and latency estimation, even if they have to pass this information via separate channels."
[BA] To do A/V sync, captureTime and receiveTime need to be provided for both audio and video.
Also, if they are to be usable for non-RTP transports, they need to be defined in a way that is independent of RTP/RTCP. For example, on the local peer, captureTime represents the capture time of the first byte according to the local wallclock. On a remote peer, captureTime is set by the receiver. For example, the local peer's captureTime can be serialized on the wire and then set on the receiver (e.g. not adjusted to the receiver wallclock). receiveTime is set on the receiver, based on the receiver's wallclock. (receiveTime - captureTime) can then be used to estimate the sender/receiver offset as well as jitter.
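A small sketch of that estimation, assuming captureTime comes from the sender's wallclock and receiveTime from the receiver's. The raw difference mixes one-way delay with clock offset, so only its minimum and its variation are tracked; the smoothing is RFC 3550-style:

```ts
// Running estimate of sender/receiver offset and jitter from per-frame
// (receiveTime - captureTime) differences.
let minOffsetMs = Number.POSITIVE_INFINITY; // ≈ clock offset + minimum network delay
let prevOffsetMs: number | undefined;
let jitterMs = 0;

function update(captureTime: DOMHighResTimeStamp, receiveTime: DOMHighResTimeStamp) {
  const offsetMs = receiveTime - captureTime; // one-way delay + clock offset
  minOffsetMs = Math.min(minOffsetMs, offsetMs);
  if (prevOffsetMs !== undefined) {
    // RFC 3550-style jitter filter over consecutive offset differences.
    jitterMs += (Math.abs(offsetMs - prevOffsetMs) - jitterMs) / 16;
  }
  prevOffsetMs = offsetMs;
  return { offsetMs, minOffsetMs, jitterMs };
}
```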
This PR has been updated to reference mediacapture-extensions where these concepts are now properly defined (similar to human face segmentation).
These fields are useful for WebRTC-based applications. See issue #601.