w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

Add captureTime, receiveTime and rtpMetadata to VideoFrameMetadata #813

Open · guidou opened 4 months ago

guidou commented 4 months ago

These fields are useful for WebRTC-based applications. See issue #601
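
For context, a minimal sketch of the shape these additions might take, written as hypothetical TypeScript declarations (the exact WebIDL lives in the PR; the contents of rtpMetadata here are an assumption, modeled on the corresponding requestVideoFrameCallback fields):

```ts
// Hypothetical shape of the proposed VideoFrameMetadata additions.
// Field names follow the PR title; the rtpMetadata contents are assumptions.
interface RtpMetadataSketch {
  rtpTimestamp: number; // RTP timestamp of the frame's packets (assumption)
}

interface VideoFrameMetadataSketch {
  captureTime?: DOMHighResTimeStamp; // when the frame was captured
  receiveTime?: DOMHighResTimeStamp; // when the frame's last packet arrived
  rtpMetadata?: RtpMetadataSketch;   // RTP-level details for WebRTC sources
}
```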

guidou commented 4 months ago

cc @Djuffin, @padenot , @youennf, @aboba

aboba commented 4 months ago

Does this PR imply any behavior in WebCodecs API?

For example, on encoding is there an expectation that VideoFrame.captureTime is copied to EncodedVideoChunk.captureTime? Or on decoding is EncodedVideoChunk.receiveTime or EncodedVideoChunk.rtpMetadata to be copied to VideoFrame.receiveTime or VideoFrame.rtpMetadata?

If there are no changes in behavior (e.g. if the attributes don't affect the encode or decode process or some other aspect of WebCodecs) then the attributes could be defined in another specification where behavior is affected (e.g. mediacapture-transform?), and added to the VideoFrame Metadata Registry.

guidou commented 4 months ago

This PR as currently written does not imply any behavior in the WebCodecs API, although I would consider the behaviors you mention (e.g., forwarding the fields to/from EncodedVideoChunk) potentially useful.

The idea of this PR is to give applications the information they need to do the same things requestVideoFrameCallback enables (e.g., better A/V sync and delay measurements). This doesn't require any other behavior changes in WebCodecs (at least for applications using mediacapture-transform + WebRTC).
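
A minimal sketch of that use case, assuming a UA that supports mediacapture-transform and populates the proposed fields for a remote WebRTC track (the cast is needed because these fields are not in the standard typings):

```ts
// Log per-frame (receiveTime - captureTime) for a remote WebRTC video track.
async function logFrameDelays(remoteTrack: MediaStreamTrack): Promise<void> {
  const processor = new MediaStreamTrackProcessor({ track: remoteTrack });
  const reader = processor.readable.getReader();
  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done || !frame) break;
    const meta = frame.metadata() as { captureTime?: number; receiveTime?: number };
    if (meta.captureTime !== undefined && meta.receiveTime !== undefined) {
      // Note: this delta includes the sender/receiver clock offset
      // (see the later discussion in this thread).
      console.log('receive - capture (ms):', meta.receiveTime - meta.captureTime);
    }
    frame.close();
  }
}
```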

guidou commented 4 months ago

I think we can specify forwarding to EncodedVideoChunk in a separate PR since this one has value on its own without specifying further changes to WebCodecs.

Djuffin commented 4 months ago

I used to be skeptical about these timestamps, since they are not passed through the encoding-decoding cycle. But since we already have entries in the VideoFrame Metadata Registry that aren't passed through either, I think it's okay now.

And RTC software like Teams, Meet, and FaceTime can really use it for A/V sync and latency estimation, even if they have to pass this information via separate channels. So LGTM.

aboba commented 4 months ago

I agree that this metadata is useful. The question is whether the behavior is well enough specified for interop to be possible. For example, there is the question of where the metadata originates:

  1. MAY/SHOULD/MUST MediaStreamTrackProcessor provide VideoFrame.captureTime if the MST is obtained from a local capture? (See the sketch below.)
  2. MAY/SHOULD/MUST MediaStreamTrackProcessor provide VideoFrame.receiveTime and VideoFrame.rtpMetadata if the MST is obtained remotely via WebRTC-PC?
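
To illustrate question 1, a sketch of probing whether a UA populates captureTime for a locally captured track (hypothetical behavior, since population is at most a MAY today; assumes mediacapture-transform is available):

```ts
// Probe whether this UA populates captureTime for a local camera track.
async function hasLocalCaptureTime(): Promise<boolean> {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const [track] = stream.getVideoTracks();
  const reader = new MediaStreamTrackProcessor({ track }).readable.getReader();
  const { value: frame } = await reader.read();
  const meta = frame?.metadata() as { captureTime?: number } | undefined;
  frame?.close();
  track.stop();
  return meta?.captureTime !== undefined;
}
```
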
Djuffin commented 4 months ago

I thought for all metadata entries the answer to these questions is MAY.

aboba commented 4 months ago

@Djuffin MAY might be ok for these metadata fields. However, is alignment of VideoFrame.timestamp and EncodedVideoChunk.timestamp optional for WebCodecs implementations?

youennf commented 4 months ago

> I thought for all metadata entries the answer to these questions is MAY.

I agree from a WebCodecs POV, but it is not sufficient from an interop point of view. Probably each spec defining a MST video source should describe which metadata it generates, just like each spec defines which constraints a given source supports. Putting the definition at the source ensures the same metadata is exposed via MSTP and via the VideoFrame constructor (from a video element).

That would mean mediacapture-main and webrtc-pc here. As for mediacapture-transform's VideoTrackGenerator, nothing seems needed, though we could add a note stating that metadata is preserved.

guidou commented 4 months ago

FWIW, the requestVideoFrameCallback spec, where these fields were originally defined, says that captureTime applies to local camera frames and remote (WebRTC) frames, while receiveTime and rtpTimestamp apply to WebRTC frames only. But I agree with @youennf that having each MST source spec indicate the metadata it generates is the best way to organize this.
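
For reference, these fields are already observable today through requestVideoFrameCallback on a video element; a short sketch (which fields are present depends on the track's source):

```ts
// Observe the equivalent metadata via HTMLVideoElement.requestVideoFrameCallback.
const video = document.querySelector('video')!;
video.requestVideoFrameCallback((now, metadata) => {
  console.log({
    captureTime: metadata.captureTime,   // local capture or WebRTC frames
    receiveTime: metadata.receiveTime,   // WebRTC frames only
    rtpTimestamp: metadata.rtpTimestamp, // WebRTC frames only
  });
});
```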

In any case, we need to have entries for these fields in the VideoFrameMetadata registry.

chrisn commented 4 months ago

Media WG meets today, please add agenda label if you'd like to discuss.

Djuffin commented 4 months ago

> @Djuffin MAY might be ok for these metadata fields. However, is alignment of VideoFrame.timestamp and EncodedVideoChunk.timestamp optional for WebCodecs implementations?

They're mandatory.

Is there some kind of deep connection here that I'm missing?

chrisn commented 4 months ago

Minutes from the 9 July 2024 Media WG meeting. @aboba summarised the conclusion in https://github.com/w3c/webcodecs/pull/813#pullrequestreview-2169568794.

Djuffin commented 4 months ago

Summary of the WG discussion: HTMLVideoElement.requestVideoFrameCallback is not the best spec to reference here, because it doesn't describe how and when these timestamps are set. Corresponding changes need to be made in the MediaStreamTrackProcessor and Media Capture and Streams specs. Something along the lines of: "MediaStreamTrackProcessor sets capture timestamps for VideoFrames coming from a camera..."

This PR should then reference those specs.

aboba commented 2 months ago

"And RTC software like Teams, Mean and Facetime can really use it for A/V sync and latency estimation, even if they have to pass this information via separate channels."

[BA] To do A/V sync, captureTime and receiveTime need to be provided for both audio and video.

Also, if they are to be usable for non-RTP transports, they need to be defined in a way that is independent of RTP/RTCP. For example, on the local peer, captureTime represents the capture time of the first byte according to the local wallclock. On a remote peer, captureTime is set by the receiver; for example, the local peer's captureTime can be serialized on the wire and then set on the receiver as-is (i.e., not adjusted to the receiver's wallclock). receiveTime is set on the receiver, based on the receiver's wallclock. (receiveTime - captureTime) can then be used to estimate the sender/receiver offset as well as jitter.
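
A sketch of that estimation, assuming a stream of (captureTime, receiveTime) pairs as defined above. Since the raw delta includes the unknown clock offset between the peers, using the minimum observed delta as a baseline approximately cancels it out; the jitter smoothing follows the spirit of RFC 3550 §6.4.1:

```ts
// Estimate queuing delay and interarrival jitter from per-frame timestamps.
function makeDelayEstimator() {
  let minDelta = Infinity; // clock offset + minimal network delay (proxy)
  let jitter = 0;          // exponentially smoothed jitter, in ms
  let prevDelta: number | undefined;
  return (captureTime: number, receiveTime: number) => {
    const delta = receiveTime - captureTime;
    minDelta = Math.min(minDelta, delta);
    if (prevDelta !== undefined) {
      // Smoothing factor 1/16, as in the RTCP jitter calculation.
      jitter += (Math.abs(delta - prevDelta) - jitter) / 16;
    }
    prevDelta = delta;
    return { queuingDelayMs: delta - minDelta, jitterMs: jitter };
  };
}
```

The same estimator could be fed audio and video pairs separately, which is what makes the per-frame metadata useful for A/V sync.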

guidou commented 1 week ago

This PR has been updated to reference mediacapture-extensions where these concepts are now properly defined (similar to human face segmentation).