w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

"Presentation timestamp" is not defined in spec #107

Open alvestrand opened 3 years ago

alvestrand commented 3 years ago

The timestamp attribute of a frame is defined to be the "presentation timestamp", but that term is never defined.

Suggested definition:

The presentation timestamp is an indication of expected relative time of display between two frames. It is not guaranteed to correspond to any real (wall clock) time. For live media, it is RECOMMENDED that the timestamp be the wall clock time of camera capture, to the precision that this can be ascertained.
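
To illustrate the "relative time of display between two frames" reading, here is a minimal sketch. It assumes timestamps are in microseconds, as elsewhere in this thread. The `Timestamped` interface and `displayIntervalMs` helper are hypothetical names for illustration, not part of WebCodecs.

```typescript
// Hypothetical minimal shape: only the `timestamp` field (microseconds) is used.
interface Timestamped {
  timestamp: number;
}

// Expected display interval between two frames, in milliseconds.
// Only the difference is meaningful; neither value is assumed to be wall-clock time.
function displayIntervalMs(earlier: Timestamped, later: Timestamped): number {
  return (later.timestamp - earlier.timestamp) / 1000;
}

// Two frames roughly one 30 fps frame period apart.
const first: Timestamped = { timestamp: 1000000 };
const second: Timestamped = { timestamp: 1033333 };
console.log(displayIntervalMs(first, second));
```

Under this reading, adding the same offset to both timestamps leaves the result unchanged, which is why the definition does not need to tie the values to any clock.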

chcunningham commented 3 years ago

I support this definition. I'll send a PR shortly.

alvestrand commented 3 years ago

Ping - this came up again due to an incompatible implementation in webrtc-encoded-transform; can we nail this down?

tidoust commented 3 years ago

Cc @wolenetz, @tguilbert-google,

Trying to make sure that all specs converge on the same notion, or are at least aware of other contexts where a similar notion is in use, on top of WebCodecs and webrtc-encoded-streams, I note that:

There may be other places where a similar notion is used.

wolenetz commented 3 years ago

I believe the MSE definition is internally consistent. Also, since the extended media element has the option to change playbackRate (even for near-live playbacks), the mapping of MSE presentation timestamps in a media element (possibly adjusted by the coded frame processing algorithm during buffering) is not just relative to wall clock, but proportional to the rate of playback of the presentation. Within the various MSE bytestream format specifications, it may help to further clarify the source of PTS (and DTS if the format and/or codec in the format supports the notion of differing PTS and DTS). I've filed https://github.com/w3c/media-source/issues/292 accordingly.
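
The proportionality to playback rate described above can be sketched as follows. This is a simplified illustration, not the MSE algorithm: it assumes a constant, positive playback rate and microsecond timestamps, and `wallClockDelayMs` and its parameters are hypothetical names.

```typescript
// Wall-clock delay (ms) until a frame with presentation timestamp `ptsUs`
// would be shown, measured from the moment the frame at `anchorPtsUs` is shown.
// Assumption: the playback rate is constant and positive over that interval.
function wallClockDelayMs(
  ptsUs: number,
  anchorPtsUs: number,
  playbackRate: number
): number {
  if (playbackRate <= 0) {
    throw new RangeError("playbackRate must be positive in this sketch");
  }
  const mediaDeltaMs = (ptsUs - anchorPtsUs) / 1000;
  return mediaDeltaMs / playbackRate;
}

// One second of media time takes half a second of wall-clock time at 2x.
console.log(wallClockDelayMs(2000000, 1000000, 2));
```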

chcunningham commented 3 years ago

Sorry for the delay.

I'm still ok with what @alvestrand proposed, but I want to dig a bit more on why this is being discussed wrt webrtc-encoded-transform and make sure we're meeting whatever the goal is.

The presentation timestamp is an indication of expected relative time of display between two frames. It is not guaranteed to correspond to any real (wall clock) time.

Looks good. I'd make some minor edits as follows:

The expected time in microseconds when a given VideoFrame or AudioData is expected to be rendered (presented) relative to other VideoFrames or AudioDatas in the media timeline. It is not guaranteed to correspond to any real (wall clock) time.

For live media, it is RECOMMENDED that the timestamp be the wall clock time of camera capture, to the precision that this can be ascertained.

I'm guessing that bit of text is to be read by webrtc-encoded-transform implementers? I'm ok to put that in WC, but maybe it helps visibility for its intended audience if this recommendation is instead part of webrtc-encoded-transform? WebCodecs doesn't really care, as long as the relative nature of the timestamps is preserved per the first part of the dfn.

Media Source Extensions exports a presentation timestamp definition as "A reference to a specific time in the presentation. The presentation timestamp in a coded frame indicates when the frame SHOULD be rendered".

I think the MSE and proposed WebCodecs definitions are generally in agreement, but MSE gets to focus on "should be rendered" vs "relative to other frames" since MSE actually sees the whole timeline and directly affects rendering.

alvestrand commented 3 years ago

The reason for putting it here rather than in webrtc-encoded-transform is that in a chain consisting of media capture + possible breakout-box processing + WebCodecs encoder, there is no webrtc-encoded-transform involved, but I think it's still valuable to have guidance wrt media coming from real-time sources. We have to generate the timestamp at capture time and carry it consistently through the transformation chain, no matter how many steps it has.
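
One way to picture "carry it consistently through the transformation chain" is the sketch below. The `Sample` and `Stage` types and the `runChain` helper are hypothetical and stand in for real capture, processing, and encoding steps; the only point is that stages see the payload while the capture-time timestamp is passed through untouched.

```typescript
// Hypothetical container: a payload plus the timestamp assigned at capture (µs).
interface Sample<T> {
  timestamp: number;
  data: T;
}

// A processing stage transforms the payload only; it never sees the timestamp.
type Stage<T> = (data: T) => T;

// Runs every stage in order and re-attaches the original capture timestamp,
// so the value survives regardless of how many stages the chain has.
function runChain<T>(input: Sample<T>, stages: Stage<T>[]): Sample<T> {
  let data = input.data;
  for (const stage of stages) {
    data = stage(data);
  }
  return { timestamp: input.timestamp, data };
}

const captured: Sample<number> = { timestamp: 42, data: 1 };
const processed = runChain(captured, [(x) => x + 1, (x) => x * 2]);
console.log(processed.timestamp === captured.timestamp);
```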

chcunningham commented 3 years ago

I see. In that case it seems like adding to the breakout box spec might achieve better visibility? This is basically a capture recommendation. I'm still open to it in WC, just checking what's best.

chcunningham commented 2 years ago

@alvestrand ping on last q^

alvestrand commented 2 years ago

I think VideoFrame's definition actually has more visibility than breakout box (which is in the process of getting FPWD status, but isn't quite there yet).

I'm not sure we (WebRTC's MediaStreamTrack) are going to remain the only source of live capture either. So it would be nice to have it here. But sure, adding it to Breakout Box's MediaStreamTrackProcessor does make sense. I'll do a PR for that.

crisvp commented 1 year ago

@alvestrand Did you end up opening a PR for this? I was not able to find it.

The only reference to timestamps I see in the current draft is a note saying "The application may detect that frames have been dropped by noticing that there is a gap in the timestamps of the frames."
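
The gap detection that note alludes to could look roughly like this. It is a sketch under assumptions the draft does not state: that the application knows the nominal frame interval and that timestamps are in microseconds. `findGaps`, `expectedIntervalUs`, and `tolerance` are hypothetical names.

```typescript
interface Gap {
  afterIndex: number; // index of the last frame seen before the gap
  missing: number;    // estimated number of dropped frames
}

// Flags consecutive timestamps whose spacing exceeds the expected interval
// by more than `tolerance` (a fraction of the interval, 0.5 by default).
function findGaps(
  timestampsUs: number[],
  expectedIntervalUs: number,
  tolerance: number = 0.5
): Gap[] {
  if (expectedIntervalUs <= 0) {
    throw new RangeError("expectedIntervalUs must be positive");
  }
  const gaps: Gap[] = [];
  for (let i = 1; i < timestampsUs.length; i++) {
    const delta = timestampsUs[i] - timestampsUs[i - 1];
    if (delta > expectedIntervalUs * (1 + tolerance)) {
      const missing = Math.max(1, Math.round(delta / expectedIntervalUs) - 1);
      gaps.push({ afterIndex: i - 1, missing });
    }
  }
  return gaps;
}

// Nominal 30 fps sequence with one frame missing after index 2.
console.log(findGaps([0, 33333, 66666, 133332, 166665], 33333));
```

Note that this only works within a single sequence from a single source, since it relies on the timestamps sharing one reference point.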

I'm running into issues I thought were caused by an incomplete definition of "presentation timestamp."

I thought about it a bit more, and now I believe an overly specific definition may cause the issues.

To address the first suggestion first:

The presentation timestamp is an indication of expected relative time of display between two frames. It is not guaranteed to correspond to any real (wall clock) time.

The obvious question, with an ostensibly obvious answer, is "which two frames"? VideoFrame does not specify any context, let alone that all VideoFrames in a context must be handled sequentially or represent the same source. Maybe you want to interleave streams. Who am I to judge? 🤷

The other suggested text:

The expected time in microseconds when a given VideoFrame or AudioData is expected to be rendered (presented) relative to other VideoFrames or AudioDatas in the media timeline. It is not guaranteed to correspond to any real (wall clock) time.

The use case that led me to this issue is obtaining a MediaStream with two tracks: video and audio. I'm running both tracks through MediaStreamTrackProcessor. The resulting VideoFrames have timestamp values based on wall-clock time, while the AudioData timestamps are zero-based from (presumably) the start of the audio track.

With the proposed change, that would remain an acceptable situation. Even though "media timeline" might imply the timeline for the MediaStream (as opposed to the MediaStreamTrack), it is likely producers will continue to choose their own point of reference, within their own perceived "media timeline."

While the producer knows what its relative "media timeline" is, once it encodes its data into VideoFrame or AudioData objects, that knowledge of the "media timeline" is lost, and there is no way to retrieve it. It's not in the object, the spec, or elsewhere. And, without knowing what the "media timeline" is, we still don't know the presentation time.

The specific phrasing "relative to other VideoFrames or AudioDatas in the media timeline" makes it even easier to come to an incorrect conclusion that for my use-case the "media timeline" would be consistent between my VideoFrame sequence and my AudioData sequence.
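
For the video/audio use case above, one possible application-side workaround is to rebase each track to its own first sample. This is only a sketch, and it rests on a loud assumption that nothing in the spec guarantees: that the first VideoFrame and the first AudioData correspond to the same instant. `makeRebaser` is a hypothetical helper, and timestamps are taken to be in microseconds.

```typescript
// Returns a function that subtracts the first timestamp it sees from every
// timestamp passed to it, producing a zero-based timeline for one track.
// ASSUMPTION: the first sample of each track marks the same instant.
function makeRebaser(): (timestampUs: number) => number {
  let base: number | undefined;
  return (timestampUs: number): number => {
    const origin = base ?? timestampUs;
    base = origin;
    return timestampUs - origin;
  };
}

// One rebaser per track, since each track has its own reference point.
const rebaseVideo = makeRebaser(); // e.g. wall-clock-based timestamps
const rebaseAudio = makeRebaser(); // e.g. zero-based timestamps

console.log(rebaseVideo(1700000000000000), rebaseVideo(1700000000033333));
console.log(rebaseAudio(0), rebaseAudio(10000));
```

If the two tracks do not actually start at the same instant, the rebased timelines will be offset from each other by that difference, which is exactly the information the timestamp alone does not carry.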


An additional concern with the current version is that the definitions for timestamp differ between AudioData and VideoFrame. VideoFrame adds "[t]he timestamp is copied from the EncodedVideoChunk corresponding to this VideoFrame." to the definition. Which mostly just raises further questions in the case of breakout box.

And all of the above more or less also holds true for duration, which is defined as "[t]he presentation duration, given in microseconds."