Add keyframe interval support.

handellm commented 1 year ago

This PR adds support for video key frame configurability in the MediaRecorder API.

alvestrand commented 1 year ago

I think this is OK. Was wondering whether the interval is reset when the encoder decides on its own to insert a keyframe, but concluded after reading again that the language is sufficiently nonrestrictive that it does not matter.

alvestrand commented 1 year ago

Would be good to have 2 slides on this at the February interim, just to make sure it's been on the agenda.

Pehrsons commented 1 year ago

To my understanding most encoders configure keyframe intervals in number of frames, not based on time. What's the intended way for an implementation to convert the time interval to a frame interval given that video MediaStreamTracks by definition have no fixed frame rate?

Would it be a better fit to define the keyframe interval as a frame interval? The application might have a better understanding of the rate at which it will feed frames to the MediaRecorder. Is there specific use case where a time interval is a better fit?

One of the main use cases for recurring keyframes in a non-streamed setting that I can think of is to be able to seek to any given point in the file by jumping to the keyframe prior to that point and decoding your way there. The speed of this process is determined by the number of frames between the keyframe and the point one is seeking to. So to be able to get consistent seek times with a variable frame rate, a key frame interval expressed as a frame interval seems favorable.

A min and a max interval seems like an even better fit for most encoders, but I guess MediaRecorder could be free to infer these.

youennf commented 1 year ago

To my understanding most encoders configure keyframe intervals in number of frames, not based on time.

On macOS/iOS you have the choice. In many case, there is the possibility to mark a specific frame to be encoded as key frame (see https://www.w3.org/TR/webcodecs/#dictdef-videoencoderencodeoptions for instance).

For what is worth, WebKit is doing the following: a key frame every 2 seconds or 60 frames, whichever comes first. I wonder whether we could provide suggestions to implementors for the default value, as a non normative statement for instance.

Getting a better sense of the use case to go with either time or/and frame count would be good.

handellm commented 1 year ago

Is there an example of an encoder where per-frame control of keyframes isn't possible?

The use case is to let the MediaRecorder support basic control over seekability without requiring post-processing and give the user a basic knob to control the tradeoff between muxed size, quality & seek accuracy. Another useful consequence is also that parts of the stream can be excised with specified temporal accuracy without requiring transcoding.

Is there specific use case where a time interval is a better fit?

My bet is the user thinks more about seconds than about frame intervals. I struggle to see use cases where a count interval will be better. While it's possible to influence min and max track framerates with getDisplayMedia/getUserMedia and applyConstraints, my understanding is the default source behavior is unspecified. Which means a frame interval gives an undefined temporal accuracy in such a case, or when it's impossible to know the configured rates.

A consequence of a temporal interval is that muxed streams with only keyframes may result. If the output bitrate is constrained, the muxed stream may be of lower quality in stretches with low framerate. But MediaRecorder is supposed to be a simple API. Where more refined control than "increased bitrate config" is needed, the user can always go to WebCodecs.

With the same rationale, I think providing both count & temporal control + logic spec might not be needed for this API.

I wonder whether we could provide suggestions to implementors for the default value, as a non normative statement for instance.

Do you have a suggestion in mind?

youennf commented 1 year ago

Do you have a suggestion in mind?

I would look at whether Chrome and Firefox have defaults (or will have after this change), similarly to WebKit ones. Maybe we can somehow harmonise values.

handellm commented 1 year ago

Yes, that makes sense. The current story for Chrome is that there are generally no keyframes forced midstream, except if the mimeType results in getting an accelerated codec, then you get a count based interval of 100 frames (which plays badly with lower framerate input streams). Happy to change this closer to FF/Safari.

henbos commented 1 year ago

I prefer interval over number of frames because I think key frames have a lot to do with seekability, and if this was a count instead of an interval I feel like you'd end up having to calculate it anyway. That said this isn't really my area, so feel free to continue the discussion. Either way I'm happy, +1

Pehrsons commented 1 year ago

Is there an example of an encoder where per-frame control of keyframes isn't possible?

Is there a risk with the per-frame API for keyframes, that the encoded stream has more visual artifacts as a result of the encoder having a harder time predicting how to allocate the target bitrate, than if it were to configure the encoder with a per-frame interval?

I have seen behavior in libvpx suggesting this, but I'm not sure how much of it was due to using VPX_EFLAG_FORCE_KF and how much was due to a libvpx bug.

IIRC Firefox sets a keyframe interval of 10 seconds, for historical reasons [1], and occasionally recalculates the frame-interval based on recent incoming framerate, so it can configure the encoder with this.

[1] our ebml writer only creates a new cluster on a keyframe, and the max timestamp within a cluster is 32 seconds.

Pehrsons commented 1 year ago

I prefer interval over number of frames because I think key frames have a lot to do with seekability, and if this was a count instead of an interval I feel like you'd end up having to calculate it anyway.

I'm not sure what you mean by seekability, but consider these cases:

(frame interval) Every chunk contains 1 keyframe and 59 non-keyframes. Seeking to the middle means decoding 30 frames. No matter where in the file you want to seek, the mean number of frames to decode will be 30.
(time interval) Every chunk is 1 second. One chunk is 30 fps and contains 1 keyframe and 29 non-keyframes. Seeking to the middle means decoding 15 frames. Another chunk is 120fps and contains 1 keyframe and 119 non-keyframes. Seeking to the middle means decoding 60 frames. For any given seek operation, the number of frames to decode will depend on the framerate around that time.

This is the tradeoff I see. Perhaps there's something to be said about predictability and bitrate allocation (and if that is the case, it would be in favor of frame intervals?) but I don't really know.

handellm commented 1 year ago

Is there a risk with the per-frame API for keyframes, that the encoded stream has more visual artifacts as a result of the encoder having a harder time predicting how to allocate the target bitrate, than if it were to configure the encoder with a per-frame interval?

Is there specific use case where a time interval is a better fit?

It doesn't have anything to do with per-frame API, but with a fixed bitrate budget, quality per frame becomes lower as the ratio of num keyframe/num deltaframes gets higher. When you think about it this is more likely to happen with time-based control with for example low-framerate slideware. With ultimate control over keyframe generation, this seems like a situation a count-based control makes sense. (Note, ultimate control isn't entirely possible. Encoders anyway emit keyframes when diffs to reference frames becomes too large, so count-based will not be predictable in this situation.)

This is the tradeoff I see.

Worth mentioning here's it's also the case that with count-based intervals you can't excise with specified accuracy. Say we've recorded a presentation with occasional slide changes. You'd like to save 0.58 minutes - 1.30 minutes in a webm somewhere else. If the framerate was low in the 0.58 section, with count-based intervals chances are you could have to save an area tens of seconds before that point.

How about specifying both? Say MR configuration is 2 parameters X (min time between keyframes) and Y (min framecount between keyframes):

void OnFrame(const VideoFrame& frame) {
  ++currentFrameIndex;
  now = getCurrentTime();
  forceKeyFrame = now - lastKeyFrameTimestamp >= X &&
      currentFrameIndex - lastKeyFrameIndex >= Y;
  if (forceKeyFrame) {
    lastKeyFrameTimestamp = now;
    lastKeyFrameIndex = currentFrameIndex;
  }
  Encode(frame, forceKeyFrame);
}

This supports three cases:

Pure time-based control (set Y to 0)
Pure count-based control (set X to 0)
Compromises (X and Y > 0) for variable-rate frame input.

shangl commented 1 year ago

Looks good to me. From a user perspective, I'd also say that time-based control would make more sense. Can't speak for what's better with encoders though.

handellm commented 1 year ago

Updated to include support for both time and count base interval. PTAL!

alvestrand commented 1 year ago

Editors meeting: @youennf and @jan-ivar had a few comments (editorial and non-editorial) that need addressing.

youennf commented 1 year ago

Two comments:

It would be good to rephrase the use of the new slots like the other slots in terms of "Constrain the configuration of |recorder| to encode"....
In macOS and iOS, when setting both values, I think the min is used (I'll check) but the spec is asking for the max.

handellm commented 1 year ago

Youenn, for some reason I can't reply to your comment.

It would be good to rephrase the use of the new slots like the other slots in terms of "Constrain the configuration of |recorder| to encode"....

Tried to do this in 0df3dba, wdyt?

In macOS and iOS, when setting both values, I think the min is used (I'll check) but the spec is asking for the max.

The rationale for this spec text is what I discussed with Harald earlier:

"The rationale is to support variable-rate sources where you get low per-frame quality (QP) in stretches where FPS is low with a pure time-based control, since the ratio of keyframes (in which quality is lower) to delta frames (in which quality is enhanced) would be higher in the low FPS stretches. See also earlier comment in https://github.com/w3c/mediacapture-record/pull/216#issuecomment-1415989756"

I realize Apple may not currently be using those rules, but on the other hand the spec includes Apple's implementation by "If neither of the slots are defined, the User Agent may emit key frames as it deems fit.". Wdyt?

handellm commented 1 year ago

Friendly ping!

handellm commented 1 year ago

Thanks for comments! New iteration uploaded, PTAL again.

handellm commented 1 year ago

Youenn, any thoughts from you?

jan-ivar commented 1 year ago

Thanks for catching the seconds! Approved with milliseconds to follow § 8.3. Use milliseconds for time measurement and for consistency with existing timeslice.

w3c / mediacapture-record

Add keyframe interval support. #216