w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

Reference frame control #285

Open aboba opened 3 years ago

aboba commented 3 years ago

In situations where an application desires to use a custom scalability mode (e.g. a mode other than one of the scalabilityMode values defined in WebRTC-SVC), or perhaps in response to loss, it may be desirable for the application to control the reference frames used in the encoding process.

aboba commented 3 years ago

One potential way forward would be to add a dependsOnIds field to VideoEncoderEncodeOptions.
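A purely hypothetical sketch of what that might look like from JavaScript (the field name dependsOnIds and the id scheme are assumptions, not anything in the current spec; encoder is assumed to be a configured VideoEncoder):

// Hypothetical: the application identifies previously submitted frames by id
// and lists which of them the encoder may use for prediction.
encoder.encode(frame, { dependsOnIds: [37, 40] });  // assumed new field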

Djuffin commented 6 months ago

We'd like to have some progress on this issue in the near future.

General concept for manual temporal scalability

Frame dependencies are described via the concept of encoder buffers:

Outline of the API extension for reference frame control

  1. A new setting in the VideoEncoderConfig to activate manual reference picture selection.
  2. A new opaque data type representing an internal encoder buffer (VideoEncoderBuffer). It provides a codec-agnostic way to reference an encoding buffer.
  3. VideoEncoderEncodeOptions (the second argument of VideoEncoder.encode()) needs new fields to describe dependencies.
partial dictionary VideoEncoderConfig {
    DOMString scalabilityMode; // new value "manual"
};

// This interface can't be constructed; it can only be obtained via calls to VideoEncoder
interface VideoEncoderBuffer {
    readonly attribute DOMString id;
};

partial dictionary VideoEncoderEncodeOptions {
    // buffers that can be used for inter-frame prediction while encoding a given      
    // frame. If this array is empty we basically ask for an intra-frame.
    sequence<VideoEncoderBuffer> referenceBuffers;

    // a buffer where the encoded frame should be saved after encoding
    VideoEncoderBuffer updateBuffer;    
};

partial interface VideoEncoder {
    // get a list of all buffers that can be used while encoding
    sequence<VideoEncoderBuffer> getAllFrameBuffers();  
};
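To illustrate how this might be used, a hypothetical sketch (not part of the proposal text; the codec string, buffer indices, and the handleChunk / firstFrame / nextFrame names are placeholders):

const encoder = new VideoEncoder({ output: handleChunk, error: console.error });
encoder.configure({
  codec: 'vp09.00.10.08',
  width: 1280,
  height: 720,
  scalabilityMode: 'manual',  // proposed new value
});

const buffers = encoder.getAllFrameBuffers();  // proposed new method

// First frame: no references, i.e. an intra frame, stored in buffer 0.
encoder.encode(firstFrame, { referenceBuffers: [], updateBuffer: buffers[0] });

// Subsequent frames: predict from buffer 0 and overwrite it with the new frame.
encoder.encode(nextFrame, { referenceBuffers: [buffers[0]], updateBuffer: buffers[0] });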
sprangerik commented 6 months ago

A few comments on the above proposal on reference frame control...

To start with, a nit: VideoEncoderEncodeOptions.updateBuffer needs to be optional. Typically frames belonging to the top temporal layer are not used as references and thus do not need to update any buffer.
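For example (hypothetical, assuming updateBuffer is indeed made optional, and reusing the encoder and buffers handles from the sketch above), a top-layer frame would reference the base layer without storing itself anywhere:

// T1 frame: predict from buffer 0, but don't update any buffer,
// since no later frame will reference it.
encoder.encode(frame, { referenceBuffers: [buffers[0]] });  // updateBuffer omitted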

Understanding Constraints

In order for this to actually work, there are a few more things we need. Specifically:

  1. The ability to create a VideoEncoder instance representing a single underlying instance of a single implementation
  2. The ability to reason about the constraints for that implementation

Rationales:

(1) Manually creating a reference structure requires the user to understand what any given reference buffer contains - at any given point in time. This means that any run-time change to the implementation (such as automatic fallback from hardware to software encoding, or recreation of an encoder due to an error) will cause the state to change and leave the user in a very confusing position. To avoid this, I suggest that encoders used for manual mode need to be created with a flag indicating which exact implementation to use.

(2) Furthermore, we need to know how a given implementation allows reference buffers to be used. In particular for this proposal we need:

What we mean by "start frame" here is an all-intra frame from which a new decoder can start decoding. So in essence a key-frame, but one that does not implicitly reset all other reference buffers and codec state. A start frame could be requested with the above proposal by letting VideoEncoderEncodeOptions.referenceBuffers be an empty set.
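In terms of the proposed options, such a start frame request might look like this (a hypothetical sketch, assuming encoder and buffers as above):

// Start frame: all-intra (empty reference set), but without resetting the
// other reference buffers or codec state. Store it so later frames can use it.
encoder.encode(frame, { referenceBuffers: [], updateBuffer: buffers[0] });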

Limiting the Scope

While we aim for an API that would support a rich set of RTC related features, it is probably for the best to do that in steps. As a first step the scope could be limited to...

Even with such a reduced scope it’s still possible to implement a large variety of use cases including temporal layers, LTR and other forms of recovery frames, pseudo B-frames and more.

Implications for Encoders

The reference frame control scheme discussed here is codec agnostic. It can work for H26x, VPx or AV1. It does however come with some codec- and implementation-specific implications. We won't go into detail on all of them here, but as an example, we do not explicitly model other codec state, such as probabilities for entropy coding.

Instead we tie the state to reference buffers so that if an encoding references a buffer it also implicitly depends on any other state needed to decode that reference. It is otherwise up to codec implementers to set codec specifics such as “resilient mode” in a way that allows this model to work.

Concrete Suggestion / Example

An example of what the API could look like:

// Uniquely identifies an encoder implementation.
interface VideoEncoderIdentifier {
  readonly attribute DOMString id;                  // unique identifier for this entry
  readonly attribute DOMString codecName;           // e.g. "av01.0.04M.08"
  readonly attribute DOMString implementationName;  // e.g. "libaom"
};

dictionary VideoEncoderPredictionConstraints {
  unsigned long maxReferencesPerFrame;
};

dictionary VideoEncoderPerformanceCharacteristics {
  boolean is_hardware;
};

dictionary VideoEncoderCapabilities {
  VideoEncoderPredictionConstraints prediction_constraints;
  VideoEncoderPerformanceCharacteristics performance_characteristics;
};

partial interface VideoEncoder {
  // Get a map containing all the available video encoder implementations and their respective capabilities.
  static record<VideoEncoderIdentifier, VideoEncoderCapabilities> enumerateAvailableImplementations();  
};

partial dictionary VideoEncoderInit {
  DOMString encoderId;  // optional; matches the id of a VideoEncoderIdentifier
};

The intended workflow (sketched in code after the list) is to:

  1. Enumerate the available implementations
  2. Find the implementation that suits your needs, and create a new VideoEncoder instance with the id matching your request added to the constructor parameters.
  3. Configure the VideoEncoder instance with the scalability mode set to “manual” in the VideoEncoderConfig
  4. Submit frames with reference control parameters set to whatever you want, as long as they fulfill the requirements that were specified for the implementation from (2)
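A rough sketch of that workflow (hypothetical; the exact shape of the returned record is still open, so it's assumed here to behave like a map from implementation id to capabilities, and handleChunk / frame are placeholders):

// 1. Enumerate the available implementations (proposed static method).
const impls = VideoEncoder.enumerateAvailableImplementations();

// 2. Pick one that suits our needs, e.g. a hardware encoder that allows
//    at least two references per frame.
let chosenId;
for (const [id, caps] of Object.entries(impls)) {
  if (caps.performance_characteristics.is_hardware &&
      caps.prediction_constraints.maxReferencesPerFrame >= 2) {
    chosenId = id;
    break;
  }
}

// Create the encoder pinned to that implementation (proposed encoderId field).
const encoder = new VideoEncoder({
  output: handleChunk,
  error: console.error,
  encoderId: chosenId,
});

// 3. Configure with the proposed "manual" scalability mode.
encoder.configure({
  codec: 'av01.0.04M.08',
  width: 1280,
  height: 720,
  scalabilityMode: 'manual',
});

// 4. Encode with explicit reference control, within the constraints from (2).
const buffers = encoder.getAllFrameBuffers();
encoder.encode(frame, { referenceBuffers: [buffers[0]], updateBuffer: buffers[1] });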

Next Steps

What are the next steps? Do we want to discuss all the details outlined above in this issue, or do we split it up? What parts do you feel are the most important to discuss?

aboba commented 6 months ago

The extensions to support reference buffers appear straightforward, but I have some questions about the discovery functionality:

sprangerik commented 6 months ago

This requires a new API, rather than an extension to isConfigSupported() or Media Capabilities, correct?

Correct, I don't think it's feasible to reuse those APIs. You'd basically have to query every permutation of references you'd want to make to see if it would work. That said, isConfigSupported() could still be used in conjunction with this - for instance to see if CBR is supported in combination with "manual" scalability mode for the given encoder you've created. Now that too has some issues. For instance, VideoEncoderConfig.hardwareAcceleration is basically a no-op if you have created a fixed implementation. I can also see an argument for going in the other direction and removing isConfigSupported entirely for these fixed-implementation instances and having all the information about the encoder in VideoEncoderCapabilities.
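For instance, such a check might look like this (a hypothetical sketch; how a static check would interact with a fixed-implementation instance is exactly the open question above):

// Does this configuration support constant bitrate together with the
// proposed "manual" scalability mode?
const { supported } = await VideoEncoder.isConfigSupported({
  codec: 'av01.0.04M.08',
  width: 1280,
  height: 720,
  bitrateMode: 'constant',
  scalabilityMode: 'manual',  // proposed new value
});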

How significant are the additions to the fingerprinting surface? ...

I'd say it's not really increased at all for most users. All the metadata about the encoder follows directly from the implementation in use. So even if there are a lot of new fields, they don't actually add any more information usable for fingerprinting than what is already there. This just adds more useful structure to that information.

In fact you don't even need GPU/OS information from other APIs - just the current WebCodecs will do:

Even if we try to inject noise to prevent that sort of scheme, in my experience it's relatively easy to identify an encoder implementation just by parsing the uncompressed header of a VPx/AV1 frame and looking at how it structures the data. So by allowing the encoder to be used at all, it's already outed...

The only new information I can see is if this mechanism allows access to encoders that were previously not used. E.g. I'm unsure about which implementation is selected by the browser if a laptop with dual GPUs is used. Presumably, this new mechanism would expose both of them and allow the user to select whichever one they prefer. This still seems like an overall benefit to me. Then again, maybe this information too is already available via other GPU-centric APIs? I have not looked deeply into that.

aboba commented 2 months ago

@djuffin @sprangerik @fippo Are extensions also needed to the Decoder API?

For example, how can an application ensure that a potential reference is still available on the decoder (e.g. that it is treated as an LTR)? It seems like this would be needed to implement LTR-based recovery.

For example, instead of recovering a base-layer loss by encoding a keyframe, a sender could use an LTR as a reference for a P-frame. However, an SFM would only forward the request for LTR recovery to a sender if it believes that participants have been sent the LTR, and have received, decoded and retained it. If these conditions aren't met, then the LTR-referencing P-frame could elicit PLIs or other frame loss indications from (multiple) participants, and it would make more sense for the SFM to forward a PLI or FIR to the sender instead.

sprangerik commented 2 months ago

I'd like to avoid the decoder being involved in LTR at all. The only thing really needed is to understand which frames have actually been decoded. It seems much more straightforward to explicitly ack decoded frames instead, either via something custom (in the case of e.g. WebTransport), indirectly via transport-cc (with some complex logic), or via a new simple RTCP message.

aboba commented 2 months ago

The receiver can track what frames are decoded and we have means of communicating what was decoded (e.g. LTN in libwebrtc). If the reference was decoded on the receiver, can an SFM or encoder always assume that it is available for use as a reference? Or is there some point at which the decoder reclaims the reference? As an example, if the GoP is large, the LTR might have been decoded hundreds of frames ago.

sprangerik commented 2 months ago

It's really up to the sender to determine how the buffers are reused. Depending on how you want to implement LTR, the sender could for instance have two buffers reserved for LTR: one that is known to be received and decoded by the remote - and another that has been sent but not yet acknowledged. Once an ack is received for the pending long-term reference, you can safely flip the two buffers, stash a new frame in the previous buffer and repeat this process.

This way you always have a "known good reference" to use if a recovery request arrives.

There are many other ways LTR can be implemented though, and their applicability depends on things like 1:1 vs multi-way, the current RTT, and more. So again, the sending application will be in the best position to know if LTR is suitable at all and if so what type to use.
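For concreteness, a hypothetical sketch of the two-buffer scheme above in terms of the proposed options (the ack signalling itself is transport-specific and left abstract here):

const buffers = encoder.getAllFrameBuffers();
let confirmedLtr = buffers[0];  // known to be received and decoded by the remote
let pendingLtr = buffers[1];    // sent, but not yet acknowledged

// Periodically refresh the pending long-term reference.
function refreshLtr(frame) {
  encoder.encode(frame, {
    referenceBuffers: [confirmedLtr],
    updateBuffer: pendingLtr,
  });
}

// Called when the receiver acknowledges decoding the pending LTR frame.
function onLtrAcked() {
  [confirmedLtr, pendingLtr] = [pendingLtr, confirmedLtr];
}

// On a recovery request, encode a P-frame off the confirmed LTR
// instead of forcing a full keyframe.
function onRecoveryRequest(frame) {
  encoder.encode(frame, { referenceBuffers: [confirmedLtr] });
}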

aboba commented 2 months ago

The sender determines how the buffers are (re)used, but to make the determination it needs info from the potential recipients. RPSI was designed for the 1-1 case where a potential recipient can indicate that they have an LTR, but conferencing is more complicated. Due to potential late joiners who may not have received or decoded the potential LTR, the sender needs to know whether it will receive PLIs in response to a P-frame referencing the potential LTR. That is where custom RTCP messages like LTN come in. If a participant has received and decoded the LTR, can it indicate that in an RTCP message, regardless of how long ago it was decoded? Is there anything the participant needs to do in order to ensure that the reference is indeed stored "long term"?

sprangerik commented 2 months ago

There's nothing really limiting the age of a long-term reference buffer, no. As a side note, H26x has support for both short-term and long-term reference buffers, with somewhat different reference mechanisms. In this proposal we're limiting use to just long-term buffers because those map well to VPx/AVx as well, but there's no similar concept of short-term references in those.

So back to your question: I'd say it's more up to the feedback system - how does it reference previous frames, and is there some limiting factor in the format that communicates that? That is a discussion I think we should start, but it might be more suitable in a different group (IETF?) since it's really a transport problem rather than a coding one.