w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

Reference frame control #285

Open aboba opened 3 years ago

aboba commented 3 years ago

In situations where an application desires to use custom scalability mode (e.g. a mode other than one of the scalabilityMode values defined in WebRTC-SVC), or perhaps in response to loss, it may be desirable for the application to control the reference frames used in the encoding process.

aboba commented 2 years ago

One potential way forward would be to add a dependsOnIds field to VideoEncoderEncodeOptions.

Djuffin commented 3 months ago

We'd like to have some progress on this issue in the near future.

General concept for manual temporal scalability

Frame dependencies are described via the concept of encoder buffers.

Outline of the API extension for reference frame control

  1. A new setting in VideoEncoderConfig to activate manual reference picture selection.
  2. A new opaque data type, VideoEncoderBuffer, representing an internal encoder buffer. It provides a codec-agnostic way to refer to an encoder's reference buffers.
  3. A new field in VideoEncoderEncodeOptions (the second argument of VideoEncoder.encode()) to describe dependencies.

partial dictionary VideoEncoderConfig {
    DOMString scalabilityMode; // new value "manual"
};

// This interface can't be constructed; it can only be obtained via calls to VideoEncoder.
interface VideoEncoderBuffer {
    readonly attribute DOMString id;
};

partial dictionary VideoEncoderEncodeOptions {
    // Buffers that can be used for inter-frame prediction while encoding a given
    // frame. If this array is empty, an intra-frame is requested.
    sequence<VideoEncoderBuffer> referenceBuffers;

    // The buffer where the encoded frame should be saved after encoding.
    VideoEncoderBuffer updateBuffer;
};

partial interface VideoEncoder {
    // Get a list of all buffers that can be used while encoding.
    sequence<VideoEncoderBuffer> getAllFrameBuffers();
};
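
To make the intended usage concrete, here is a sketch of two temporal layers (a base layer plus a droppable top layer) built on the extensions above. scalabilityMode: "manual", getAllFrameBuffers(), referenceBuffers, and updateBuffer are the proposed additions, not shipped WebCodecs API, and the codec string is just an example.

// Sketch: two-layer temporal scalability (L0 + droppable L1) using the
// proposed "manual" mode.
const encoder = new VideoEncoder({
  output: (chunk, metadata) => { /* packetize and send */ },
  error: (e) => console.error(e),
});
encoder.configure({
  codec: "vp09.00.10.08",
  width: 1280,
  height: 720,
  scalabilityMode: "manual", // proposed new value
});

const [base] = encoder.getAllFrameBuffers(); // proposed method

let frameIndex = 0;
function encodeFrame(frame: VideoFrame) {
  if (frameIndex % 2 === 0) {
    // Base layer (L0): predict from the base buffer and refresh it.
    encoder.encode(frame, { referenceBuffers: [base], updateBuffer: base });
  } else {
    // Top layer (L1): predict from the base buffer but update nothing, so
    // the frame is never used as a reference and can safely be dropped.
    encoder.encode(frame, { referenceBuffers: [base] });
  }
  frameIndex++;
  frame.close();
}
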
sprangerik commented 3 months ago

A few comments on the above proposal on reference frame control...

To start with, a nit: VideoEncoderEncodeOptions.updateBuffer needs to be optional. Typically frames belonging to the top temporal layer are not used as references and thus do not need to update any buffer.

Understanding Constraints

In order for this to actually work, there are a few more things we need. Specifically:

  1. The ability to create a VideoEncoder instance representing a single underlying instance of a single implementation
  2. The ability to reason about the constraints for that implementation

Rationales:

(1) Manually creating a reference structure requires the user to understand what any given reference buffer contains at any given point in time. This means that any run-time change to the implementation (such as automatic fallback from hardware to software encoding, or recreation of an encoder due to an error) will change that state and leave the user in a very confusing position. To avoid this, I suggest that encoders used for manual mode be created with a flag indicating which exact implementation to use.

(2) Furthermore, we need to know how a given implementation allows reference buffers to be used. In particular, for this proposal we need constraints such as the maximum number of reference buffers usable per frame, and whether "start frames" are supported.

What we mean by "start frame" here is an all-intra frame from which a new decoder can start decoding. So in essence a key-frame, but one that does not implicitly reset all other reference buffers and codec state. A start frame could be requested with the above proposal by letting VideoEncoderEncodeOptions.referenceBuffers be an empty set.
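
Under the proposal, a start frame could then be requested like this (a sketch; buffers would come from the proposed getAllFrameBuffers(), and the buffer index is arbitrary):

// Request a "start frame": no reference buffers allowed, so the frame is
// all-intra, but only the designated buffer is refreshed; the remaining
// reference buffers and their state survive.
encoder.encode(frame, {
  referenceBuffers: [],      // empty set -> intra-only frame
  updateBuffer: buffers[0],  // refresh just this buffer
});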

Limiting the Scope

While we aim for an API that would support a rich set of RTC related features, it is probably for the best to do that in steps. As a first step the scope could be limited to...

Even with such a reduced scope it’s still possible to implement a large variety of use cases including temporal layers, LTR and other forms of recovery frames, pseudo B-frames and more.
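
As an illustration of one of those use cases, here is a sketch of loss recovery via a long-term reference (LTR) under the proposed API; the two buffer roles and the acknowledgement signal (e.g. RTCP feedback) are assumptions made for the example.

// Keep one buffer holding a frame the receiver has confirmed, and on
// reported loss encode a recovery frame referencing only that buffer
// instead of sending a full keyframe.
const [shortTerm, longTerm] = encoder.getAllFrameBuffers();

function encodeNormal(frame: VideoFrame) {
  encoder.encode(frame, { referenceBuffers: [shortTerm], updateBuffer: shortTerm });
}

function refreshLongTerm(frame: VideoFrame) {
  // Periodically store a frame we expect the receiver to acknowledge.
  encoder.encode(frame, { referenceBuffers: [shortTerm], updateBuffer: longTerm });
}

function encodeRecovery(frame: VideoFrame) {
  // Receiver reported loss: predict only from the acknowledged long-term
  // buffer and rebuild the short-term chain from there.
  encoder.encode(frame, { referenceBuffers: [longTerm], updateBuffer: shortTerm });
}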

Implications for Encoders

The reference frame control scheme discussed here is codec agnostic. It can work for H26x, VPx, or AV1. It does, however, come with some codec- and implementation-specific implications. We won't go into detail on all of them here, but as an example, we do not explicitly model other codec state, such as the probabilities used for entropy coding.

Instead we tie the state to reference buffers so that if an encoding references a buffer it also implicitly depends on any other state needed to decode that reference. It is otherwise up to codec implementers to set codec specifics such as “resilient mode” in a way that allows this model to work.

Concrete Suggestion / Example

An example of what the API could look like:

// Uniquely identifies an encoder implementation.
dictionary VideoEncoderIdentifier {
  DOMString id;                  // unique identifier for this entry
  DOMString codecName;           // e.g. "av01.0.04M.08"
  DOMString implementationName;  // e.g. "libaom"
};

dictionary VideoEncoderPredictionConstraints {
  unsigned long maxReferencesPerFrame;
};

dictionary VideoEncoderPerformanceCharacteristics {
  boolean isHardware;
};

dictionary VideoEncoderCapabilities {
  VideoEncoderIdentifier identifier;
  VideoEncoderPredictionConstraints predictionConstraints;
  VideoEncoderPerformanceCharacteristics performanceCharacteristics;
};

partial interface VideoEncoder {
  // Get a map of all the available video encoder implementations and their
  // respective capabilities, keyed by VideoEncoderIdentifier.id (WebIDL
  // records require string keys).
  static record<DOMString, VideoEncoderCapabilities> enumerateAvailableImplementations();
};

partial dictionary VideoEncoderInit {
  DOMString encoderId;  // matches the id of a VideoEncoderIdentifier
};

The intended workflow is to:

  1. Enumerate the available implementations.
  2. Find the implementation that suits your needs, and create a new VideoEncoder instance, passing the matching id in the constructor parameters.
  3. Configure the VideoEncoder instance with the scalability mode set to “manual” in the VideoEncoderConfig.
  4. Submit frames with the reference control parameters set to whatever you want, as long as they fulfill the constraints reported for the implementation chosen in step 2 (see the sketch below).
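
Putting the pieces together, a sketch of that workflow (enumerateAvailableImplementations(), encoderId, and the capability field names are the proposed extensions above):

const handleChunk = (chunk: EncodedVideoChunk) => { /* packetize and send */ };

// Step 1: enumerate, keyed by implementation id.
const implementations = VideoEncoder.enumerateAvailableImplementations();

// Step 2: pick, say, a hardware AV1 encoder that allows at least 2 references.
const chosen = Object.values(implementations).find((caps) =>
  caps.identifier.codecName.startsWith("av01") &&
  caps.performanceCharacteristics.isHardware &&
  caps.predictionConstraints.maxReferencesPerFrame >= 2);
if (!chosen) throw new Error("no suitable encoder implementation");

const encoder = new VideoEncoder({
  output: handleChunk,
  error: console.error,
  encoderId: chosen.identifier.id, // proposed VideoEncoderInit extension
});

// Step 3: configure with manual scalability.
encoder.configure({
  codec: chosen.identifier.codecName,
  width: 1280,
  height: 720,
  scalabilityMode: "manual",
});

// Step 4: encode with explicit reference control, staying within
// chosen.predictionConstraints (see the earlier examples).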

Next Steps

What are the next steps? Do we want to discuss all the details outlined above in this issue, or do we split it up? What parts do you feel are the most important to discuss?

aboba commented 3 months ago

The extensions to support reference buffers appear straightforward, but I have some questions about the discovery functionality:

sprangerik commented 3 months ago

This requires a new API, rather than an extension to isConfigSupported() or Media Capabilities, correct?

Correct, I don't think it's feasible to reuse those APIs. You'd basically have to query every permutation of references you might want to make in order to see if it would work. That said, isConfigSupported() could still be used in conjunction with this, for instance to see if CBR is supported in combination with "manual" scalability mode for the given encoder you've created. That too has some issues, though. For instance, VideoEncoderConfig.hardwareAcceleration is basically a no-op if you have created a fixed implementation. I can also see an argument for going in the other direction: removing isConfigSupported() entirely for these fixed-implementation instances and putting all the information about the encoder in VideoEncoderCapabilities.
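
For instance (a sketch; "manual" is the proposed scalabilityMode value, the rest is existing WebCodecs API):

// Check whether CBR can be combined with "manual" mode for this codec.
const { supported } = await VideoEncoder.isConfigSupported({
  codec: "av01.0.04M.08",
  width: 1280,
  height: 720,
  bitrate: 1_000_000,
  bitrateMode: "constant",
  scalabilityMode: "manual", // proposed value
});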

How significant are the additions to the fingerprinting surface? ...

I'd say it's not really increased at all for most users. All the metadata about the encoder follows directly from the implementation in use. So even if there are a lot of new fields, they don't actually add any more information usable for fingerprinting than what is already there. This just adds more useful structure to that information.

In fact, you don't even need GPU/OS information from other APIs; just the current WebCodecs API will do.

Even if we try to inject noise to prevent that sort of scheme, in my experience it's relatively easy to identify an encoder implementation just by parsing the uncompressed header of a VPx/AV1 frame and looking at how it structures the data. So by allowing the encoder to be used at all, it's already outed...

The only new information I can see is if this mechanism allows access to encoders that were previously not used. E.g. I'm unsure which implementation the browser selects on a laptop with dual GPUs. Presumably, this new mechanism would expose both of them and allow the user to select whichever one they prefer. This still seems like an overall benefit to me. Then again, maybe this information is already available via other GPU-centric APIs? I have not looked deeply into that.