w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

Reference frame control #285

Open aboba opened 3 years ago

aboba commented 3 years ago

In situations where an application desires to use a custom scalability mode (e.g. a mode other than one of the scalabilityMode values defined in WebRTC-SVC), or perhaps in response to loss, it may be desirable for the application to control the reference frames used in the encoding process.

aboba commented 3 years ago

One potential way forward would be to add a dependsOnIds field to VideoEncoderEncodeOptions.
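A purely hypothetical sketch of what that might look like from JavaScript (the field name dependsOnIds and the id scheme are assumptions, not anything in the current spec; encoder is assumed to be a configured VideoEncoder):

// Hypothetical: the application identifies previously submitted frames by id
// and lists which of them the encoder may use for prediction.
encoder.encode(frame, { dependsOnIds: [37, 40] });  // assumed new field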

Djuffin commented 6 months ago

We'd like to have some progress on this issue in the near future.

General concept for manual temporal scalability

Frame dependencies are described via the concept of encoder buffers:

Outline of the API extension for reference frame control

  1. A new setting in the VideoEncoderConfig to activate manual reference picture selection.
  2. A new opaque data type representing an internal encoder buffer (VideoEncoderBuffer). It provides a codec-agnostic way to reference an encoding buffer.
  3. VideoEncoderEncodeOptions (the second argument of VideoEncoder.encode()) needs new fields to describe dependencies.
partial dictionary VideoEncoderConfig {
    DOMString scalabilityMode; // new value "manual"
};

// This interface can't be constructed; it can only be obtained via calls to VideoEncoder
interface VideoEncoderBuffer {
    readonly attribute DOMString id;
};

partial dictionary VideoEncoderEncodeOptions {
    // buffers that can be used for inter-frame prediction while encoding a given      
    // frame. If this array is empty we basically ask for an intra-frame.
    sequence<VideoEncoderBuffer> referenceBuffers;

    // a buffer where the encoded frame should be saved after encoding
    VideoEncoderBuffer updateBuffer;    
};

partial interface VideoEncoder {
    // get a list of all buffers that can be used while encoding
    sequence<VideoEncoderBuffer> getAllFrameBuffers();  
};
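To illustrate how this might be used, a hypothetical sketch (not part of the proposal text; the codec string, buffer indices, and the handleChunk / firstFrame / nextFrame names are placeholders):

const encoder = new VideoEncoder({ output: handleChunk, error: console.error });
encoder.configure({
  codec: 'vp09.00.10.08',
  width: 1280,
  height: 720,
  scalabilityMode: 'manual',  // proposed new value
});

const buffers = encoder.getAllFrameBuffers();  // proposed new method

// First frame: no references, i.e. an intra frame, stored in buffer 0.
encoder.encode(firstFrame, { referenceBuffers: [], updateBuffer: buffers[0] });

// Subsequent frames: predict from buffer 0 and overwrite it with the new frame.
encoder.encode(nextFrame, { referenceBuffers: [buffers[0]], updateBuffer: buffers[0] });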
sprangerik commented 6 months ago

A few comments on the above proposal on reference frame control...

To start with, a nit: VideoEncoderEncodeOptions.updateBuffer needs to be optional. Typically frames belonging to the top temporal layer are not used as references and thus do not need to update any buffer.
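For example (hypothetical, assuming updateBuffer is indeed made optional, and reusing the encoder and buffers handles from the sketch above), a top-layer frame would reference the base layer without storing itself anywhere:

// T1 frame: predict from buffer 0, but don't update any buffer,
// since no later frame will reference it.
encoder.encode(frame, { referenceBuffers: [buffers[0]] });  // updateBuffer omitted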

Understanding Constraints

In order for this to actually work, there are a few more things we need. Specifically:

  1. The ability to create a VideoEncoder instance representing a single underlying instance of a single implementation
  2. The ability to reason about the constraints for that implementation

Rationales:

(1) Manually creating a reference structure requires the user to understand what any given reference buffer contains - at any given point in time. This means that any run-time change to the implementation (such as automatic fallback from hardware to software encoding, or recreation of an encoder due to an error) will cause the state to change and leave the user in a very confusing position. To avoid this, I suggest that encoders used for manual mode need to be created with a flag indicating which exact implementation to use.

(2) Furthermore, we need to know how a given implementation allows reference buffers to be used. In particular for this proposal we need:

What we mean by "start frame" here is an all-intra frame from which a new decoder can start decoding. So in essence a key-frame, but one that does not implicitly reset all other reference buffers and codec state. A start frame could be requested with the above proposal by letting VideoEncoderEncodeOptions.referenceBuffers be an empty set.
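In terms of the proposed options, such a start frame request might look like this (a hypothetical sketch, assuming encoder and buffers as above):

// Start frame: all-intra (empty reference set), but without resetting the
// other reference buffers or codec state. Store it so later frames can use it.
encoder.encode(frame, { referenceBuffers: [], updateBuffer: buffers[0] });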

Limiting the Scope

While we aim for an API that would support a rich set of RTC related features, it is probably for the best to do that in steps. As a first step the scope could be limited to...

Even with such a reduced scope it’s still possible to implement a large variety of use cases including temporal layers, LTR and other forms of recovery frames, pseudo B-frames and more.

Implications for Encoders

The reference frame control scheme discussed here is codec agnostic. It can work for H26x, VPx or AV1. It does however come with some codec- and implementation-specific implications. We won't go into detail on all of them here, but as an example, we do not explicitly model other codec state, such as probabilities for entropy coding.

Instead we tie the state to reference buffers so that if an encoding references a buffer it also implicitly depends on any other state needed to decode that reference. It is otherwise up to codec implementers to set codec specifics such as “resilient mode” in a way that allows this model to work.

Concrete Suggestion / Example

An example of what the API could look like:

// Uniquely identifies an encoder implementation.
interface VideoEncoderIdentifier {
  readonly attribute DOMString id;                  // unique identifier for this entry
  readonly attribute DOMString codecName;           // e.g. "av01.0.04M.08"
  readonly attribute DOMString implementationName;  // e.g. "libaom"
};

dictionary VideoEncoderPredictionConstraints {
  unsigned long maxReferencesPerFrame;
};

dictionary VideoEncoderPerformanceCharacteristics {
  boolean is_hardware;
};

dictionary VideoEncoderCapabilities {
  VideoEncoderPredictionConstraints prediction_constraints;
  VideoEncoderPerformanceCharacteristics performance_characteristics;
};

partial interface VideoEncoder {
  // Get a map containing all the available video encoder implementations and their respective capabilities.
  static record<VideoEncoderIdentifier, VideoEncoderCapabilities> enumerateAvailableImplementations();  
};

partial dictionary VideoEncoderInit {
  DOMString encoderId;  // optional; matches the id of a VideoEncoderIdentifier
};

The intended workflow (sketched in code after the list) is to:

  1. Enumerate the available implementations
  2. Find the implementation that suits your needs, and create a new VideoEncoder instance with the id matching your request added to the constructor parameters.
  3. Configure the VideoEncoder instance with the scalability mode set to “manual” in the VideoEncoderConfig
  4. Submit frames with reference control parameters set to whatever you want, as long as they fulfill the requirements that were specified for the implementation from (2)
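A rough sketch of that workflow (hypothetical; the exact shape of the returned record is still open, so it's assumed here to behave like a map from implementation id to capabilities, and handleChunk / frame are placeholders):

// 1. Enumerate the available implementations (proposed static method).
const impls = VideoEncoder.enumerateAvailableImplementations();

// 2. Pick one that suits our needs, e.g. a hardware encoder that allows
//    at least two references per frame.
let chosenId;
for (const [id, caps] of Object.entries(impls)) {
  if (caps.performance_characteristics.is_hardware &&
      caps.prediction_constraints.maxReferencesPerFrame >= 2) {
    chosenId = id;
    break;
  }
}

// Create the encoder pinned to that implementation (proposed encoderId field).
const encoder = new VideoEncoder({
  output: handleChunk,
  error: console.error,
  encoderId: chosenId,
});

// 3. Configure with the proposed "manual" scalability mode.
encoder.configure({
  codec: 'av01.0.04M.08',
  width: 1280,
  height: 720,
  scalabilityMode: 'manual',
});

// 4. Encode with explicit reference control, within the constraints from (2).
const buffers = encoder.getAllFrameBuffers();
encoder.encode(frame, { referenceBuffers: [buffers[0]], updateBuffer: buffers[1] });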

Next Steps

What are the next steps? Do we want to discuss all the details outlined above in this issue, or do we split it up? What parts do you feel are the most important to discuss?

aboba commented 6 months ago

The extensions to support reference buffers appear straightforward, but I have some questions about the discovery functionality:

sprangerik commented 6 months ago

This requires a new API, rather than an extension to isConfigSupported() or Media Capabilities, correct?

Correct, I don't think it's feasible to reuse those APIs. You'd basically have to query every permutation of references you'd want to make to see if it would work. That said, isConfigSupported() could still be used in conjunction with this - for instance to see if CBR is supported in combination with "manual" scalability mode for the given encoder you've created. Now that too has some issues. For instance, VideoEncoderConfig.hardwareAcceleration is basically a no-op if you have created a fixed implementation. I can also see an argument for going in the other direction and removing isConfigSupported entirely for these fixed-implementation instances and having all the information about the encoder in VideoEncoderCapabilities.
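For instance, such a check might look like this (a hypothetical sketch; how a static check would interact with a fixed-implementation instance is exactly the open question above):

// Does this configuration support constant bitrate together with the
// proposed "manual" scalability mode?
const { supported } = await VideoEncoder.isConfigSupported({
  codec: 'av01.0.04M.08',
  width: 1280,
  height: 720,
  bitrateMode: 'constant',
  scalabilityMode: 'manual',  // proposed new value
});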

How significant are the additions to the fingerprinting surface? ...

I'd say it's not really increased at all for most users. All the metadata about the encoder follows directly from the implementation in use. So even if there are a lot of new fields, they don't actually add any more information usable for fingerprinting than what is already there. This just adds more useful structure to that information.

In fact you don't even need GPU/OS information from other APIs - just the current WebCodecs will do:

Even if we try to inject noise to prevent that sort of scheme, in my experience it's relatively easy to identify an encoder implementation just by parsing the uncompressed header of a VPx/AV1 frame and looking at how it structures the data. So by allowing the encoder to be used at all, it's already outed...

The only new information I can see is if this mechanism allows access to encoders that were previously not used. E.g. I'm unsure about which implementation is selected by the browser if a laptop with dual GPUs is used. Presumably, this new mechanism would expose both of them and allow the user to select whichever one they prefer. This still seems like an overall benefit to me. Then again, maybe this information too is already available via other GPU-centric APIs? I have not looked deeply into that.

aboba commented 2 months ago

@djuffin @sprangerik @fippo Are extensions also needed to the Decoder API?

For example, how can an application ensure that a potential reference is still available on the decoder (e.g. that it is treated as an LTR)? It seems like this would be needed to implement LTR-based recovery.

For example, instead of recovering a base-layer loss by encoding a keyframe, a sender could use an LTR as a reference for a P-frame. However, an SFM would only forward the request for LTR recovery to a sender if it believes that participants have been sent the LTR, and have received, decoded and retained it. If these conditions aren't met, then the LTR-referencing P-frame could elicit PLIs or other frame loss indications from (multiple) participants, and it would make more sense for the SFM to forward a PLI or FIR to the sender instead.

sprangerik commented 2 months ago

I'd like to avoid the decoder being involved in LTR at all. The only thing really needed is to understand which frames have actually been decoded. It seems much more straightforward to explicitly ack decoded frames instead, either via something custom (in the case of e.g. WebTransport), indirectly via transport-cc (with some complex logic), or via a new simple RTCP message.

aboba commented 2 months ago

The receiver can track what frames are decoded and we have means of communicating what was decoded (e.g. LTN in libwebrtc). If the reference was decoded on the receiver, can an SFM or encoder always assume that it is available for use as a reference? Or is there some point at which the decoder reclaims the reference? As an example, if the GoP is large, the LTR might have been decoded hundreds of frames ago.

sprangerik commented 2 months ago

It's really up to the sender to determine how the buffers are reused. Depending on how you want to implement LTR, the sender could for instance have two buffers reserved for LTR: one that is known to be received and decoded by the remote - and another that has been sent but not yet acknowledged. Once an ack is received for the pending long-term reference, you can safely flip the two buffers, stash a new frame in the previous buffer and repeat this process.

This way you always have a "known good reference" to use if a recovery request arrives.

There are many other ways LTR can be implemented though, and their applicability depends on things like 1:1 vs multi-way, the current RTT, and more. So again, the sending application will be in the best position to know if LTR is suitable at all and if so what type to use.
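For concreteness, a hypothetical sketch of the two-buffer scheme above in terms of the proposed options (the ack signalling itself is transport-specific and left abstract here):

const buffers = encoder.getAllFrameBuffers();
let confirmedLtr = buffers[0];  // known to be received and decoded by the remote
let pendingLtr = buffers[1];    // sent, but not yet acknowledged

// Periodically refresh the pending long-term reference.
function refreshLtr(frame) {
  encoder.encode(frame, {
    referenceBuffers: [confirmedLtr],
    updateBuffer: pendingLtr,
  });
}

// Called when the receiver acknowledges decoding the pending LTR frame.
function onLtrAcked() {
  [confirmedLtr, pendingLtr] = [pendingLtr, confirmedLtr];
}

// On a recovery request, encode a P-frame off the confirmed LTR
// instead of forcing a full keyframe.
function onRecoveryRequest(frame) {
  encoder.encode(frame, { referenceBuffers: [confirmedLtr] });
}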

aboba commented 2 months ago

The sender determines how the buffers are (re)used, but to make the determination it needs info from the potential recipients. RPSI was designed for the 1-1 case where a potential recipient can indicate that they have an LTR, but conferencing is more complicated. Due to potential late joiners who may not have received or decoded the potential LTR, the sender needs to know whether it will receive PLIs in response to a P-frame referencing the potential LTR. That is where custom RTCP messages like LTN come in. If a participant has received and decoded the LTR, can it indicate that in an RTCP message, regardless of how long ago it was decoded? Is there anything the participant needs to do in order to ensure that the reference is indeed stored "long term"?

sprangerik commented 2 months ago

There's nothing really limiting the age of a long-term reference buffer, no. As a side note, H26x has support for both short-term and long-term reference buffers, with somewhat different reference mechanisms. In this proposal we're limiting use to just long-term buffers because those map well to VPx/AVx as well, but there's no similar concept of short-term references in those.

So back to your question: I'd say it's more up to the feedback system - how does it reference previous frames, and is there some limiting factor in the format that communicates that? That is a discussion I think we should start, but it might be more suitable in a different group (IETF?) since it's really a transport problem rather than a coding one.