Allow decoder to ignore corrupted frames

matanui159 commented 1 year ago

Currently the spec is written that (for example) for the VideoDecoder, as soon as one frame fails to decode the entire decoder becomes unusable. It is not even possible to reconfigure and start at the next key-frame, the decoder must be entirely recreated if you want to continue.

This works fine for controlled scenarios, either using known videos or from a known encoder. However, for arbitrary files provided by users this might be unexpected. Most video players (including the browsers own video element usually) seem to just drop the frame. This will produce garbage until the next keyframe but videos with only a few corrupted frames are surprisingly common. In other APIs like FFmpeg's libavcodec, a decoder can still be used after an error allowing for this use-case.

I was wondering if there was a reason for this, or if there has been considerations to change this behaviour? Maybe behind a decoder configuration option (if certain decoders do not allow this), or make it require reconfiguration or another keyframe?

aboba commented 1 year ago

Can you cite the text that you believe requires this behavior? The text around flush() and reset() would suggest that a decoder error can be non-terminal. For example, an error decoding an SVC extension layer frame shouldn't prevent decoding of subsequent base layer frames.

matanui159 commented 1 year ago

In VideoDecoder.decode(chunk) one of the steps is:

If decoding results in an error, queue a task to run the Close VideoDecoder algorithm with EncodingError and return.

Once the decoder is in the "closed" state, both configure() and reset() will fail:

If state is "closed", throw an InvalidStateError.

dalecurtis commented 1 year ago

We could probably clarify that text to differentiate between fatal and non-fatal errors. Unfortunately that's going to vary by implementation. E.g., for security reasons all errors in Chromium's software decoder are fatal.

In general though I don't think the UA should be deciding what's recoverable and what's not. The point of a low level API like this is to defer that decision to the client. If the client wishes to allow playback to continue by resuming from the next keyframe they can respin a decoder and do so. Most professional content sites would not want such behavior since it could hide corpus errors.

For codecs which have built in error resilience, the current text isn't intended to prevent them from doing so. So we could at least clarify that too.

seflless commented 1 year ago

Does anyone on here have an example video file with one or more corrupted frames?

seflless commented 1 year ago

I'd love to add one to a test suite.

dalecurtis commented 1 year ago

https://source.chromium.org/chromium/chromium/src/+/main:media/test/data/ has a couple corrupt vp8 frames. In general we generate corrupt frames for testing either through fuzzers or by truncating valid data.

matanui159 commented 1 year ago

E.g., for security reasons all errors in Chromium's software decoder are fatal.

I have since noticed that HTMLVideoElement completely stops working as well once it finds a corrupted frame. Is this relatively new? I recall it working fine after corrupted frames but I might be remembering incorrectly.

Does anyone on here have an example video file with one or more corrupted frames?

The way I tested it was just by opening the video in a hex editor and going to some random point in the file and setting a byte or two to 0x00 (and hoping it doesn't completely break the file, although most of the data is the video anyway).

I'll have a play around with recreating the WebCodecs decoder and waiting for the next keyframe to see how well that works.

dalecurtis commented 1 year ago

Chrome has always stopped on video decoding errors and stops on audio decoding errors as of several years ago.

matanui159 commented 1 year ago

I tried another corrupted video and while WebCodecs still fails on the first corrupted frame, HTMLVideoFrame seems to be able to play through the entire thing just fine?

Also a smaller issue but the in the spec:

If decoding results in an error, queue a task to run the Close VideoDecoder algorithm with EncodingError and return.

Where EncodingError is described as "The encoding operation (either encoded or decoding) failed.".

However for Chrome, which I've been testing on, it instead reports an OperationError (with the message set to "Decoding error.").

padenot commented 1 year ago

We discussed this during the editor call:

What's the difference between resetting a decoder internally and tearing one down and getting a new one, in practice, in terms of duration? This can translate directly in quality of experience (time to recover).
If a decoder that has a number of instances that's limited is closed because of an error, can't another app acquire it, and so the web app can't re-create one?
Should we have a policy of always making all errors fatal like Chrome's current implementation?

dalecurtis commented 1 year ago

I think the depth of reset() varies by implementation. We've certainly had some hardware decoder instances in the past where reset() is insufficient to recover them. From a security perspective you'd want to be sure that all corrupt state is purged, it's unclear without inspecting each platform's implementation that is true. Which from a security pov means it can't be relied upon.
On platforms where there are such low limits on codec concurrency, you're also unlikely to have high app concurrency, so I think the risk of another app taking the codec is not worth worrying about.
The spec text currently says errors are fatal, so currently implementations are required to error out. What constitutes an error will depend on the codec though -- e.g., built in error resilience may mean no error is surfaced.

dalecurtis commented 1 year ago

I tried another corrupted video and while WebCodecs still fails on the first corrupted frame, HTMLVideoFrame seems to be able to play through the entire thing just fine?

Likely this means the hardware decoder doesn't like the content but the software decoder can decode it okay. Chrome currently doesn't implement fallback for decoding with no-preference -- we were about to but then #604 came up for discussion and we haven't got back to it.

You likely want to use prefer-software for this case if the hardware decoder can't decode it.

Also a smaller issue but the in the spec:

If decoding results in an error, queue a task to run the Close VideoDecoder algorithm with EncodingError and return.

Where EncodingError is described as "The encoding operation (either encoded or decoding) failed.".

However for Chrome, which I've been testing on, it instead reports an OperationError (with the message set to "Decoding error.").

Whoops, filed https://bugs.chromium.org/p/chromium/issues/detail?id=1429352

aboba commented 1 year ago

How does the application figure out what to do next?

If "prefer-hardware" was used to configure the (now closed) decoder, should the application attempt to construct another decoder and configure it to "prefer-hardware"? This might work if the decoder experienced an error but hardware resources are available.
Or should it attempt to create another decoder and configure it with "prefer-software" (e.g. software failover)? If hardware resources are not available or there is another serious issue (e.g. GPU crash) then Option 1 might fail, and failover to software might be required.

aboba commented 1 year ago

I am not sure how this issue relates to "error resilient codecs" (e.g. codecs configured with a scalabilityMode, Opus with built-in FEC or external FEC/RTX/RED).

For a codec configured with a scalabiltyMode, the receiver should check dependencies and chains before feeding an encoded chunk to the decoder, to make sure that the chunk will be decodable. If the dependencies are met and the chains are unbroken, the chunk should be decodable even if other (discardable) frames have been lost. So no decoder error should be generated.

For a codec with internal resilience (e.g. Opus with FEC or DRED), we've discussed how to indicate missing frames to the decoder. Presumably, this wouldn't result in a decoder error unless the codec couldn't reconstruct the missing frames.

In the case of external resilience, the frame isn't fed to the decoder until "holes" in the jitter buffer are filled in and the frame has been wholly received or reconstructed. Again, I wouldn't expect a decoder error unless something has gone wrong with external reslience.

Djuffin commented 1 year ago

Media WG meeting: we're going to review spec exceptions for encoder() and configure() and see if we should introduce a distinction retryable and non-retryable errors (UA having some internal failures vs corrupted data error).

chrisn commented 1 year ago

Meeting minutes: https://www.w3.org/2023/04/11-mediawg-minutes.html#t01

aboba commented 1 year ago

To be resolved via https://github.com/w3c/webcodecs/issues/669

w3c / webcodecs

Allow decoder to ignore corrupted frames #656