After how many decode should the codec process the frames?

tobiasBora commented 7 months ago

I am having some issues understanding the proper way to decode frames to play a video at normal speed without caching all decoded frames (memory explosion):

If I just use decode, then the frames are sometimes processed later. For instance, in my experience in Chromium, the first frame is never processed and I need to send at least 2 frames to start the processing.
On the other hand, if I use flush, then not only do I need to be sure to restart on a key frame (which seems weird to me), which means that I must decode around 250 frames to reach the next key frame and store them in memory, but, more importantly, this creates a really significant slowdown. In my experiments, just removing .flush made a really choppy output waaaay more fluid.

So it seems like the only solution to have proper efficient decoding is to use only .decode and put "enough data" to be sure that the frames are processed… but this "enough" is not specified in the spec (my understanding of the spec is that enough = 1, but from my experience, at least Chromium does not follow this, as already mentioned).

padenot commented 7 months ago

This repo is the home of the Web Codecs specification, and it is usually preferred that questions go on Stack Overflow or another forum dedicated to questions and more easily searchable by others. That said, this is also searchable by others, so I'll answer here anyway.

This can generally depend on three things (maybe more?):

the video itself: depending on the codec and which parameters were set during encoding, decoding frame n can require decoding frames n+1 and n-1 (both ways, past and future), so multiple packets must be sent to the decoder to get the first video frame back. Obviously, Web Codecs and browser implementors have little control over this.
the decoder itself: it can be software, vendored into the browser implementation (typically vp8/9 and av1, the royalty-free codecs). The browser vendor might be able to tune it, for example by lowering the number of threads used for decoding. It can also be software in the OS (the browser vendor has no control over how it performs). It can also be hardware, using the OS APIs, and there is a lot of different hardware with different characteristics out there, both on mobile and desktop. All this can influence the number of input packets one must submit to get the first video frame back, and sometimes there's very little control. As an example, the H264 hardware decoder on Windows has dozens of frames of latency (hundreds of milliseconds worth of video content).
whether low-latency decoding has been requested: this generally lowers the number of input packets needed.

All that to say: there is no general answer.
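As a minimal sketch of the third point: WebCodecs lets a page request low-latency decoding through the `optimizeForLatency` member of the decoder config. It is only a hint, and implementations may ignore it. The helper names `lowLatencyConfig` and `configureLowLatency` below are hypothetical, and the decoder class is passed in as a parameter only so the sketch can run outside a browser; in a page it would be the global `VideoDecoder`.

```javascript
// Return a copy of a VideoDecoderConfig with the low-latency hint set.
// `optimizeForLatency` is a hint; a decoder may still buffer several inputs.
function lowLatencyConfig(baseConfig) {
  return { ...baseConfig, optimizeForLatency: true };
}

// Check support, then configure. `DecoderClass` is assumed to expose the
// static isConfigSupported() method, as VideoDecoder does in a browser.
async function configureLowLatency(decoder, baseConfig, DecoderClass) {
  const config = lowLatencyConfig(baseConfig);
  const { supported } = await DecoderClass.isConfigSupported(config);
  if (!supported) {
    throw new Error(`Unsupported decoder config: ${config.codec}`);
  }
  decoder.configure(config);
  return config;
}
```

In a browser this would be called as `configureLowLatency(decoder, config, VideoDecoder)`. Even with the hint set, the number of inputs needed before the first output still depends on the stream and the platform decoder.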

Flushing needing a key frame to restart is fairly normal, I'm not entirely sure why you say it's weird. flush() is to be called at the end of the video, or when seeking, not during general playback.
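To illustrate the seeking case, here is a rough sketch that flushes the decoder and then restarts from the closest key chunk at or before the seek target. It assumes the demuxed chunks are available as an array of objects with a `type` of `"key"` or `"delta"` (as `EncodedVideoChunk` has); `keyFrameIndexBefore` and `seekTo` are hypothetical helper names, and the sketch ignores queue backpressure for brevity.

```javascript
// Index of the last key chunk at or before `targetIndex`, or -1 if there is none.
function keyFrameIndexBefore(chunks, targetIndex) {
  for (let i = Math.min(targetIndex, chunks.length - 1); i >= 0; i--) {
    if (chunks[i].type === "key") return i;
  }
  return -1;
}

// Flush pending work, then decode from the key chunk up to the target.
async function seekTo(decoder, chunks, targetIndex) {
  const start = keyFrameIndexBefore(chunks, targetIndex);
  if (start < 0) throw new Error("no key chunk at or before the target");
  await decoder.flush(); // after a flush, the next chunk must be a key chunk
  const end = Math.min(targetIndex, chunks.length - 1);
  for (let i = start; i <= end; i++) {
    decoder.decode(chunks[i]);
  }
  return start;
}
```

The frames between the key chunk and the target still come out of the decoder's output callback; a player would close them without drawing them.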

To have proper efficient decoding, you send as much input as you can, then wait to receive the first output, at which point you queue another input packet. This generally allows saturating the underlying decoder implementation.

If you can't send input anymore because decodeQueueSize starts growing, you can wait for a "dequeue" event to be fired: this is the internal codec implementation telling you that it has more slots in its queue to produce more frames.
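The two paragraphs above can be sketched as a feed loop that pauses on the "dequeue" event whenever too many decode requests are pending. This is only an illustration: `once`, `feedWithBackpressure`, and the `maxQueued` threshold are hypothetical names, and the right threshold depends on the platform.

```javascript
// Resolve the next time `target` fires an event of type `type`.
function once(target, type) {
  return new Promise((resolve) => {
    target.addEventListener(type, resolve, { once: true });
  });
}

// Send every chunk, keeping at most `maxQueued` decode requests pending.
// `decoder` is expected to behave like a VideoDecoder: it exposes
// decodeQueueSize, decode(), and fires "dequeue" when the queue shrinks.
async function feedWithBackpressure(decoder, chunks, maxQueued) {
  for (const chunk of chunks) {
    while (decoder.decodeQueueSize >= maxQueued) {
      await once(decoder, "dequeue");
    }
    decoder.decode(chunk);
  }
}
```

Decoded frames still arrive through the decoder's output callback; this loop only controls how fast inputs are submitted.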

We (the specification editors, helped by other contributors) have written various sample apps using Web Codecs, with various codecs and scenarios: https://github.com/w3c/webcodecs/tree/main/samples is the source, hopefully clear and commented enough (let us know if not!), deployed at https://webcodecs-samples.netlify.app/ (not on gh pages because we need a couple of headers to be set for SharedArrayBuffer). As far as I'm aware, the samples work on all browsers implementing Web Codecs on all platforms, as much as possible (codec implementation / feature implementation is sometimes incomplete and will be more complete in the next few months -- they certainly work in Chromium and Safari and most of them work in Firefox with our work in progress patches).

Also, most of the above applies to video decoding generally and not only Web Codecs; it would be the same with e.g. ffmpeg/VideoToolbox/MediaCodec/wmf/pick your media framework.

tobiasBora commented 7 months ago

Thanks a lot for your detailed answer. I was not aware that a frame might need frames from the future; this might explain my issue indeed. But knowing that Windows has an even worse delay is a bit scary, I was thinking that the interface would be more uniform, abstracted by the browser. Is there a safe number of frames to decode in advance? I was queuing 250 frames (between 2 key frames), but I guess it is too much? (Actually I want to be able to play backward, that's why I need this.) But when I read the examples you gave, it seems like they use 3 ^^

> flush() is to be called at the end of the video, or when seeking, not during general playback.

Oh, that's good to know, it was not obvious from the docs I read (mostly Mozilla's). I was seeing it as a simple "wait until the frame is received", good to learn it is not. I was thinking it is weird to require a sync frame afterwards (if you can decode, why can't you restart from the last frame?), but I guess it is to be sure that people do not flush frames that need a frame in the future to be decoded.

> We […] have written various sample apps using Web Codecs

Oh, last time I checked I could find this one but it only plays as fast as possible, not in real time… but https://webcodecs-samples.netlify.app/audio-video-player/audio_video_player.html is exactly what I needed, it will be really useful. Thanks a lot, I have some stuff to study now!

tobiasBora commented 7 months ago

Actually, I have a question about:

    while (this.frameBuffer.length < FRAME_BUFFER_TARGET_SIZE &&
            this.decoder.decodeQueueSize < FRAME_BUFFER_TARGET_SIZE) {
      let chunk = await this.demuxer.getNextChunk();
      this.decoder.decode(chunk);
    }

My understanding is that this tries to saturate the decoder by sending decode messages until FRAME_BUFFER_TARGET_SIZE frames are decoded AND the queue has size FRAME_BUFFER_TARGET_SIZE. But what happens if the decoder is like really fast and does not saturate? My understanding is that it will create a huge list this.frameBuffer, but decodeQueueSize will stay around 0 or 1… which might end up in a memory crash. Is it just that the decoder is never that fast so it is not a problem in practice?

sandersdan commented 7 months ago

The loop will exit if either FRAME_BUFFER_TARGET_SIZE outputs are ready or FRAME_BUFFER_TARGET_SIZE inputs are pending.
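The point can be checked with a small standalone sketch of the loop condition. `shouldFeed` is a hypothetical helper that mirrors the `while` test quoted above; because both comparisons are joined by `&&`, the loop stops as soon as either count reaches the target.

```javascript
const FRAME_BUFFER_TARGET_SIZE = 3;

// True while the loop should keep sending chunks: BOTH counts must be below
// the target, so reaching the target on either one ends the loop.
function shouldFeed(bufferedFrames, decodeQueueSize, target = FRAME_BUFFER_TARGET_SIZE) {
  return bufferedFrames < target && decodeQueueSize < target;
}

console.log(shouldFeed(0, 0)); // true: nothing buffered, nothing pending
console.log(shouldFeed(3, 0)); // false: fast decoder, 3 outputs already ready
console.log(shouldFeed(0, 3)); // false: slow decoder, 3 inputs still pending
```

So a very fast decoder cannot grow the frame buffer without bound: once 3 frames are buffered, feeding stops until the renderer removes one.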

tobiasBora commented 7 months ago

Arg, stupid me, thanks, time to sleep.

tobiasBora commented 7 months ago

Oh, but now I don't understand anymore why it is supposed to work if Windows needs at least 10 decoded messages to start, as this value is hardcoded to FRAME_BUFFER_TARGET_SIZE = 3;. So if the decoder saturates directly, it will have 3 messages, so not enough to output a first frame, no?

sandersdan commented 7 months ago

The decoder will consume inputs, decreasing the decodeQueueSize, even while it does not produce output.

The exception is when the number of outstanding (non-closed) outputs exceeds the decoder's limit, in which case decoding will stall. This limit varies and isn't known in general.
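One way to stay under that limit is to close frames as soon as they are no longer needed. The following is a sketch only, not the samples' actual code: `chooseFrame` is a hypothetical helper that assumes `frameBuffer` is an array of `VideoFrame`-like objects sorted by `timestamp` (in microseconds), each with a `close()` method.

```javascript
// Pick the frame to display at time `nowUs` and close every stale frame
// before it. Returns the chosen frame, or null if none is due yet.
function chooseFrame(frameBuffer, nowUs) {
  let chosen = null;
  while (frameBuffer.length > 0 && frameBuffer[0].timestamp <= nowUs) {
    if (chosen) chosen.close(); // superseded by a later frame that is also due
    chosen = frameBuffer.shift();
  }
  return chosen;
}
```

The caller is still responsible for calling `close()` on the returned frame once it has been drawn; otherwise it keeps counting against the decoder's outstanding outputs.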

tobiasBora commented 7 months ago

Oh, so the sum of outputs and decodeQueueSize is not invariant… interesting. So this is made possible thanks to setTimeout(this.fillFrameBuffer.bind(this), 0); that will basically loop, and add stuff to the queue if things have been consumed without producing outputs. Interesting, thanks!
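A toy simulation can make this concrete. The model below is an assumption for illustration, not how any real decoder works: `ToyDecoder` consumes everything in its queue on each `tick()` but only emits a frame once it holds more than `latency` inputs internally. The latency of 10 is the figure mentioned in this thread, and `simulate` is a hypothetical driver that repeats the sample's fill loop until the first frame appears.

```javascript
// Toy model: inputs leave the queue immediately, outputs appear late.
class ToyDecoder {
  constructor(latency, onOutput) {
    this.latency = latency;
    this.onOutput = onOutput;
    this.queue = []; // submitted but not yet consumed
    this.held = [];  // consumed, waiting inside the decoder
  }
  get decodeQueueSize() { return this.queue.length; }
  decode(chunk) { this.queue.push(chunk); }
  tick() {
    this.held.push(...this.queue);
    this.queue = [];
    while (this.held.length > this.latency) this.onOutput(this.held.shift());
  }
}

// Refill with the sample's loop condition, then let the decoder work,
// until a first frame is produced or the chunks run out.
function simulate(latency, target, totalChunks) {
  const frameBuffer = [];
  const decoder = new ToyDecoder(latency, (frame) => frameBuffer.push(frame));
  let sent = 0;
  let rounds = 0;
  while (frameBuffer.length === 0 && sent < totalChunks) {
    while (frameBuffer.length < target && decoder.decodeQueueSize < target && sent < totalChunks) {
      decoder.decode(sent++);
    }
    decoder.tick(); // stands in for decoder progress between setTimeout callbacks
    rounds++;
  }
  return { sent, rounds, firstFrame: frameBuffer[0] };
}

console.log(simulate(10, 3, 100)); // { sent: 12, rounds: 4, firstFrame: 0 }
```

With a target of 3 and a latency of 10, the loop runs four rounds of three inputs each before the first frame comes out, because each round finds the queue drained again.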

sandersdan commented 7 months ago

w3c / webcodecs

After how many decode should the codec process the frames? #753