w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

What's the best way to ensure 1-in 1-out decoding for h264 video? #732

Closed: snosenzo closed this issue 5 months ago

snosenzo commented 8 months ago

We have an application that takes individual user-provided H.264 bitstream frames and wants to decode them as they come in, ideally in a 1-in-1-out way. Our priorities are similar to those specified in this issue, with the notable differences that we are taking user-provided raw frames and they're in H.264 (though H.265 might be supported down the road). Our goals are low latency and wide support for user-provided data. One caveat on the latter: we've made it clear to users that we don't support B-frames or non-Annex-B data. However, most users are using whatever frames come out of their cameras, with little opportunity for intervention, so it may be very difficult for them to change the settings with which these streams are encoded.

I've tested a few examples of this data against the VideoDecoder and found very inconsistent behavior when it comes to low-latency 1-in-1-out decoding. One example, which uses the BASELINE profile, runs perfectly: it decodes 1-in-1-out and does exactly what we expect. Two other examples use the MAIN profile and do not have 1-in-1-out behavior: these streams require us to fill the decode queue with 4 frames after each keyframe before the first frame passed in is returned. Then, after passing the next keyframe, the 4 frames left in the queue are dumped and we have to fill it up again. We have optimizeForLatency set to true and no preference set for hardwareAcceleration. I've attached the SPS JSON that we parsed for these examples and can provide additional data if need be. sps-BASELINE.json sps-MAIN-from-ios.json sps-MAIN.json
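
A minimal sketch (not from the thread) of how this kind of behavior can be measured, assuming an array of `EncodedVideoChunk`s that starts with a key frame; the codec string is a placeholder:

```ts
// Sketch: count how many chunks go in before each frame comes out.
// 1-in-1-out decoding means the first frame emerges after 1 chunk.
// `chunks` must begin with a key frame; Annex B data needs no `description`.
async function probeReorderDelay(chunks: EncodedVideoChunk[]): Promise<void> {
  let submitted = 0;
  let emitted = 0;

  const decoder = new VideoDecoder({
    output: (frame) => {
      emitted++;
      console.log(`frame ${emitted} out after ${submitted} chunks in`);
      frame.close();
    },
    error: (e) => console.error(e),
  });

  decoder.configure({
    codec: "avc1.4d0028", // placeholder MAIN-profile codec string
    optimizeForLatency: true,
  });

  for (const chunk of chunks) {
    submitted++;
    decoder.decode(chunk);
  }
  await decoder.flush(); // drain whatever is still buffered
  decoder.close();
}
```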

I have found that setting hardwareAcceleration: "prefer-software" on the decoderConfig removes this delay, but it's a suboptimal solution: we would ideally like to use hardware acceleration where possible, it limits decoding support in our application on platforms that don't have a software decoder, and I don't think there are guarantees that prefer-software will always behave this way.
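
Concretely, the workaround described here is a single extra member on the config passed to configure(); a sketch reusing the decoder from the block above (the codec string remains a placeholder, and per the WebCodecs spec hardwareAcceleration is only a hint):

```ts
decoder.configure({
  codec: "avc1.4d0028",                    // placeholder
  optimizeForLatency: true,
  hardwareAcceleration: "prefer-software", // a hint, not a guarantee
});
```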

I also saw this Chrome bug and thought that we might try overwriting constraint_set3_flag=1 in the stream to achieve 1-in-1-out behavior, but the profile_idc for these examples is 77, so this does not make the underlying frame buffer queue 0. I'm also not sure how well supported it is to manipulate the bitstream SPS directly.

So I'm looking for some help in answering these questions:

  • Is there anything I can do from the decoder perspective to remove the latency caused by the queue, outside of setting prefer-software? Or is prefer-software the only way to do this?
  • If not, what would be the minimum requirements we could put on users who are encoding these frames such that they always decode in real time from the VideoDecoder?
  • Is something like this possible to make consistent across VideoDecoders on all platforms (Linux, macOS, and Windows)?

Thanks so much for reading and would really appreciate any help I can get on this.

dalecurtis commented 8 months ago

@sandersdan as our H264 expert.

I don't think it's possible to guarantee such support w/o control of the bitstream. If you don't support B-frames, you might be able to inject/rewrite the VUI field for max_num_reorder_frames to zero, but I'd defer to the experts.

Djuffin commented 8 months ago

> you might be able to inject/rewrite the VUI field for max_num_reorder_frames to zero

That's what WebRTC does:

https://webrtc.googlesource.com/src/+/refs/heads/main/common_video/h264/sps_vui_rewriter.cc#400
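
For anyone attempting such a port, a natural place to start is an Exp-Golomb bit reader, since that's how the SPS/VUI fields are coded. A minimal sketch (illustrative names; real SPS parsing must first strip the emulation-prevention bytes from the RBSP, and a rewriter additionally needs a matching bit writer):

```ts
// Minimal MSB-first bit reader with unsigned Exp-Golomb ("ue(v)")
// decoding, the primitive most SPS/VUI fields are coded with.
// Assumes `bytes` is the RBSP with emulation-prevention bytes removed.
class BitReader {
  private bitPos = 0;
  constructor(private bytes: Uint8Array) {}

  readBit(): number {
    if (this.bitPos >= this.bytes.length * 8) throw new Error("out of bits");
    const byte = this.bytes[this.bitPos >> 3];
    const bit = (byte >> (7 - (this.bitPos & 7))) & 1;
    this.bitPos++;
    return bit;
  }

  readBits(n: number): number {
    let v = 0;
    for (let i = 0; i < n; i++) v = (v << 1) | this.readBit();
    return v;
  }

  // ue(v): count leading zero bits k, then codeNum = 2^k - 1 + next k bits.
  readUE(): number {
    let zeros = 0;
    while (this.readBit() === 0) zeros++;
    return (1 << zeros) - 1 + this.readBits(zeros);
  }
}

// usage: in an SPS RBSP, the first fixed-length field is profile_idc
// const profile_idc = new BitReader(spsRbsp).readBits(8);
```

The sps_vui_rewriter.cc linked above shows the full chain of SPS fields WebRTC parses to reach (or insert) the VUI before rewriting max_num_reorder_frames.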

padenot commented 8 months ago
> Is there anything I can do from the decoder perspective to remove the latency caused by the queue, outside of setting prefer-software? Or is prefer-software the only way to do this?

This raises an interesting question: if a developer sets both prefer-hardware and optimizeForLatency, and the UA can only do either hardware decoding or low-latency decoding (in software), which takes precedence? Currently, as noted, the only way for authors to know is to try. I don't think UAs can guarantee that all decoders, especially hardware decoders, can do low latency.
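
For what it's worth, VideoDecoder.isConfigSupported() lets authors probe both preferences up front, but it only reports whether a configuration is supported at all, not whether it will be low-latency, so "try and see" still applies. A sketch with a placeholder codec string:

```ts
const base = {
  codec: "avc1.4d0028", // placeholder
  optimizeForLatency: true,
};

// Probe both preferences; `supported` says nothing about latency.
const hw = await VideoDecoder.isConfigSupported({
  ...base,
  hardwareAcceleration: "prefer-hardware",
});
const sw = await VideoDecoder.isConfigSupported({
  ...base,
  hardwareAcceleration: "prefer-software",
});
console.log("hw:", hw.supported, "sw:", sw.supported);
```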

> If not, what would be the minimum requirements we could put on users who are encoding these frames such that they always decode in real time from the VideoDecoder?

I think there are lots of moving parts here. I think we can find a configuration where no low-latency decoders are available, regardless of the bitstream (short of doing something unrealistic), so "always" might be a bit strong. "Frequently" might be more realistic.

> Is something like this possible to make consistent across VideoDecoders on all platforms (Linux, macOS, and Windows)?

This depends on the OS, OS version, hardware (CPU/GPU/SoC), OEM, codec, browser, and browser version. 100% consistency is going to be hard, and it's an implementation concern anyway, so somewhat off-topic, but I'd say that implementations will strive to achieve the desired behavior here, granted that it's clear what to do when faced with the problem outlined in my answer to your first question.

snosenzo commented 8 months ago

Thanks so much for all the info! Given all of it, I think the best thing for us to do is to make our system more resilient to decoders with latency, and after that to work on porting the WebRTC VUI-rewriting code to TypeScript.

> I think there are lots of moving parts here. I think we can find a configuration where no low-latency decoders are available, regardless of the bitstream (short of doing something unrealistic), so "always" might be a bit strong. "Frequently" might be more realistic.

@padenot Are there general guidelines written somewhere on what to configure when encoding an H.264 bitstream that would give a best effort at low-latency decoding?

It would be nice to have some kind of API to determine whether a low-latency software or hardware decoder is available, so we could choose based on that rather than trying each one and seeing what works.

padenot commented 8 months ago

> @padenot Are there general guidelines written somewhere on what to configure when encoding an H.264 bitstream that would give a best effort at low-latency decoding?

I'm not the best person to answer this, sorry.

We might be able to extend MediaCapabilities to allow querying for support of low-latency encoding and decoding; it already exposes similar information about power efficiency.
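
For context, MediaCapabilities today reports smoothness and power efficiency but nothing about latency; the lowLatency field in the sketch below is purely hypothetical:

```ts
// What MediaCapabilities can answer today (content type is a placeholder).
const info = await navigator.mediaCapabilities.decodingInfo({
  type: "media-source",
  video: {
    contentType: 'video/mp4; codecs="avc1.4d0028"', // placeholder
    width: 1280,
    height: 720,
    bitrate: 2_000_000,
    framerate: 30,
  },
});
console.log(info.supported, info.smooth, info.powerEfficient);
// Hypothetical extension, not in any spec today: info.lowLatency
```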

jyavenard commented 8 months ago

> @padenot Are there general guidelines written somewhere on what to configure when encoding an H.264 bitstream that would give a best effort at low-latency decoding?

Latency is strongly determined by the decoder's abilities and how it has been configured at startup.

The WMF (Windows) decoder has a default latency of about 25+ frames, and even when configured for low latency it will still be around 8 frames on Windows 8 and 10.

FFmpeg, if set up to use n threads for decoding, will have a latency of n frames.

How the videos were encoded has zero effect on the decoder-specific behaviour above.

sandersdan commented 8 months ago

Outside of decoder implementation limits, there are a few properties of the bitstream that can affect latency.

There are two steps in decoding H.264: the first produces decoded frames in decode order, and then the decoded frames sit in a buffer to be output in presentation order. The default size of the buffer is large (about 16 frames), but it can be reduced in a few ways, for example by signaling a small max_num_reorder_frames in the VUI, or by using a profile without B-frames, such as (Constrained) Baseline, that rules out reordering.

Chrome's hardware decoders handle reordering themselves and can in most cases reach the limit of max_num_reorder_frames, so disabling B-frame encoding is usually enough to get 1-in-1-out behavior. Chrome's software decoder is FFmpeg, whose threading is configured based on the optimizeForLatency flag.
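
For reference on the "about 16 frames" figure: the H.264 spec (section A.3.1) derives the maximum DPB size from the level's MaxDpbMbs limit, capped at 16 frames. A sketch of that computation with a deliberately partial level table:

```ts
// MaxDpbFrames = min(floor(MaxDpbMbs / (picWidthInMbs * frameHeightInMbs)), 16)
// MaxDpbMbs values from Table A-1 of the H.264 spec (partial table).
const MAX_DPB_MBS: Record<string, number> = {
  "3.0": 8100,
  "3.1": 18000,
  "4.0": 32768,
  "4.1": 32768,
  "5.0": 110400,
  "5.1": 184320,
};

function maxDpbFrames(level: string, width: number, height: number): number {
  const mbs = Math.ceil(width / 16) * Math.ceil(height / 16); // macroblocks
  return Math.min(Math.floor(MAX_DPB_MBS[level] / mbs), 16);
}

console.log(maxDpbFrames("4.0", 1920, 1080)); // 4 frames for 1080p at level 4.0
```

A stream that doesn't signal max_num_reorder_frames forces the decoder to assume this worst case, which is why the VUI rewrite discussed earlier in the thread helps.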