w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.
https://w3c.github.io/webcodecs/

Configuring VideoDecoder scalability #399

Closed chcunningham closed 2 years ago

chcunningham commented 2 years ago

Presently we have a scalabilityMode knob in the VideoEncoderConfig, but nothing to configure scalability in VideoDecoderConfig. At this point I'm not sure if we actually need anything for VideoDecoderConfig... let's explore.

Note: I'm well aware that decode support for SVC is quirky. If we can, I'd like to set that aside for a moment and focus on what configuration knobs we should provide for decoders that do support SVC.

My naive model has been that decoding SVC should require no special configuration. If you want to decode a frame from an enhancement layer, your only obligation would be to first decode that frame's dependencies. You should expect multiple outputs for a given temporal unit from the different layers; simply drop (close()) the outputs you don't want. As enhancement layers come and go (e.g. as bandwidth or sender CPU fluctuates), you wouldn't need to reconfigure the decoder. If packet loss breaks the dependency chain for an enhancement layer, simply stop decoding that layer until some future time when you again have all the dependencies.
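To make the "stop decoding a broken layer until its dependencies recover" bookkeeping concrete, here is a non-normative sketch. It assumes a simplified structure (not from the spec) where a frame at temporal layer T depends on intact chains for layers 0..T, and a keyframe resets every chain; real dependency structures (e.g. the `_KEY` modes) are more involved.

```javascript
// Illustrative only: per-layer dependency-chain tracking an app might do
// before feeding chunks to a VideoDecoder. Not part of the WebCodecs API.
function makeLayerTracker(numLayers) {
  // intact[t] is true while the dependency chain for layer t is unbroken.
  const intact = new Array(numLayers).fill(true);
  return {
    // A lost frame at `layer` breaks that layer and everything above it.
    onLoss(layer) {
      for (let t = layer; t < numLayers; t++) intact[t] = false;
    },
    // Safe to decode a chunk only if all layers it depends on are intact.
    canDecode(layer) {
      for (let t = 0; t <= layer; t++) {
        if (!intact[t]) return false;
      }
      return true;
    },
    // A keyframe restarts all dependency chains.
    onKeyframe() {
      intact.fill(true);
    },
  };
}
```

With this model the decoder itself never needs reconfiguring; the app just withholds chunks whose dependencies are broken and resumes at the next recovery point.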

I'm recently aware of a few knobs in underlying codec APIs that raise suspicion that this model may be missing something.

For those listed, the semantics are pretty similar. Going back to the packet loss example, I guess users would adjust the configured output layer down / up as needed.

I have a number of questions for library experts:

  1. What other SVC decoding knobs exist?
  2. For those listed, am I right that we always have the option to say "decode all layers"?
  3. If the answer to ^2 is "yes", this leaves the door open for us to implement the naive model I outlined at the start. Are there reasons we shouldn't do this (e.g. performance?) and instead require (or maybe just allow?) users to set a layer filter?

Adding a few folks I know. Please feel free to loop anyone. @DanilChapovalov @jzern @marco99zz @aboba @mattrwoz

chcunningham commented 2 years ago

@mhoro

jzern commented 2 years ago

@jeromejj

marco99zz commented 2 years ago

Yes, you're correct that decoding SVC requires no special configuration. There are no other SVC decoding knobs, and the only one for libvpx (the VP9_DECODE_SVC_SPATIAL_LAYER control you mentioned) was only used for offline/standalone testing. In the RTC application (WebRTC) that decoder control is not used, so yes, the decoder will always decode all layers.
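The "decode all layers, filter in the app" pattern this implies might look like the following sketch. Note that WebCodecs does not tag outputs with a layer id, so this assumes the app tracks layer ids out-of-band (e.g. from RTP metadata) and matches them to outputs by timestamp; `layerOf` and `render` are hypothetical app-provided callbacks, not API surface.

```javascript
// Illustrative output callback for a VideoDecoder: keep frames up to a
// target spatial layer, close() the rest. The caller is responsible for
// eventually closing frames it renders.
function makeOutputFilter(targetLayer, layerOf, render) {
  return (frame) => {
    if (layerOf(frame.timestamp) <= targetLayer) {
      render(frame);   // keep: base layer up through the target layer
    } else {
      frame.close();   // drop: enhancement layers we don't want
    }
  };
}
```

Dropping unwanted outputs promptly via close() matters here, since decoded frames hold onto scarce decoder resources.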


chcunningham commented 2 years ago

Thanks y'all.

One new concern to discuss. IIUC, AV1 has OBUs that describe the SVC layering in-band (5.8.5 Metadata scalability syntax, 5.8.6 Scalability structure syntax). Is there any potential for decoding errors if a receiver gets only a subset of the initial layers while still having in-band SVC metadata that describes the original layering? @aboba recalls some challenges around this in Chrome's AV1 WebRTC implementation that motivated using a decoder layer filtering API. @DanilChapovalov may recall the details.

DanilChapovalov commented 2 years ago

The metadata scalability structure OBU shouldn't be a problem. Section 6.7.5 of the AV1 spec notes: "The scalability metadata OBU is intended for use by intermediate processing entities that may perform selective layer elimination" and "If the received bitstream has been modified by an intermediate processing entity, then some of the layers and/or individual frames may be absent from the bitstream". I read this as saying such a scenario is normal and must be supported by the decoder.

However, I recall another issue in AV1: the choose_operating_point function described in section 6.4.1. It assumes all operating points in the bitstream are usable, which might not be the case when an SFM filters some layers, and might be an issue for _KEY kinds of structures. I remember we discussed that issue during the AV1 RTP spec discussion, but I can't find the conclusion. However, I see it more as a theoretical AV1 spec issue than a generic decoder issue or implementation issue. In particular, I don't recall hitting it while implementing AV1 SVC in WebRTC.

chcunningham commented 2 years ago

Thanks @DanilChapovalov. @aboba LMK if this sufficiently addresses your concerns. Sounds like no action needed for WC.

aboba commented 2 years ago

Agree that no action appears needed with respect to decoder SVC configuration. Can we close this?