w3c / media-capabilities

Media Capabilities API
https://w3c.github.io/media-capabilities/

Add Support for Spatial Audio #120

Closed · vi-dot-cpp closed this issue 5 years ago

vi-dot-cpp commented 5 years ago

Modern-day scenarios, based on data and partner requests we have analyzed, need support for querying spatial audio capabilities. Per the current specification, the only way to query for Dolby Atmos support would be with a specific codec MIME type in AudioConfiguration. However, this is not granular enough – a device may be able to decode Dolby Atmos (since it is compatible with Dolby Digital Plus), but not be able to render spatial audio.

We face similar considerations to those of HDR capabilities, so we propose to expose them in buckets, in a similar fashion to #118:

1. Define SpatialCapability Enum

enum SpatialCapability {
    "DolbyAtmos",
    "DTSX"
};

2. Add SpatialCapability Enum to AudioConfiguration

dictionary AudioConfiguration {
    …
    SpatialCapability spatialCapability;
};
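
For concreteness, a site could feed this through the existing decodingInfo() call roughly as sketched below. This is only an illustration of the proposal above: the spatialCapability member and its values do not exist in the current spec, and the other field values are arbitrary examples.

const result = await navigator.mediaCapabilities.decodingInfo({
    type: "media-source",
    audio: {
        contentType: 'audio/mp4; codecs="ec-3"',
        channels: "6",
        bitrate: 768000,
        samplerate: 48000,
        spatialCapability: "DolbyAtmos"  // proposed member, not in the current spec
    }
});
if (result.supported) {
    // Safe to select the Dolby Atmos rendition.
}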

Team: @scottlow @gurpreetv @isuru-c-p @vi-dot-cpp from Microsoft

chcunningham commented 5 years ago

However, this is not granular enough – a device may be able to decode Dolby Atmos (since it is compatible with Dolby Digital Plus), but not be able to render spatial audio.

Help me understand this bit. If you use the mime type and we say supported=true, you may or may not have all the speakers you need. But if you don't have them, we'll just downmix. Is that not sufficient?

isuru-c-p commented 5 years ago

However, this is not granular enough – a device may be able to decode Dolby Atmos (since it is compatible with Dolby Digital Plus), but not be able to render spatial audio.

Help me understand this bit. If you use the mime type and we say supported=true, you may or may not have all the speakers you need. But if you don't have them, we'll just downmix. Is that not sufficient?

In many cases, for a given title, streaming sites will have alternate audio tracks with differing levels of fidelity (2 ch vs 7.1ch vs Atmos). Being able to determine whether a client supports spatial audio rendering will allow the site to select the audio track which provides the highest possible fidelity for the client, without sacrificing bandwidth for features which the client can't take advantage of.

chcunningham commented 5 years ago

What if sites use MC for the basic decode support question, but separately check the number of available output channels using WebAudio's maxChannelCount?
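
For reference, that check might look roughly like the sketch below; note that it only reflects the channel count of the default audio output device, which is the limitation raised in the next comment.

// Sketch: probe the default output's channel capability via Web Audio.
const audioContext = new AudioContext();
const maxChannels = audioContext.destination.maxChannelCount;  // e.g. 2, 6, or 8
const looksLikeSurroundSetup = maxChannels >= 6;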

isuru-c-p commented 5 years ago

What if sites use MC for the basic decode support question, but separately check the number of available output channels using WebAudio's maxChannelCount?

Unfortunately, maxChannelCount is not sufficient for expressing whether a system supports spatial rendering of Dolby Atmos or DTS:X. For example, Dolby Atmos can be rendered on systems connected to headphones with 2 channels ("Dolby Atmos for Headphones").

chcunningham commented 5 years ago

Understood. Related noob question: if you have 7 channels but it's not really an Atmos system, is it generally still desirable to receive Atmos? I'm not sure how backward compatible this is with traditional surround sound.

In #118, @jernoble wrote:

I don't like adding vendor-specific names to specifications, so I'm hesitant to enshrine "DolbyVision" into Media Capabilities.

We've got a similar challenge here. Is it possible to describe these proprietary technologies in terms of open standards? (Seems unlikely, but I'm not familiar enough to rule it out).

isuru-c-p commented 5 years ago

if you have 7 channels but it's not really an Atmos system, is it generally still desirable to receive Atmos?

Receiving Atmos would be preferable in this case over say a 2-channel stream. However, if the site also has a non-Atmos 7.1 channel stream, that may be preferable.

We've got a similar challenge here. Is it possible to describe these proprietary technologies in terms of open standards? (Seems unlikely, but I'm not familiar enough to rule it out).

Unfortunately, this is not possible in this case, as unlike with HDR, the spatial metadata formats are not standardized.

dalecurtis commented 5 years ago

Generally if a system can decode Atmos/DTS:X you'll always want to serve that stream. So I'm not sure I follow why you need the additional differentiation for rendering. Can you elaborate more on why it's insufficient to just test for decoder presence?

isuru-c-p commented 5 years ago

Generally if a system can decode Atmos/DTS:X you'll always want to serve that stream. So I'm not sure I follow why you need the additional differentiation for rendering. Can you elaborate more on why it's insufficient to just test for decoder presence?

Here's an example scenario:

  1. A site has two alternate audio streams for a piece of content:
    • AAC 2 channels
    • Dolby Digital Plus (E-AC3) with Dolby Atmos
  2. A client has the ability to decode both the AAC and Dolby Digital codecs (and perhaps the client also has a Dolby Digital decoder which is capable of parsing the Atmos side channel data). However, the client does not have the ability to render audio spatially (the client has 2 speaker channels and does not support Dolby Atmos with headphones).

In this scenario, even though the client can decode the Atmos stream, it does not have the ability to take advantage of the extra channels + spatial metadata in the stream. As a result, selecting the Atmos stream would result in higher network bandwidth, without any benefits for the user.
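
To make the gap concrete: with today's codec-level query alone, both renditions come back as supported on such a client, so the site has no signal to prefer the cheaper AAC stream. A sketch using only members in the current spec:

const aacInfo = await navigator.mediaCapabilities.decodingInfo({
    type: "media-source",
    audio: { contentType: 'audio/mp4; codecs="mp4a.40.2"' }
});
const eac3Info = await navigator.mediaCapabilities.decodingInfo({
    type: "media-source",
    audio: { contentType: 'audio/mp4; codecs="ec-3"' }
});
// On the client described above, both report supported === true, even though
// the Atmos metadata in the E-AC3 stream would effectively be wasted bandwidth.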

jernoble commented 5 years ago

I think that's a different argument than the equivalent one for HDR. In the HDR case, serving HDR content to a client whose display (edit:) cannot render it (/edit) will result in a worse experience than if they were served SDR. With Dolby Atmos, the only negative effect of serving a client who can decode it, but can't take advantage of spatialization, would be "higher network bandwidth". Does that meet the criteria for adding a new Web API?

isuru-c-p commented 5 years ago

In the HDR case, serving HDR content to a client whose display (edit:) cannot render it (/edit) will result in a worse experience than if they were served SDR.

Note that this is not necessarily correct - a number of HDR-aware clients without HDR-capable displays will accept HDR video and perform the necessary tone mapping and color space conversion to ensure that it is displayed with a similar quality to SDR content. The major difference when compared to the spatial audio case is that the performance impact (of tone mapping especially) is not negligible.

With Dolby Atmos, the only negative effect of serving a client who can decode it, but can't take advantage of spatialization, would be "higher network bandwidth". Does that meet the criteria for adding a new Web API?

The explainer doc does specifically call out exposing the quality of experience for the user:

Output capabilities: In addition to not addressing successful playback, current APIs give no indication of the quality of the experience that reaches the user. For example, whether a 5.1 audio track will be better than (or even as good as) a stereo audio track, or whether the display supports a given bit depth, color space, or brightness.

ragalvan commented 5 years ago

We face similar considerations to those of HDR capabilities, so we propose to expose them in buckets in similar fashion to #118:

  1. Define SpatialCapability Enum: enum SpatialCapability { "DolbyAtmos", "DTSX" };

  2. Add SpatialCapability Enum to AudioConfiguration: dictionary AudioConfiguration { … SpatialCapability spatialCapability; };

I would recommend making SpatialCapability a Boolean, since the contentType will determine the codec. For example, a contentType of 'video/mp4; codecs="ec-3"' with a spatialCapability of true means Dolby Atmos.
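
A sketch of what a query could look like with that shape, assuming the boolean member ends up named spatialRendering (the name the eventual spec change settled on) and using an illustrative audio/mp4 contentType:

const atmosInfo = await navigator.mediaCapabilities.decodingInfo({
    type: "media-source",
    audio: {
        contentType: 'audio/mp4; codecs="ec-3"',
        // Boolean member instead of a vendor-specific enum: the codec in
        // contentType already implies which spatial format is meant (Atmos for ec-3).
        spatialRendering: true
    }
});
// supported would only be true if the client can both decode E-AC3 and
// render the Atmos presentation spatially.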

isuru-c-p commented 5 years ago

I would recommend making SpatialCapability a Boolean, since the contentType will determine the codec. For example, a contentType of 'video/mp4; codecs="ec-3"' with a spatialCapability of true means Dolby Atmos.

That's a good suggestion, and I think that approach is cleaner, since each spatial audio format (i.e. Dolby Atmos or DTS:X) is only applicable to a specific set of audio codecs. This would also address the feedback around not including vendor-specific names in the API.

The trade-off with that approach, however, is that we will not be able to expose support for multiple spatial audio formats for a specific audio codec (e.g. if a new spatial audio format is introduced for an audio codec which is not backwards compatible with the existing format). I think that is an acceptable trade-off, however (if needed, a new codec MIME type which also specifies the new spatial audio format could be introduced).

I'll update my current PR.

GurpreetV commented 5 years ago

Not exposing support for multiple spatial audio formats for a specific audio codec is acceptable, given we have no foreseeable scenario for it right now. Plus, your suggestion of adding a new codec MIME type if needed is absolutely workable.

mrstux commented 5 years ago

Understood. Related noob question: if you have 7 channels but it's not really an Atmos system, is it generally still desirable to receive Atmos? I'm not sure how backward compatible this is with traditional surround sound.

As a player author dealing with these issues:

Although Atmos will decode when only 'eac3' is supported, because it is backwards compatible, that may not be preferable in practice; it comes down to what subjective quality tests determine.

If a stereo AAC stream, an eac3 7.1 stream, and an Atmos stream are available for selection, but Atmos is being decoded using only the base 7.1 layer, then we need to be able to make a client-side decision to select the eac3 non-Atmos stream instead of the Atmos stream.

In this exact case, we would want to select the Atmos stream if the playback "device" is going to be rendering the audio as Atmos, or if it will be transmitting it to some other downstream receiver which is going to render (or at least potentially render) the audio as Atmos.

And the "potential downstream atmos rendering" issue is actually another problem, ie we may want to handle this case differently too.

isuru-c-p commented 5 years ago

@chcunningham, @jernoble - I've pushed an update to my PR which addresses the concerns around including vendor-specific names in the API (instead of an enum with vendor specific names, a single boolean member has been added to the AudioConfiguration for spatial rendering). Please take a look when you get a chance.

chcunningham commented 5 years ago

Looks pretty good. I left some minor feedback.

GurpreetV commented 5 years ago

@jernoble can you recommend the comment text that you were proposing in our sync up?

GurpreetV commented 5 years ago

@jernoble does the final change look good to you? Waiting on you for the sign off before I ping Chris & Mounir.

jernoble commented 5 years ago

LGTM.

GurpreetV commented 5 years ago

@chcunningham, @mounirlamouri since @jernoble has signed off, can we proceed to get this in if this looks ok to you? If not, do let us know what your feedback is so we can address it.

vi-dot-cpp commented 5 years ago

Thanks everyone. Closing this issue because #123 merged.

GurpreetV commented 5 years ago

thank you very much everyone!!