w3c / media-capabilities

Media Capabilities API
https://w3c.github.io/media-capabilities/

Trimming down the AudioConfiguration #160

Open chcunningham opened 3 years ago

chcunningham commented 3 years ago

Most of the fields in AudioConfiguration were added in the first draft of the spec. But, at least in Chromium, we don't have much use for some of the optional parameters, including channels, bitrate, and samplerate.

In Chromium, if we support the codec profile, we will support decoding for any number of channels and any bitrate or samplerate. The codec definition may itself impose some limits, but we don't need an API to surface those (encodings that exceed those limits, if they exist, would simply be "bad content").

VideoConfiguration has similar fields (bitrate, framerate, etc...) which generally don't make/break support. But these fields are useful when making predictions about playback smoothness. The same can't be said for audio, which is cheap to decode, and always marked smooth and powerEfficient (at least in Chromium).
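For concreteness, these are the fields a site passes to `navigator.mediaCapabilities.decodingInfo()`; a minimal sketch (the content type and numbers are illustrative, not from this issue):

```js
// Sketch: probing decode support for a hypothetical AAC-LC stream.
// All field values are illustrative.
navigator.mediaCapabilities.decodingInfo({
  type: 'file',
  audio: {
    contentType: 'audio/mp4; codecs="mp4a.40.2"',
    channels: '2',     // optional field under discussion
    bitrate: 132000,   // bits per second; optional field under discussion
    samplerate: 48000  // Hz; optional field under discussion
  }
}).then((info) => {
  console.log(info.supported, info.smooth, info.powerEfficient);
});
```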

So, question to other implementers (@jernoble @eric-carlson @jyavenard @padenot @vi-dot-cpp @aboba @jpiesing): do you find the above AudioConfiguration inputs useful? Should we deprecate?

jyavenard commented 3 years ago

samplerate and channels are definitely used.

Samplerate in particular, as on Windows the system AAC decoder only supports a narrow list of sample rates: https://msdn.microsoft.com/en-us/library/windows/desktop/dd742784(v=vs.85).aspx

96 kHz in particular isn't supported and will cause a decoding error, and this is something seen in the wild. Seeing that Chromium uses FFmpeg, I'm not surprised that this isn't something you would care about, since FFmpeg will decode virtually everything.

For channels, I've seen it used with the 255-channel Opus files, where sites would query a high number of channels, with one channel used for a particular audio object. This allowed them to differentiate user agents that support those files from those that don't.
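A sketch of this kind of feature probing (the codec strings and values are assumptions, not from the comment):

```js
// Sketch: probing the samplerate and channels limits described above.
async function probeAudioLimits() {
  // Will 96 kHz AAC decode, or hit the Windows system decoder's limit?
  const aac96k = await navigator.mediaCapabilities.decodingInfo({
    type: 'media-source',
    audio: { contentType: 'audio/mp4; codecs="mp4a.40.2"', samplerate: 96000 }
  });

  // Does this user agent handle high-channel-count Opus
  // (one channel per audio object)?
  const opusMany = await navigator.mediaCapabilities.decodingInfo({
    type: 'media-source',
    audio: { contentType: 'audio/webm; codecs="opus"', channels: '255' }
  });

  return { aac96k: aac96k.supported, opusMany: opusMany.supported };
}
```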

I agree that bitrate is likely unused.

As such, removing support for those two fields would have a negative impact.

Jean-Yves

jpiesing commented 3 years ago

Apologies for something of a ramble .....

Something related to channels is certainly used, but perhaps not exactly this. If some content is available in both stereo and 5.1, a DASH/HLS/... player or app needs to decide which one to use, and this decision needs some information. Many decoders will be able to downmix to some extent, but an expertly done offline stereo mix may give better results.
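(A hypothetical sketch of that decision, combining a decode probe with the output channel count; the E-AC-3 content type and the use of maxChannelCount as a proxy for "can usefully output 5.1" are assumptions, not anything HbbTV or MC specifies:)

```js
// Sketch: choose between a stereo and a 5.1 rendition.
// The content type and the maxChannelCount heuristic are assumptions.
async function pickAudioRendition() {
  const surround = await navigator.mediaCapabilities.decodingInfo({
    type: 'media-source',
    audio: { contentType: 'audio/mp4; codecs="ec-3"', channels: '6' }
  });
  const outputChannels = new AudioContext().destination.maxChannelCount;
  // Prefer the expertly made stereo mix unless 5.1 can be both
  // decoded and physically output.
  return surround.supported && outputChannels >= 6 ? '5.1' : 'stereo';
}
```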

HbbTV 2.0.3 provides a 3-state value for its sort-of-ish equivalent of this.

As well as being a 3-state value, the other difference is that it answers a subtly different question: what can be output, not what can be decoded, because the answer to what can be decoded might well be anything, or almost anything.

What does this mean for MC?

chcunningham commented 3 years ago

> samplerate and channels are definitely used.

Happily noted. Do you assume some default values when these are not provided? We could update the spec to make those defaults explicit.

> I agree that bitrate is likely unused.

Cool. If others agree I'll send a PR deprecating that field.

> HbbTV 2.0.3 provides a 3-state value for its sort-of-ish equivalent of this.

The WebAudio maxChannelCount should work to give the exact number of channels. It doesn't let you say "preferred". Is this for quasi-5.1 sound bars and the like?

> As well as being a 3-state value, the other difference is that it answers a subtly different question: what can be output, not what can be decoded, because the answer to what can be decoded might well be anything, or almost anything.

I like to confine MC to answering questions about decoding support/perf, letting other APIs answer questions about your display and peripherals. Mostly because the "other APIs" tend to already be somewhat defined (e.g. CSSOM Screen). We let a little rendering sneak in with the spatialRendering attribute. Regrettably I don't think we considered whether that might be more at home in WebAudio, next to channels. (Aside: @jernoble @isuru-c-p - did either of you ship that yet?)

@padenot @hoch FYI

chrisn commented 3 years ago

Following from @jpiesing's feedback, how is spatial audio (e.g., 5.1) rendering handled in browsers today? Presumably checking the AudioContext destination channels will indicate whether downmixing will occur? Could the Audio Output Devices API be used to select between a device's stereo and surround outputs?
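(For reference, a sketch of both checks; the surround-device lookup is an illustrative heuristic, not a real API:)

```js
// Sketch: the two checks mentioned above.
// 1. Will the default output downmix 5.1 content?
const ctx = new AudioContext();
const willDownmix = ctx.destination.maxChannelCount < 6;

// 2. Route a media element to a specific output via the
//    Audio Output Devices API. Device labels may be empty
//    until the user grants media permissions.
async function routeToSurround(mediaElement) {
  const devices = await navigator.mediaDevices.enumerateDevices();
  const outputs = devices.filter((d) => d.kind === 'audiooutput');
  // Matching on the label is a placeholder heuristic.
  const surround = outputs.find((d) => /surround/i.test(d.label));
  if (surround) await mediaElement.setSinkId(surround.deviceId);
}
```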

The phrasing in the MC API spec for spatialRendering seems strange - it suggests applying constraints on the implementation ("SHOULD be rendered spatially"), rather than being descriptive (as stated later: "When true, the user agent SHOULD only report this configuration as supported if it can support spatial rendering ...").

jpiesing commented 3 years ago

> samplerate and channels are definitely used.

> Happily noted. Do you assume some default values when these are not provided? We could update the spec to make those defaults explicit.

> I agree that bitrate is likely unused.

Thinking about it a little more, some implementations might have a maximum bitrate - or at least a maximum bitrate they've been tested at.

> Cool. If others agree I'll send a PR deprecating that field.

> HbbTV 2.0.3 provides a 3-state value for its sort-of-ish equivalent of this.

> The WebAudio maxChannelCount should work to give the exact number of channels. It doesn't let you say "preferred". Is this for quasi-5.1 sound bars and the like?

I believe there might be a progression of

I have no idea if all of these exist in the real world. I'm not an audio expert. Hopefully I've made a sufficiently serious mistake in the above analysis that an audio expert will step in :)

> As well as being a 3-state value, the other difference is that it answers a subtly different question: what can be output, not what can be decoded, because the answer to what can be decoded might well be anything, or almost anything.

> I like to confine MC to answering questions about decoding support/perf, letting other APIs answer questions about your display and peripherals.

If it really is the case that (almost) any modern audio library can do some version of a downmix to stereo, then the question that needs to be answered is more than just "can audio with a particular set of properties be decoded".

> Mostly because the "other APIs" tend to already be somewhat defined (e.g. CSSOM Screen). We let a little rendering sneak in with the spatialRendering attribute. Regrettably I don't think we considered whether that might be more at home in WebAudio, next to channels. (Aside: @jernoble @isuru-c-p - did either of you ship that yet?)

Is an update to WebAudio in any group's charter? If not, then putting a "somebody else's problem" label on this issue won't help people who just want to know whether delivering 5.1 or stereo to a particular consumer will give the better user experience.

> @padenot @hoch FYI

jpiesing commented 3 years ago

@johnsim This is the issue we were discussing.

chcunningham commented 3 years ago

Sorry for the delay.

> how is spatial audio (e.g., 5.1) rendering handled in browsers today?

Chromium supports 5.1 and even higher. But note that this is not "spatial" as intended by the spec. Spatial refers to modern object-based surround tech like DTS:X or Dolby Atmos. For these, channel count alone was insufficient (they can run on top of a number of different channel counts).

Chromium currently does not support the codecs used in spatial rendering (e.g. EAC3-JOC). The Chromecast build of chromium does support passthrough of those codecs to audio sinks.
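(For completeness, this is what a spatial query looks like with the existing spatialRendering member; the E-AC-3 content type is an example:)

```js
// Sketch: probing spatial (object-based) audio support.
navigator.mediaCapabilities.decodingInfo({
  type: 'media-source',
  audio: {
    contentType: 'audio/mp4; codecs="ec-3"', // e.g. E-AC-3 with JOC
    spatialRendering: true
  }
}).then(({ supported }) => {
  console.log('spatial E-AC-3 supported:', supported);
});
```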

> Presumably checking the AudioContext destination channels will indicate whether downmixing will occur?

Yep, that should work. Note that Chrome makes its mixing decisions when the stream is first loaded. We try to avoid early downmixing, but there are edge cases where you can get stuck if you plug in your hardware after starting the stream.

> Could the Audio Output Devices API be used to select between a device's stereo and surround outputs?

I'm embarrassed to admit that I am only just now aware of that API. This breaks my model that the web assumes you have just one output device. If you can actually have N devices and switch between them on a per-element basis, the MC API is a bit weird for ignoring that. In practice, you can implement the API to just return the max capability of all your devices. @jernoble is that what Safari did? In hindsight, putting spatial capabilities on the device infos from that API seems cleaner. I also like that it lets you know when devices change.

> then the question that needs to be answered is more than just "can audio with a particular set of properties be decoded".

I don't intend to dodge the bigger question. I'm suggesting it may already be answered by another means.

> Is an update to WebAudio in any group's charter?

I don't think a recharter is needed to amend that spec. The editor's draft of WebAudio is still regularly updated.

chcunningham commented 3 years ago

re: rendering capabilities, @johnsim hosted a call with a few folks on this thread. I've made a doc to tee up discussion about possible API shapes. This starts by simply using channels from ISO 23091-3 and a spatialRendering attribute. On the call I heard a mix of opinions about the usefulness of channels, etc. I'm hoping the audio experts will weigh in with suggestions. If channels is insufficient, we should try to define some new primitives without standardizing the use of any particular proprietary audio tech.

The doc has public comment/suggestion access. Send me a note if you'd like edit access. https://docs.google.com/document/d/1to7llKOyNZxirnpCahsslKnUazQZnT2mEVfRaWbCss0/edit#

padenot commented 3 years ago

> Is an update to WebAudio in any group's charter? If not, then putting a "somebody else's problem" label on this issue won't help people who just want to know whether delivering 5.1 or stereo to a particular consumer will give the better user experience.

The Audio WG has rechartered, and we welcome new issues in this tracker: https://github.com/WebAudio/web-audio-api-v2/issues. Happy to discuss there. We've discussed the problem at length in the past, but a lot of other issues were of higher importance (in the sense that what was shipping in browsers was not even really specified, and that took precedence).

https://github.com/WebAudio/web-audio-api/issues/1089 has some context, and the rationale for the current setup. But in particular, note https://github.com/WebAudio/web-audio-api/issues/1089#issuecomment-268025413, and suggestions to extend the current state, from domain experts here: https://github.com/WebAudio/web-audio-api/issues/1089#issuecomment-287922213. Unfortunately, some links are now 404. It is somewhat possible to know what the default audio output device supports today, via code like this:

```js
const ac = new AudioContext();
console.log(ac.destination.maxChannelCount); // returns 6 for a 5.1 setup
```

Normative references:

Script always sees the ordering specified in [0], regardless of the codec/container; the channels are remapped before being exposed, so that authors can process the audio without caring about the codec or the audio output device. This is a very cheap operation because the data is planar (essentially shuffling a few pointers). When the data reaches the AudioDestinationNode, the implementation takes care to output in the format the audio output device expects. This is also very cheap, because it happens at the same time as the audio is interleaved.
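For example (a minimal sketch, assuming a 5.1 stream and the canonical Web Audio channel ordering, where the LFE is channel index 3; the `<video>` element is illustrative):

```js
// Sketch: because script always sees the canonical ordering,
// the LFE of a 5.1 stream is output index 3 regardless of
// codec/container.
const ac = new AudioContext();
const source = ac.createMediaElementSource(document.querySelector('video'));
const splitter = ac.createChannelSplitter(6); // one mono output per channel
source.connect(splitter);
splitter.connect(ac.destination, 3); // tap only the LFE channel
```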

The proposal from @chcunningham is, to me, vastly superior in all respects to the approach we currently have (which is just a number -> fixed-layout mapping). However, referencing non-freely-available documents from freely available W3C specifications is a bit annoying. This was also discussed in the links above; we can probably consult with @svgeesus, who happens to be the W3C contact for the Audio Working Group and knows the options we have here.

svgeesus commented 3 years ago

> Is an update to WebAudio in any group's charter?

Yes, the Audio WG (current charter).

svgeesus commented 3 years ago

> However, referencing non-freely-available documents from freely available W3C specifications is a bit annoying.

It is, and we prefer to avoid it where possible. But in some cases we do end up with a paywalled reference as the normative one. What we do in some cases is to add informative material such that developers without deep pockets are not at a disadvantage in terms of implementation.

chrisn commented 8 months ago

@jyavenard said:

> samplerate and channels are definitely used.

> I agree that bitrate is likely unused.

@chcunningham said:

> Cool. If others agree I'll send a PR deprecating that field.

@jpiesing said:

> some implementations might have a maximum bitrate - or at least a maximum bitrate they've been tested at

So on that basis I suggest that we don't deprecate channels, bitrate, or samplerate.

The main question remaining is rendering (as opposed to decoding) capabilities:

https://docs.google.com/document/d/1to7llKOyNZxirnpCahsslKnUazQZnT2mEVfRaWbCss0/edit

chrisn commented 8 months ago

Related issue #206.