Closed padenot closed 3 years ago
Agreed, AudioBuffer is limiting. AudioFrame was designed to minimize dependence on it (the AudioBuffer only needs to be created when accessed), so we can add new ways to use the audio data in a backwards-compatible way.
We don't have a buffer pool API to hook this into, but there are other parts of WebCodecs that could benefit from such a thing, and I expect we'll work on that in v2.
Would you prefer it was changed to a getter method rather than an attribute?
As discussed with @chcunningham, this is expected to be more important and useful for authors than using AudioBuffer, and this would be in addition to AudioBuffer, not instead of it, modulo the design point mentioned in https://github.com/WebAudio/web-audio-api-v2/issues/119#issuecomment-808330658.
This would remove the dependency on Web Audio API V2 and allow this spec to be self-contained. I'd be open to an implementation shipping with this and not the AudioBuffer variant, provided the API allows for feature detection.
The general idea is to pass ArrayBuffers for input and output, detaching the buffers on input, and then giving both buffers back on output, to allow authors to implement their own recycling if they want. This is not a standard pattern (yet), but it should be, since it's the closest thing to having good memory-access patterns until there is read-only memory.
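A minimal sketch of this recycling pattern in plain JavaScript, assuming a hypothetical `processDetaching` step standing in for the decoder (WebCodecs exposes no such API today); the detach is simulated here with a `structuredClone` transfer:

```javascript
// Hypothetical sketch: take ownership of the caller's buffer by
// detaching it, do some work, then hand the storage back so the
// caller can recycle it instead of reallocating for the next call.
function processDetaching(inputBuffer) {
  // After this, the caller's reference is detached (byteLength === 0),
  // so nobody can race on the contents while we work on them.
  const owned = structuredClone(inputBuffer, { transfer: [inputBuffer] });

  // ...do some work on `owned` (here: fill with a constant)...
  new Uint8Array(owned).fill(42);

  // Give the buffer back so the caller can reuse it.
  return { output: owned };
}

const buf = new ArrayBuffer(16);
const { output } = processDetaching(buf);
console.log(buf.byteLength);    // 0: the original reference is detached
console.log(output.byteLength); // 16: same storage, new owner
```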
We might be able to do real zero-copy for WASM on the output if we do it correctly, but it means the API will have to diverge even more from what is usually done (essentially passing in the Memory object, an offset, and a length).
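A rough sketch of what the (Memory, offset, length) shape could look like from the JS side; `WebAssembly.Memory` is real, but the idea of a decoder writing into it is purely hypothetical here:

```javascript
// Hypothetical: a decoder writes samples straight into WASM linear
// memory at a caller-chosen offset, and the WASM module reads them
// with no copy on the JS side.
const memory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB
const offsetBytes = 1024; // where the caller wants the samples
const frames = 128;       // sample-frames to "decode"

// What a decoder-side write into caller memory could look like:
const out = new Float32Array(memory.buffer, offsetBytes, frames);
for (let i = 0; i < frames; i++) out[i] = Math.sin(i / frames);

// The WASM module (or any other view) sees the same storage:
const view = new Float32Array(memory.buffer, offsetBytes, frames);
console.log(view[1] === out[1]); // true, no copy happened
```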
This issue has some overlap with a parallel discussion in https://github.com/w3c/webcodecs/pull/162#discussion_r622261962
TL;DR - AudioBuffer forces some snapshotting behavior ("acquire the content") which is not super intuitive.
What if we do something like this:
interface AudioFrame {
  constructor(AudioFrameInit init);

  readonly attribute long long timestamp;  // microseconds

  // Deprecated. Will make a lazy copy.
  readonly attribute AudioBuffer buffer;

  readonly attribute unsigned long length;  // in sample-frames
  readonly attribute double duration;       // in seconds
  readonly attribute float sampleRate;      // in sample-frames per second

  // Channel access
  readonly attribute unsigned long numberOfChannels;
  void copyFromChannel(Float32Array destination,
                       unsigned long channelNumber,
                       optional unsigned long bufferOffset = 0);

  // Creates a new AudioFrame, which needs to be independently closed.
  AudioFrame clone();

  void close();
};
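To illustrate the intended semantics, here is a plain JavaScript mock of the proposal above (`MockAudioFrame` is made up for illustration; a real implementation would be backed by decoder-owned media, not JS-side copies):

```javascript
// Mock of the proposed interface, showing the intended
// copyFromChannel / clone / close semantics. Not a polyfill.
class MockAudioFrame {
  constructor({ timestamp, sampleRate, channels }) {
    this.timestamp = timestamp;   // microseconds
    this.sampleRate = sampleRate; // sample-frames per second
    this._channels = channels;    // Array of Float32Array, planar
    this._closed = false;
  }
  get numberOfChannels() { return this._channels.length; }
  get length() { return this._closed ? 0 : this._channels[0].length; }
  get duration() { return this.length / this.sampleRate; }
  copyFromChannel(destination, channelNumber, bufferOffset = 0) {
    if (this._closed) throw new Error('InvalidStateError: frame is closed');
    destination.set(this._channels[channelNumber]
      .subarray(bufferOffset, bufferOffset + destination.length));
  }
  clone() {
    // A clone must be closed independently of the source frame.
    return new MockAudioFrame({
      timestamp: this.timestamp,
      sampleRate: this.sampleRate,
      channels: this._channels.map((c) => c.slice()),
    });
  }
  close() { this._closed = true; this._channels = []; }
}

const frame = new MockAudioFrame({
  timestamp: 0, sampleRate: 48000,
  channels: [Float32Array.of(0.1, 0.2, 0.3)],
});
const dest = new Float32Array(2);
frame.copyFromChannel(dest, 0, 1); // dest now holds [0.2, 0.3] (as float32)
frame.close();
// frame.copyFromChannel(dest, 0) would now throw
```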
Basically, hoist up the immutable AudioBuffer methods and deprecate AudioBuffer itself. This fixes some of our issues while saving the matter of transferring a BYOB for later.
The above needs sampleFormat; at first just planar float32 (we can bikeshed on the name).
discussed whether to bake "planar" or "interleaved" into a single enum that simultaneously describes the format (float32)... no strong feeling. I lean toward a combined enum (for rough symmetry with VideoFrame's pixel format)
linear vs. alaw/ulaw: resolved: let's only have linear in "frames". Treat alaw/ulaw as "encoded"; they would be acceptable in chunks (we'll decode to linear)
re: sample format, I propose we borrow the naming conventions used in ffmpeg, but drop the AV_SAMPLE_FMT prefix (analogous to the "I420" pixel format used in VideoFrame).
Something like this could be a good starting list:
U8,   ///< unsigned 8 bits
S16,  ///< signed 16 bits
S32,  ///< signed 32 bits
FLT,  ///< float
S16P, ///< signed 16 bits, planar
S32P, ///< signed 32 bits, planar
FLTP, ///< float, planar
These are the formats Chromium seems to recognize from ffmpeg. LMK if this is missing something obvious or if one of these is really rare. My first PR will just add FLTP to keep it simple (matching what we already have with AudioBuffer).
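For illustration, converting between two of the proposed formats, interleaved S16 in and FLTP out, is a straightforward loop (ordinary JavaScript, nothing WebCodecs-specific; the function name is made up):

```javascript
// De-interleave signed 16-bit samples into one Float32Array per
// channel, scaling [-32768, 32767] into [-1.0, 1.0).
function s16InterleavedToFltp(s16, numberOfChannels) {
  const frames = s16.length / numberOfChannels;
  const planes = [];
  for (let c = 0; c < numberOfChannels; c++) {
    const plane = new Float32Array(frames);
    for (let i = 0; i < frames; i++) {
      plane[i] = s16[i * numberOfChannels + c] / 32768;
    }
    planes.push(plane);
  }
  return planes;
}

// Stereo: L0 R0 L1 R1 becomes [L0, L1] and [R0, R1]:
const [left, right] = s16InterleavedToFltp(
  Int16Array.of(-32768, 0, 16384, 32767), 2);
console.log(left[0], left[1]);  // -1 0.5
console.log(right[0]);          // 0
```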
I was going to ask whether S32 was in fact used to represent 24-bit integer audio samples (which are quite common), but your link answers that (if we're doing it the same way ffmpeg does, which I think is sensible, for the same reasons: there is no 24-bit type in WASM either, and nobody wants to do the computations by hand).
resolved: lets only have linear in "frames". treat alaw/ulaw as "encoded" - would be acceptable in chunks (we'll decode to linear)
to clarify with an example: ulaw would decode to 16-bit integers so that the dynamic range is not lost.
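For reference, the standard G.711 µ-law expansion shows why 16 bits are needed: a single µ-law byte covers roughly 14 bits of dynamic range, so 16-bit integers are the smallest common linear format that preserves it. This is the textbook decode, nothing WebCodecs-specific:

```javascript
// Standard G.711 µ-law expansion: one encoded byte to a signed
// 16-bit linear sample.
function ulawByteToS16(byte) {
  const u = ~byte & 0xff; // µ-law bytes are stored complemented
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -sample : sample;
}

console.log(ulawByteToS16(0xff)); // 0 (digital silence)
console.log(ulawByteToS16(0x80)); // 32124, the positive full-scale value
```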
Small comment on the proposed interface above:
Once buffer is removed, there will be no way to check if a frame is closed from JS. There should perhaps be an attribute indicating that the frame is closed, and that calling copyFromChannel will throw.
there will be no way to check if a frame is closed from JS
isn't the "classic" way to check this buffer.length == 0? I reckon it's not explicit per se, but it's what authors do today to check whether a buffer has been detached.
there will be no way to check if a frame is closed from JS
isn't the "classic" way to check this buffer.length == 0? I reckon it's not explicit per se, but it's what authors do today to check whether a buffer has been detached.
Yes. But if we remove buffer in favor of copyFromChannel(), as per Chris' comment:
// Deprecated. Will make a lazy copy.
readonly attribute AudioBuffer buffer;
We won't be able to check this anymore. Furthermore, the act of checking whether a frame is closed could itself force the potential lazy copy.
For reference, I dug up the ECMAScript spec on detachment https://tc39.es/ecma262/#sec-detacharraybuffer
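A quick demonstration of that check; the detach here is done with a `structuredClone` transfer, which invokes DetachArrayBuffer on the source:

```javascript
// After detachment, the only observable signal from JS is that the
// buffer's length collapses to zero.
const ab = new ArrayBuffer(8);
console.log(ab.byteLength); // 8

structuredClone(ab, { transfer: [ab] }); // detaches `ab`

console.log(ab.byteLength); // 0, the "classic" detachment check
```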
WDYT of trying to follow the TypedArray style by similarly zeroing out analogous fields of AudioData (length, duration, sampleRate, numberOfChannels, ...)? We can make this a clear signal by spec'ing that it is otherwise invalid to construct an AudioData from an empty BufferSource.
For those reading along, the PR landed with unresolved issues (done to unblock approved dependent PRs). I'll send a follow up shortly.
Thanks. I opened https://github.com/w3c/webcodecs/issues/223 to follow up, with a link to the context from Dale, you, and myself.
Closing this one, as follow up work now tracked in #223
Quite a few programs or libraries don't really use the Web Audio API with the native nodes etc., and rely on either an AudioWorklet or even a ScriptProcessorNode to play their audio. All the DSP happens in user code (probably WASM), because that's code that is already written, and the Web is "just" another target in this context. It seems like we could add an output mode for our audio decoders that would write to a user-provided buffer, to minimize copies.
This would also sidestep the issue where (for now) the AudioBuffer object is always planar f32; quite a few authors would like integers for memory-footprint reasons, and also because that's what codec libraries frequently output, so there is no need for a useless conversion. Forcing the use of AudioBuffers (which have quite a few constraints) when any user of WASM probably has their own audio engine seems like a sub-optimal solution for them. This would also simplify the spec and the code, I guess.