Closed padenot closed 3 years ago
Agreed, AudioBuffer is limiting. AudioFrame was designed to minimize dependence on it (the AudioBuffer only needs to be created when accessed), so we can add new ways to use the audio data in a backwards-compatible way.
We don't have a buffer pool API to hook this into, but there are other parts of WebCodecs that could benefit from such a thing, and I expect we'll work on that in v2.
Would you prefer it was changed to a getter method rather than an attribute?
As discussed with @chcunningham, this is expected to be more important and useful for authors than using AudioBuffer, and this would be in addition to AudioBuffer, not instead of it, modulo the design point mentioned in https://github.com/WebAudio/web-audio-api-v2/issues/119#issuecomment-808330658.
This would remove the dependency on Web Audio API V2 and allow this spec to be self-contained. I'd be open to an implementation shipping with this and not the AudioBuffer variant, provided the API allows for feature detection.
The general idea is to pass ArrayBuffers for input and output, detaching the buffers on input, and then giving both buffers back on output, to allow authors to implement their own recycling if they want. This is not a standard pattern (yet), but it should be, since it's the closest thing to having good memory-access patterns until there is read-only memory.
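A minimal sketch of this recycling pattern in plain JavaScript, assuming a hypothetical `processDetaching` step standing in for the decoder (WebCodecs exposes no such API today); the detach is simulated here with a `structuredClone` transfer:

```javascript
// Hypothetical sketch: take ownership of the caller's buffer by
// detaching it, do some work, then hand the storage back so the
// caller can recycle it instead of reallocating for the next call.
function processDetaching(inputBuffer) {
  // After this, the caller's reference is detached (byteLength === 0),
  // so nobody can race on the contents while we work on them.
  const owned = structuredClone(inputBuffer, { transfer: [inputBuffer] });

  // ...do some work on `owned` (here: fill with a constant)...
  new Uint8Array(owned).fill(42);

  // Give the buffer back so the caller can reuse it.
  return { output: owned };
}

const buf = new ArrayBuffer(16);
const { output } = processDetaching(buf);
console.log(buf.byteLength);    // 0: the original reference is detached
console.log(output.byteLength); // 16: same storage, new owner
```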
We might be able to do real zero-copy for WASM on the output if we do it correctly, but it means the API will have to diverge even more from what is usually done (essentially passing in the Memory object, an offset, and a length).
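A rough sketch of what the (Memory, offset, length) shape could look like from the JS side; `WebAssembly.Memory` is real, but the idea of a decoder writing into it is purely hypothetical here:

```javascript
// Hypothetical: a decoder writes samples straight into WASM linear
// memory at a caller-chosen offset, and the WASM module reads them
// with no copy on the JS side.
const memory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB
const offsetBytes = 1024; // where the caller wants the samples
const frames = 128;       // sample-frames to "decode"

// What a decoder-side write into caller memory could look like:
const out = new Float32Array(memory.buffer, offsetBytes, frames);
for (let i = 0; i < frames; i++) out[i] = Math.sin(i / frames);

// The WASM module (or any other view) sees the same storage:
const view = new Float32Array(memory.buffer, offsetBytes, frames);
console.log(view[1] === out[1]); // true, no copy happened
```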
This issue has some overlap with a parallel discussion in https://github.com/w3c/webcodecs/pull/162#discussion_r622261962
TL;DR - AudioBuffer forces some snapshotting behavior ("acquire the content") which is not super intuitive.
What if we do something like this:
interface AudioFrame {
  constructor(AudioFrameInit init);

  readonly attribute long long timestamp;  // microseconds

  // Deprecated. Will make a lazy copy.
  readonly attribute AudioBuffer buffer;

  readonly attribute unsigned long length;  // in sample-frames
  readonly attribute double duration;       // in seconds
  readonly attribute float sampleRate;      // in sample-frames per second

  // Channel access
  readonly attribute unsigned long numberOfChannels;
  void copyFromChannel(Float32Array destination,
                       unsigned long channelNumber,
                       optional unsigned long bufferOffset = 0);

  // Creates a new AudioFrame, which needs to be independently closed.
  AudioFrame clone();

  void close();
};
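To illustrate the intended semantics, here is a plain JavaScript mock of the proposal above (`MockAudioFrame` is made up for illustration; a real implementation would be backed by decoder-owned media, not JS-side copies):

```javascript
// Mock of the proposed interface, showing the intended
// copyFromChannel / clone / close semantics. Not a polyfill.
class MockAudioFrame {
  constructor({ timestamp, sampleRate, channels }) {
    this.timestamp = timestamp;   // microseconds
    this.sampleRate = sampleRate; // sample-frames per second
    this._channels = channels;    // Array of Float32Array, planar
    this._closed = false;
  }
  get numberOfChannels() { return this._channels.length; }
  get length() { return this._closed ? 0 : this._channels[0].length; }
  get duration() { return this.length / this.sampleRate; }
  copyFromChannel(destination, channelNumber, bufferOffset = 0) {
    if (this._closed) throw new Error('InvalidStateError: frame is closed');
    destination.set(this._channels[channelNumber]
      .subarray(bufferOffset, bufferOffset + destination.length));
  }
  clone() {
    // A clone must be closed independently of the source frame.
    return new MockAudioFrame({
      timestamp: this.timestamp,
      sampleRate: this.sampleRate,
      channels: this._channels.map((c) => c.slice()),
    });
  }
  close() { this._closed = true; this._channels = []; }
}

const frame = new MockAudioFrame({
  timestamp: 0, sampleRate: 48000,
  channels: [Float32Array.of(0.1, 0.2, 0.3)],
});
const dest = new Float32Array(2);
frame.copyFromChannel(dest, 0, 1); // dest now holds [0.2, 0.3] (as float32)
frame.close();
// frame.copyFromChannel(dest, 0) would now throw
```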
Basically, hoist up the immutable AudioBuffer methods and deprecate AudioBuffer itself. This fixes some of our issues while saving the matter of transferring a BYOB for later.
The above needs sampleFormat; at first just planar float32 (we can bikeshed on the name).
discussed whether to bake "planar" or "interleaved" into a single enum that simultaneously describes the format (float32)... no strong feeling. I lean toward a combined enum (for rough symmetry with VideoFrame's pixel format)
linear vs. alaw/ulaw: resolved: let's only have linear in "frames". Treat alaw/ulaw as "encoded"; they would be acceptable in chunks (we'll decode to linear)
re: sample format, I propose we borrow the naming conventions used in ffmpeg, but drop the AV_SAMPLE_FMT prefix (analogous to the "I420" pixel format used in VideoFrame).
Something like this could be a good starting list:
U8,   ///< unsigned 8 bits
S16,  ///< signed 16 bits
S32,  ///< signed 32 bits
FLT,  ///< float
S16P, ///< signed 16 bits, planar
S32P, ///< signed 32 bits, planar
FLTP, ///< float, planar
These are the formats Chromium seems to recognize from ffmpeg. LMK if this is missing something obvious or if one of these is really rare. My first PR will just add FLTP to keep it simple (matching what we already have with AudioBuffer).
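For illustration, converting between two of the proposed formats, interleaved S16 in and FLTP out, is a straightforward loop (ordinary JavaScript, nothing WebCodecs-specific; the function name is made up):

```javascript
// De-interleave signed 16-bit samples into one Float32Array per
// channel, scaling [-32768, 32767] into [-1.0, 1.0).
function s16InterleavedToFltp(s16, numberOfChannels) {
  const frames = s16.length / numberOfChannels;
  const planes = [];
  for (let c = 0; c < numberOfChannels; c++) {
    const plane = new Float32Array(frames);
    for (let i = 0; i < frames; i++) {
      plane[i] = s16[i * numberOfChannels + c] / 32768;
    }
    planes.push(plane);
  }
  return planes;
}

// Stereo: L0 R0 L1 R1 becomes [L0, L1] and [R0, R1]:
const [left, right] = s16InterleavedToFltp(
  Int16Array.of(-32768, 0, 16384, 32767), 2);
console.log(left[0], left[1]);  // -1 0.5
console.log(right[0]);          // 0
```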
I was going to ask whether S32 was in fact used to represent 24-bit integer audio samples (which are quite common), but your link answers that (if we're doing it the same way ffmpeg does, which I think is sensible, for the same reasons: there is no 24-bit type in WASM either, and nobody wants to do the computations by hand).
resolved: lets only have linear in "frames". treat alaw/ulaw as "encoded" - would be acceptable in chunks (we'll decode to linear)
to clarify with an example: ulaw would decode to 16-bit integers so that the dynamic range is not lost.
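For reference, the standard G.711 µ-law expansion shows why 16 bits are needed: a single µ-law byte covers roughly 14 bits of dynamic range, so 16-bit integers are the smallest common linear format that preserves it. This is the textbook decode, nothing WebCodecs-specific:

```javascript
// Standard G.711 µ-law expansion: one encoded byte to a signed
// 16-bit linear sample.
function ulawByteToS16(byte) {
  const u = ~byte & 0xff; // µ-law bytes are stored complemented
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -sample : sample;
}

console.log(ulawByteToS16(0xff)); // 0 (digital silence)
console.log(ulawByteToS16(0x80)); // 32124, the positive full-scale value
```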
Small comment on the proposed interface above:
Once buffer is removed, there will be no way to check if a frame is closed from JS. There should perhaps be an attribute indicating that the frame is closed, and that calling copyFromChannel will throw.
there will be no way to check if a frame is closed from JS
isn't the "classic" way to check this buffer.length == 0? I reckon it's not explicit per se, but it's what authors do today to check whether a buffer has been detached.
there will be no way to check if a frame is closed from JS
isn't the "classic" way to check this buffer.length == 0? I reckon it's not explicit per se, but it's what authors do today to check whether a buffer has been detached.
Yes. But if we remove buffer in favor of copyFromChannel(), as per Chris' comment:
// Deprecated. Will make a lazy copy.
readonly attribute AudioBuffer buffer;
We won't be able to check this anymore. Furthermore, the act of checking whether a frame is closed could itself force the potential lazy copy.
For reference, I dug up the ECMAScript spec on detachment https://tc39.es/ecma262/#sec-detacharraybuffer
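A quick demonstration of that check; the detach here is done with a `structuredClone` transfer, which invokes DetachArrayBuffer on the source:

```javascript
// After detachment, the only observable signal from JS is that the
// buffer's length collapses to zero.
const ab = new ArrayBuffer(8);
console.log(ab.byteLength); // 8

structuredClone(ab, { transfer: [ab] }); // detaches `ab`

console.log(ab.byteLength); // 0, the "classic" detachment check
```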
WDYT of trying to follow the TypedArray style by similarly zeroing out analogous fields of AudioData (length, duration, sampleRate, numberOfChannels, ...)? We can make this a clear signal by spec'ing that it is otherwise invalid to construct an AudioData from an empty BufferSource.
For those reading along, the PR landed with unresolved issues (done to unblock approved dependent PRs). I'll send a follow up shortly.
Thanks. I opened https://github.com/w3c/webcodecs/issues/223 to follow up, with a link to the context from Dale, you, and myself.
Closing this one, as follow up work now tracked in #223
Quite a few programs or libraries don't really use the Web Audio API with the native nodes etc., and rely on either an AudioWorklet or even a ScriptProcessorNode to play their audio. All the DSP happens in user code (probably WASM), because that's code that is already written, and the Web is "just" another target in this context. It seems like we could add an output mode for our audio decoders that would write to a user-provided buffer, to minimize copies.
This would also sidestep the issue where (for now) the AudioBuffer object is always planar f32; quite a few authors would like integers for memory-footprint reasons, and also because that's what codec libraries frequently output, so there is no need for a useless conversion. Forcing the use of AudioBuffers (which have quite a few constraints) when any user of WASM probably has their own audio engine seems like a sub-optimal solution for them. This would also simplify the spec and the code, I guess.