w3c / machine-learning-workshop

Site of W3C Workshop on Web & Machine Learning
https://www.w3.org/2020/06/machine-learning-workshop/

Action-Response Cycle bottlenecks in interactive music apps #97

Open anssiko opened 4 years ago

anssiko commented 4 years ago

The Interactive ML - Powered Music Applications on the Web talk by @teropa explains that a key design consideration in musical instrument apps is the latency between user input (e.g. a key press on an instrument, or video input) and the musical output, as illustrated by the Action-Response Cycle:

User Input > Create Input Tensor > Upload to GPU > Run Inference > Download from GPU > Process Output Tensors > Musical Output

This cycle must execute within ~0-20 ms for the experience to feel natural.
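
For illustration, a minimal sketch of one iteration of that cycle, assuming a TensorFlow.js-style model; the key-feature extraction and the playNote helper are hypothetical stand-ins, not from the talk:

```ts
import * as tf from "@tensorflow/tfjs";

// Hypothetical: map a key-press feature vector to note probabilities and play the result.
async function onKeyPress(
  model: tf.LayersModel,
  keyFeatures: number[],
  ctx: AudioContext
): Promise<void> {
  const t0 = performance.now();

  // Create Input Tensor (upload to the GPU happens inside the library, e.g. with the WebGL backend)
  const input = tf.tensor2d([keyFeatures]);

  // Run Inference
  const output = model.predict(input) as tf.Tensor;

  // Download from GPU: the readback is often the slow, hard-to-predict step
  const probs = await output.data();

  // Process Output Tensors: pick the most likely note (simple argmax)
  let note = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[note]) note = i;
  }

  input.dispose();
  output.dispose();

  // Musical Output (hypothetical synth voice)
  playNote(ctx, note);

  console.log(`action to response: ${(performance.now() - t0).toFixed(1)} ms`); // budget: ~20 ms
}

// Hypothetical helper: play a short tone for the given MIDI note number.
function playNote(ctx: AudioContext, midiNote: number): void {
  const osc = ctx.createOscillator();
  osc.frequency.value = 440 * Math.pow(2, (midiNote - 69) / 12);
  osc.connect(ctx.destination);
  osc.start();
  osc.stop(ctx.currentTime + 0.25);
}
```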

Real-time audio is mentioned as a very constrained capability on the web platform currently:

[...] you have this task of generating 48,000 audio samples per second per channel consistently without fault. Because if you fail to do that you have an audible glitch in your outputs. So it's a very hard constraint, and it has to be deterministic because of this reason.
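
To make the constraint concrete: an AudioWorkletProcessor is called once per 128-frame render quantum, roughly every 2.7 ms at 48 kHz, and its process() callback must finish in time, every time. A minimal sketch of such a processor (a plain sine generator, no ML), with ambient declarations for the worklet globals:

```ts
// Runs in the AudioWorklet global scope.
// Minimal ambient declarations for the worklet globals (normally provided by a .d.ts file).
declare const sampleRate: number;
declare abstract class AudioWorkletProcessor {
  readonly port: MessagePort;
}
declare function registerProcessor(
  name: string,
  ctor: new () => AudioWorkletProcessor
): void;

class SineProcessor extends AudioWorkletProcessor {
  private phase = 0;

  // Called once per 128-frame render quantum, i.e. roughly every 2.7 ms at 48 kHz.
  // Everything here must be allocation-free and deterministic, or the output glitches.
  process(_inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const channel = outputs[0][0];
    const phaseInc = (2 * Math.PI * 440) / sampleRate;
    for (let i = 0; i < channel.length; i++) {
      channel[i] = 0.1 * Math.sin(this.phase);
      this.phase += phaseInc;
    }
    return true; // keep the processor alive
  }
}

registerProcessor("sine-processor", SineProcessor);
```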

A particularly demanding task is generating actual audio data in the browser with ML (as opposed to generating symbolic music data with ML). The talk mentions proposals for consideration that may help lower the latency in this scenario.

Another use case that involves video input (from webcam) and musical output has the following per-frame path:

Webcam MediaStream > Draw to Canvas > Build Pixel Tensor > Upload to GPU > Run Inference > Download from GPU > Process Output Tensors > Musical Output

Notably, the steps needed to get data into the model (Webcam MediaStream > Draw to Canvas > Build Pixel Tensor) take about half of the total time.
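
A rough sketch of that per-frame path, assuming a TensorFlow.js-style model and a hypothetical output-mapping step; the front half (draw to canvas, build pixel tensor) is the part identified above as costly:

```ts
import * as tf from "@tensorflow/tfjs";

// Hypothetical: map the model output to synthesis parameters.
declare function updateSynthParams(
  ctx: AudioContext,
  values: Float32Array | Int32Array | Uint8Array
): void;

function startFrameLoop(model: tf.LayersModel, video: HTMLVideoElement, ctx: AudioContext): void {
  const canvas = document.createElement("canvas");
  canvas.width = 224; // hypothetical model input size
  canvas.height = 224;
  const ctx2d = canvas.getContext("2d")!;

  async function onFrame(): Promise<void> {
    // Draw to Canvas > Build Pixel Tensor (the half of the path flagged as expensive)
    ctx2d.drawImage(video, 0, 0, canvas.width, canvas.height);
    const input = tf.tidy(() =>
      tf.browser.fromPixels(canvas).toFloat().div(255).expandDims(0)
    );

    // Upload to GPU > Run Inference > Download from GPU
    const output = model.predict(input) as tf.Tensor;
    const values = await output.data();

    input.dispose();
    output.dispose();

    // Process Output Tensors > Musical Output
    updateSynthParams(ctx, values);

    requestAnimationFrame(onFrame);
  }

  requestAnimationFrame(onFrame);
}
```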

This canvas-based path (copying rendered video frames to a canvas element, processing pixels extracted from the canvas, and rendering the result to a canvas) was also identified as an inefficient path in the Media processing hooks for the Web talk by @tidoust.

This calls for APIs that provide better abstractions for feeding input data into ML models, @teropa concludes:

Could there be some APIs that give me abstractions to do this in a more direct way to get immediate input into my machine learning model, without having to do quite so much work and run quite so much slow code on each frame.
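
For context, one possible shape for such an abstraction is reading VideoFrame objects straight off the camera's MediaStreamTrack, skipping the canvas round-trip. A rough sketch, assuming a browser that exposes MediaStreamTrackProcessor from the WebCodecs/insertable-streams work (not something the talk prescribes, and not universally available), with the model-feeding step left hypothetical:

```ts
// Ambient declaration for MediaStreamTrackProcessor, which is not yet in every TS dom lib;
// the VideoFrame type is assumed to come from the WebCodecs definitions in recent TS versions.
declare class MediaStreamTrackProcessor {
  constructor(init: { track: MediaStreamTrack });
  readonly readable: ReadableStream<VideoFrame>;
}

// Hypothetical model-feeding step; the point is that it receives the frame directly.
declare function runInferenceOnFrame(frame: VideoFrame): Promise<void>;

async function consumeFrames(track: MediaStreamTrack): Promise<void> {
  const processor = new MediaStreamTrackProcessor({ track });
  const reader = processor.readable.getReader();

  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done || !frame) break;

    await runInferenceOnFrame(frame);

    // VideoFrames must be closed promptly to release the underlying buffers.
    frame.close();
  }
}
```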

As a summary, the talk outlines the following areas as important:

  • Low and predictable latency
  • Not compromising CPU/GPU needed by the UI or Audio
  • Inference in AudioWorklet context - Wasm or native [WebNN]?
  • Media integration (e.g. fast streaming inputs from MediaStream)

This issue is for discussing proposals that involve Web API surface improvements, as well as other problematic aspects of real-time use cases that involve audio.

Looping in @padenot for AudioWorklet expertise as well as to reflect on the recent work on WebCodecs that might also help with these real-time audio use cases. Feel free to tag other folks who might be interested.

anssiko commented 4 years ago

The Empowering Musicians and Artists using Machine Learning to Build Their Own Tools in the Browser talk by @Louismac also notes AudioWorklets as a partial solution to use cases that have strict action-response latency requirements, such as:

[...] connecting inputs from a variety of sources, running potentially computationally expensive feature extractors alongside lightweight machine learning models and generating audio and visual output, in real time, without interference.

@Louismac and @teropa, in your experience, are there known feature gaps in the core AudioWorklets API that make the API not optimal for your use cases? I've understood some of the known implementation issues in Chrome around AudioWorklets-related garbage collection have been addressed recently. @teropa made a suggestion in his talk to look into exposing inference capabilities in a AudioWorklet context, which has been noted as a possible future exploration.