The Empowering Musicians and Artists using Machine Learning to Build Their Own Tools in the Browser talk by @Louismac notes AudioWorklet as a partial solution for use cases with strict action-response latency requirements, such as:
[...] connecting inputs from a variety of sources, running potentially computationally expensive feature extractors alongside lightweight machine learning models and generating audio and visual output, in real time, without interference.
@Louismac and @teropa, in your experience, are there known feature gaps in the core AudioWorklet API that make it less than optimal for your use cases? I understand that some of the known implementation issues in Chrome around AudioWorklet-related garbage collection have been addressed recently. @teropa suggested in his talk looking into exposing inference capabilities in an AudioWorklet context, which has been noted as a possible future exploration.
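To make the pattern concrete, here is a minimal sketch of what both talks describe, a feature extractor plus a lightweight model running inside the real-time audio callback. The RMS feature and the single-weight "model" are illustrative placeholders, not code from either talk:

```js
// tiny-inference-processor.js, loaded via audioContext.audioWorklet.addModule()
class TinyInferenceProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    // Pre-allocate everything up front: allocating inside process()
    // creates garbage-collection pressure on the real-time audio thread.
    this.weights = new Float32Array([0.8, 0.2]); // placeholder "model"
    this.features = new Float32Array(2);
  }

  process(inputs, outputs) {
    const input = inputs[0][0];   // first channel of the first input
    const output = outputs[0][0];
    if (!input) return true;      // keep the node alive while unconnected

    // Lightweight feature extraction: RMS energy of this 128-frame block.
    let sum = 0;
    for (let i = 0; i < input.length; i++) sum += input[i] * input[i];
    this.features[0] = Math.sqrt(sum / input.length);
    this.features[1] = 1; // bias term

    // "Inference": a single dot product standing in for a small model.
    const gain = this.features[0] * this.weights[0] + this.weights[1];

    // Generate output in the same callback, keeping action-response latency low.
    for (let i = 0; i < output.length; i++) output[i] = input[i] * gain;
    return true;
  }
}
registerProcessor('tiny-inference', TinyInferenceProcessor);
```

On the main thread this would be instantiated with `new AudioWorkletNode(audioContext, 'tiny-inference')` after the module has been added.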
The Interactive ML - Powered Music Applications on the Web talk by @teropa explains how a key design consideration in apps for musical instruments is the latency between user input (e.g. a key press on an instrument, a video input) and musical output, as illustrated by the Action-Response Cycle:
This cycle must execute within ~0-20 ms for the experience to feel natural.
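For scale, the Web Audio API renders audio in fixed 128-frame blocks, so the per-block time budget can be worked out directly. This is a back-of-the-envelope sketch, not a figure from the talk:

```js
const sampleRate = 48000;    // typical hardware rate; 44100 Hz is also common
const renderQuantum = 128;   // Web Audio API render block size, in frames
const blockMs = (renderQuantum / sampleRate) * 1000;
console.log(blockMs.toFixed(2)); // ~2.67 ms per block at 48 kHz
// A ~20 ms action-response budget therefore leaves room for only a handful
// of blocks to cover input handling, inference, and synthesis combined.
```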
Real-time audio is mentioned as a capability that is currently very constrained on the web platform:
A particularly demanding task is generating actual audio data in the browser with ML (as opposed to generating symbolic music data). Proposals mentioned for consideration that may help lower latency in this scenario:
Another use case, involving video input from a webcam and musical output, has the following per-frame path:
Notably, the steps to get data into the model (Webcam MediaStream > Draw to Canvas > Build Pixel Tensor) account for half of the per-frame time.
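For illustration, that per-frame path is typically written along these lines today. The `tf.browser.fromPixels` call assumes TensorFlow.js; any equivalent tensor-building step has the same two-copy shape:

```js
// Per-frame path: MediaStream -> <video> -> canvas -> pixel tensor.
const video = document.createElement('video');
video.srcObject = await navigator.mediaDevices.getUserMedia({ video: true });
await video.play();

const canvas = document.createElement('canvas');
canvas.width = 224;  // model input size, illustrative
canvas.height = 224;
const ctx = canvas.getContext('2d');

function buildPixelTensor() {
  // Copy 1: draw the current video frame onto the canvas.
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  // Copy 2: read the pixels back out of the canvas into a tensor.
  return tf.browser.fromPixels(canvas);
}
```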
This canvas round-trip (copying rendered video frames to a canvas element, processing pixels extracted from the canvas, and rendering the result back to a canvas) was also identified as an inefficient path in the Media processing hooks for the Web talk by @tidoust.
@teropa concludes that this calls for APIs that provide better abstractions for feeding input data into ML models.
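One possible shape for such an abstraction, sketched here with the insertable-streams `MediaStreamTrackProcessor` API (an assumed example with varying browser support, not a proposal from the talks), hands raw frames to the model with no canvas hop:

```js
// Read raw VideoFrame objects straight from a camera track.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();
const reader = new MediaStreamTrackProcessor({ track }).readable.getReader();

for (;;) {
  const { value: frame, done } = await reader.read();
  if (done) break;
  // `frame` is a VideoFrame; its pixel data can be copied out directly
  // (e.g. via frame.copyTo(buffer)) and handed to a model.
  frame.close(); // close promptly, or capture will stall
}
```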
In summary, the talk outlines the following areas as important:
This issue is for discussing proposals that involve Web API surface improvements, as well as other problematic aspects of real-time use cases involving audio.
Looping in @padenot both for AudioWorklet expertise and to reflect on the recent WebCodecs work that might also help with these real-time audio use cases. Feel free to tag other folks who might be interested.
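For context on the WebCodecs angle, below is a minimal sketch of its audio decoding surface. The callback bodies and the choice of Opus are illustrative; the relevant point is that decoded audio arrives as raw `AudioData` that can be fed to a feature extractor or an AudioWorklet without going through a media element:

```js
const decoder = new AudioDecoder({
  output: (audioData) => {
    // audioData exposes raw PCM; copy a plane out for further processing.
    const pcm = new Float32Array(audioData.numberOfFrames);
    audioData.copyTo(pcm, { planeIndex: 0 });
    audioData.close();
    // ...hand `pcm` to a feature extractor / model here...
  },
  error: (e) => console.error(e),
});
decoder.configure({ codec: 'opus', sampleRate: 48000, numberOfChannels: 1 });
// Encoded input is then fed in as EncodedAudioChunk objects via decoder.decode().
```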