nomadkaraoke / python-audio-separator

Easy to use stem (e.g. instrumental/vocals) separation from CLI or as a python package, using a variety of amazing pre-trained models (primarily from UVR)
MIT License

[Feature request] Stream audio from input device in real-time #3

Open nnyj opened 1 year ago

nnyj commented 1 year ago

I combined this amazing tool with the real-time implementation from https://github.com/facebookresearch/denoiser. Hopefully this is useful to someone.

Proof-of-concept: https://github.com/nnyj/python-audio-separator-live
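
For anyone curious, the general shape of the approach is roughly this (a minimal sketch, not the POC's actual code; it assumes the audio-separator Python API plus the `sounddevice` and `soundfile` packages, and the chunk length is an illustrative choice):

```python
# Minimal sketch: chunked live separation (illustrative, not the POC's actual code).
import tempfile

import sounddevice as sd
import soundfile as sf
from audio_separator.separator import Separator

SAMPLE_RATE = 44100
CHUNK_SECONDS = 2.0  # assumed buffer length; see the 1-2 s minimum discussed below

separator = Separator()
separator.load_model()  # loads the default model

while True:
    # Block until we have CHUNK_SECONDS of audio from the default input device
    chunk = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=2)
    sd.wait()

    # audio-separator's separate() works on files, so round-trip through a temp WAV
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        sf.write(tmp.name, chunk, SAMPLE_RATE)
        output_files = separator.separate(tmp.name)  # e.g. [vocals, instrumental]
    # NB: a real implementation would capture and infer concurrently to avoid
    # gaps between chunks, and would play back / buffer the chosen stem here.
```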

beveradb commented 1 year ago

Woah, this is super cool, thanks for sharing! What is/was your use case, out of curiosity?

I see you had to pull the audio_separator code into your own project and make a bunch of changes so it works for a live stream, which is understandable but also kinda unfortunate, as it means any further improvements to this project won't be easy to pull in.

@nnyj - how would you feel if I refactored / reintegrated your code into this project, to essentially just add a live mode to audio-separator? I'd of course then add you as a maintainer of this project too so you could push your own updates / continued improvements to it. Totally fair if you'd prefer to keep your work in your own repo / a separate project, but just thought I'd ask 😄

nnyj commented 1 year ago

Hey, thanks for your interest. Feel free to re-integrate it into this project, it is open source after all! 😄 The POC code is admittedly messy, though.

The idea of implementing inference in real-time came more out of curiosity, since GPUs have become fast enough to split the stems at several times real-time speed. In a quick test using the UVR GUI, I was able to achieve 5.48x real-time during conversion with an ensemble model (MDX-NET Inst Main + Inst 3 + Kim Vocal 2).
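
(For reference, "5.48x real-time" means the audio is processed 5.48 times faster than its duration. A rough way to measure the real-time factor with this package's Python API, as an illustrative sketch where `input.wav` is a placeholder file:)

```python
# Rough real-time-factor measurement: RTF = audio duration / processing time.
import time

import soundfile as sf
from audio_separator.separator import Separator

separator = Separator()
separator.load_model()

start = time.perf_counter()
separator.separate("input.wav")  # placeholder test file
elapsed = time.perf_counter() - start

print(f"{sf.info('input.wav').duration / elapsed:.2f}x real-time")
```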

The use cases are of course endless. I'm a huge fan of instrumental music and enjoy listening to my library without having to convert everything first. Anyone could also, say, stream music from online services such as YouTube/Spotify without needing the actual audio files.

A quick look at other projects shows there has been similar interest in / requests for this elsewhere.

Source separation has come a long way, and I found the MDX-Net models strike a good balance between inference time and audio quality. But I think there is an inherent buffer of at least 1-2 seconds required for the models to do their magic, so it may never achieve "full" real-time. For my use case, that is perfectly fine, though.
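
To make the latency point concrete, here is a back-of-envelope budget under assumed numbers (a 2 s buffer and the 5.48x real-time throughput measured above):

```python
# Back-of-envelope latency for chunked separation (assumed numbers).
chunk_seconds = 2.0          # audio buffered before the model can run
realtime_factor = 5.48       # measured throughput from the UVR GUI test above
inference_seconds = chunk_seconds / realtime_factor  # ~0.36 s to process the chunk
latency = chunk_seconds + inference_seconds          # ~2.36 s before output begins
print(f"end-to-end latency ~ {latency:.2f} s")
```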

SuperKogito commented 11 months ago

Hello @nnyj

You have provided a nice summary here. I am also facing the same constraint of a minimum of ~1.5 seconds needed for e.g. the spectrogram computation, so that the AI has enough temporal/contextual information to do the separation.
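
As a concrete illustration of what that context window buys the model: with assumed MDX-Net-style STFT settings (44.1 kHz, n_fft=6144, hop=1024; these are illustrative, not confirmed for any specific model), 1.5 s of audio yields only a few dozen spectrogram frames:

```python
# How many STFT frames fit in a given context window (assumed settings).
sample_rate = 44100
n_fft = 6144       # assumed analysis window size
hop_length = 1024  # assumed hop size
context_seconds = 1.5

samples = int(context_seconds * sample_rate)
frames = 1 + (samples - n_fft) // hop_length
print(f"{frames} frames of temporal context")  # ~59 frames
```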

Does anyone here have an idea of how to overcome this? It should be possible, since there are audio plugins that are "real-time". Here, the author even reports latency as low as 46.4 ms: https://github.com/james34602/SpleeterRT/issues/8
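
One common way to push latency down is a sliding window: keep the last N seconds of audio as context, re-run separation on every small hop of new input, and emit only the newest hop of the output. Latency then approaches the hop size plus inference time, at the cost of redundant computation. A minimal sketch of that idea (the general overlap technique, not SpleeterRT's actual implementation; the model call is a placeholder):

```python
# Sliding-window sketch: keep N seconds of context, emit only the newest hop.
import numpy as np

SAMPLE_RATE = 44100
WINDOW_SECONDS = 1.5  # context the model sees
HOP_SECONDS = 0.05    # ~50 ms of new audio per step -> latency ~ hop + inference

window = np.zeros(int(WINDOW_SECONDS * SAMPLE_RATE), dtype=np.float32)
hop = int(HOP_SECONDS * SAMPLE_RATE)

def separate_window(audio):
    """Placeholder for a model call that separates the whole window."""
    return audio  # a real implementation would run the separation model here

def process_hop(new_samples):
    global window
    # Slide the window: drop the oldest hop, append the newest samples
    window = np.concatenate([window[hop:], new_samples])
    separated = separate_window(window)
    # Only the last hop is new; everything before it was already emitted
    return separated[-hop:]
```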