nnyj opened this issue 1 year ago
Woah, this is super cool, thanks for sharing! What is/was your use case, out of curiosity?
I see you had to pull the audio_separator code into your own project and make a bunch of changes to it to make it work for a live stream, which is understandable but also kinda unfortunate, as it means any further improvements to this project won't be easy to pull in.
@nnyj - how would you feel if I refactored / reintegrated your code into this project, to essentially just add a live mode to audio-separator? I'd of course then add you as a maintainer of this project too, so you could push your own updates / continued improvements to it.
Totally fair if you'd prefer to keep your work in your own repo / a separate project, but just thought I'd ask 😄
Hey, thanks for your interest. Feel free to re-integrate it into this project; it is open source, after all! 😄 The POC code is admittedly messy, though.
The idea of running inference in real time came more out of curiosity, since GPUs have become fast enough to split stems many times faster than real time. In a quick test using the UVR GUI, I was able to achieve 5.48x real-time during conversion with an ensemble model (MDX-NET Inst Main + Inst 3 + Kim Vocal 2).
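(For reference, "5.48x real-time" is just the ratio of audio duration to processing time; the numbers below are made up for illustration, not measured.)

```python
# Real-time factor = audio length / time taken to process it.
audio_s = 240.0                       # a 4-minute track (illustrative)
process_s = 43.8                      # time the ensemble took (illustrative)
print(f"{audio_s / process_s:.2f}x")  # -> 5.48x
```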
The use cases are of course endless. I'm a huge fan of instrumental music and enjoy listening to my library this way without having to convert everything first. Anyone could also, say, stream music from online services such as YouTube/Spotify without having the actual audio files.
A quick look at other projects shows that there has been similar interest and similar requests:
Source separation has come a long way, and I found that the MDX-Net models strike a good balance between inference time and audio quality. But I think there is an inherent buffer of at least 1-2 seconds required for the models to do their magic, so it may never achieve "full" real-time. For my use case, that is perfectly fine, though.
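A minimal sketch of why that buffer exists, assuming a hypothetical `separate()` wrapper around the MDX-Net inference and a hypothetical `play()` audio sink: the model needs a full window of context before it can emit anything, so output always lags input by at least the window length.

```python
import numpy as np

SAMPLE_RATE = 44100
WINDOW_S = 2.0                         # assumed model context, per the 1-2 s above
WINDOW = int(SAMPLE_RATE * WINDOW_S)   # samples needed before inference can run

buffer = np.zeros(0, dtype=np.float32)

def on_audio(frames: np.ndarray) -> None:
    """Feed each incoming block of live samples into the separator."""
    global buffer
    buffer = np.concatenate([buffer, frames])
    # No output can be produced until a full window has accumulated,
    # so the stream lags the source by at least WINDOW_S seconds.
    while len(buffer) >= WINDOW:
        chunk, buffer = buffer[:WINDOW], buffer[WINDOW:]
        stems = separate(chunk)        # hypothetical MDX-Net wrapper
        play(stems["instrumental"])    # hypothetical audio sink
```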
Hello @nnyj
You have provided a nice summary here. I am also facing the same constraint of a minimum of ~1.5 seconds needed for e.g. the spectrogram computation, so that the AI has enough temporal/contextual information to do the separation.
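To put that constraint in rough numbers (the STFT settings below are assumed for illustration, not taken from any particular model): the network consumes a fixed number of spectrogram frames, and those frames span a fixed stretch of audio.

```python
SAMPLE_RATE = 44100
N_FFT = 4096     # assumed FFT window size
HOP = 1024       # assumed hop length
N_FRAMES = 64    # assumed frames per model input

# Audio needed to fill one model input: (frames - 1) hops plus one full window.
context_s = ((N_FRAMES - 1) * HOP + N_FFT) / SAMPLE_RATE
print(f"{context_s:.2f} s")  # -> ~1.56 s, in the ballpark of the 1.5 s above
```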
Does anyone here have an idea of how to overcome this? It should be possible, since there are audio plugins that run in "real-time". Here, the author even reports a latency as low as 46.4 ms: https://github.com/james34602/SpleeterRT/issues/8
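Interestingly, 46.4 ms is almost exactly 2048 samples at 44.1 kHz, which suggests the plugin emits output one small hop at a time while still analysing a longer, overlapping window internally; latency is then bounded by the hop, not the window. That is a guess from the number itself, not from reading the SpleeterRT source.

```python
SAMPLE_RATE = 44100
HOP = 2048                       # hypothetical per-block hop size
print(HOP / SAMPLE_RATE * 1000)  # -> 46.4 ms of input-to-output lag
```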
I combined this amazing tool with the real-time implementation from https://github.com/facebookresearch/denoiser. Hopefully this might be useful to someone.
Proof-of-concept: https://github.com/nnyj/python-audio-separator-live
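For anyone curious about the overall shape of such a live pipeline, here is a minimal sketch of a capture → buffer → infer → play loop using `sounddevice`. `separate()` is a hypothetical stand-in for the model call, and the chunk size is assumed; the actual POC above handles much more (device routing, overlap, GPU batching), so treat this as an outline, not its implementation.

```python
import queue
import threading

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 44100
BLOCK = 4096                 # frames per audio callback
CHUNK = SAMPLE_RATE * 2      # assumed ~2 s of audio per inference

in_q: queue.Queue = queue.Queue()
out_q: queue.Queue = queue.Queue()
play_buf = np.zeros(0, dtype=np.float32)

def worker() -> None:
    """Accumulate a full chunk off the audio thread, then run the model."""
    buf = np.zeros(0, dtype=np.float32)
    while True:
        buf = np.concatenate([buf, in_q.get()])
        while len(buf) >= CHUNK:
            chunk, buf = buf[:CHUNK], buf[CHUNK:]
            out_q.put(separate(chunk))   # hypothetical MDX-Net wrapper

def callback(indata, outdata, frames, time, status) -> None:
    global play_buf
    in_q.put(indata[:, 0].copy())
    while not out_q.empty():             # drain any finished chunks
        play_buf = np.concatenate([play_buf, out_q.get_nowait()])
    if len(play_buf) >= frames:
        outdata[:, 0], play_buf = play_buf[:frames], play_buf[frames:]
    else:
        outdata.fill(0)                  # still buffering: the inherent latency

threading.Thread(target=worker, daemon=True).start()
with sd.Stream(samplerate=SAMPLE_RATE, blocksize=BLOCK,
               channels=1, callback=callback):
    sd.sleep(60_000)                     # run for a minute
```

The inference runs on a worker thread rather than in the audio callback, since a multi-second model call inside the callback would cause dropouts.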