Closed: yassinebelatar closed this issue 7 months ago.
What have you tried so far? Could you show me your logs from audio-separator
running against one of your test files so I can see some details about your system, e.g. whether it's using CUDA already or not?
```
2024-01-12 17:29:06.296 - INFO - cli - Separator version 0.13.0 beginning with input file: audio.wav
2024-01-12 17:29:08.530 - INFO - separator - Separator version 0.13.0 instantiating with output_dir: processed/hdemucs_mmi/audio, output_format: wav
2024-01-12 17:29:08.530 - DEBUG - separator - Normalization threshold set to 0.9, waveform will lowered to this max amplitude to avoid clipping.
2024-01-12 17:29:08.530 - DEBUG - separator - Denoising enabled, model will be run twice to reduce noise in output audio.
2024-01-12 17:29:08.530 - DEBUG - separator - Separation settings set: sample_rate=44100, hop_length=1024, segment_size=256, overlap=0.25, batch_size=200
2024-01-12 17:29:08.530 - INFO - separator - Checking hardware specifics to configure acceleration
2024-01-12 17:29:08.531 - INFO - separator - Operating System: Linux #1 SMP Wed Sep 6 21:10:58 UTC 2023
2024-01-12 17:29:08.531 - INFO - separator - System: Linux Node: runc Release: 5.4.254-170.358.amzn2.x86_64 Machine: x86_64 Proc: x86_64
2024-01-12 17:29:08.531 - INFO - separator - Python Version: 3.10.12
2024-01-12 17:29:08.531 - INFO - separator - ONNX Runtime GPU package installed with version: 1.17.0
2024-01-12 17:29:08.650 - DEBUG - separator - Python package: onnxruntime-silicon not installed
2024-01-12 17:29:08.770 - DEBUG - separator - Python package: onnxruntime not installed
2024-01-12 17:29:08.770 - INFO - separator - Torch package installed with version: 2.1.2
2024-01-12 17:29:08.770 - INFO - separator - Torchvision package installed with version: 0.16.2
2024-01-12 17:29:08.888 - DEBUG - separator - Python package: torchaudio not installed
2024-01-12 17:29:08.919 - INFO - separator - CUDA is available in Torch, setting Torch device to CUDA
2024-01-12 17:29:08.920 - INFO - separator - ONNXruntime has CUDAExecutionProvider available, enabling acceleration
2024-01-12 17:29:08.920 - DEBUG - separator - Apple Silicon MPS/CoreML not available in Torch installation. If you expect this to work, please see README
2024-01-12 17:29:08.920 - INFO - separator - Loading model UVR-MDX-NET-Voc_FT...
2024-01-12 17:29:08.920 - DEBUG - separator - Model path set to ./mdx/UVR-MDX-NET-Voc_FT.onnx
2024-01-12 17:29:08.922 - DEBUG - separator - Reading model settings...
2024-01-12 17:29:09.037 - DEBUG - separator - Model ./mdx/UVR-MDX-NET-Voc_FT.onnx has hash 77d07b2667ddf05b9e3175941b4454a0
2024-01-12 17:29:09.037 - DEBUG - separator - Model data path set to ./mdx/model_data.json
2024-01-12 17:29:09.039 - DEBUG - separator - Loading model data...
2024-01-12 17:29:09.042 - DEBUG - separator - Model data loaded: {'compensate': 1.021, 'mdx_dim_f_set': 3072, 'mdx_dim_t_set': 8, 'mdx_n_fft_scale_set': 7680, 'primary_stem': 'Vocals'}
2024-01-12 17:29:09.042 - DEBUG - separator - Model params: primary_stem=Vocals, secondary_stem=Instrumental
2024-01-12 17:29:09.042 - DEBUG - separator - Model params: batch_size=200, compensate=1.021, segment_size=256, dim_f=3072, dim_t=256
2024-01-12 17:29:09.043 - DEBUG - separator - Model params: n_fft=7680, hop=1024
2024-01-12 17:29:09.043 - DEBUG - separator - Loading ONNX model for inference...
2024-01-12 17:29:09.415 - DEBUG - separator - Model loaded successfully using ONNXruntime inferencing session.
2024-01-12 17:29:09.415 - DEBUG - separator - Loading model completed.
2024-01-12 17:29:09.415 - INFO - separator - Load model duration: 00:00:00
2024-01-12 17:29:09.415 - INFO - separator - Starting separation process for audio_file_path: audio.wav
2024-01-12 17:29:09.415 - DEBUG - separator - Preparing mix...
2024-01-12 17:29:09.415 - DEBUG - separator - Loading audio from file: audio.wav
2024-01-12 17:29:15.120 - DEBUG - separator - Audio loaded. Sample rate: 44100, Audio shape: (2, 21051392)
2024-01-12 17:29:15.133 - DEBUG - separator - Audio file is valid and contains data.
2024-01-12 17:29:15.133 - DEBUG - separator - Mix preparation completed.
2024-01-12 17:29:15.133 - DEBUG - separator - Normalizing mix before demixing...
2024-01-12 17:29:15.162 - DEBUG - spec_utils - Maximum peak amplitude above clipping threshold, normalizing from 1.0 to max peak 0.9.
2024-01-12 17:29:15.172 - DEBUG - separator - Starting demixing process with is_match_mix: False...
2024-01-12 17:29:15.172 - DEBUG - separator - Initializing model settings...
2024-01-12 17:29:15.181 - DEBUG - separator - Model input params: n_fft=7680 hop_length=1024 dim_f=3072
2024-01-12 17:29:15.181 - DEBUG - separator - Model settings: n_bins=3841, trim=3840, chunk_size=261120, gen_size=253440
2024-01-12 17:29:15.181 - DEBUG - separator - Original mix stored. Shape: (2, 21051392)
2024-01-12 17:29:15.181 - DEBUG - separator - Standard chunk size: 261120, Overlap: 0.25
2024-01-12 17:29:15.181 - DEBUG - separator - Generated size calculated: 253440
2024-01-12 17:29:15.219 - DEBUG - separator - Mixture prepared with padding. Mixture shape: (2, 21296640)
2024-01-12 17:29:15.219 - DEBUG - separator - Step size for processing chunks: 195840 as overlap is set to 0.25.
2024-01-12 17:29:15.219 - DEBUG - separator - Total chunks to process: 109
2024-01-12 17:29:15.219 - DEBUG - separator - Processing chunk 1/109: Start 0, End 261120
2024-01-12 17:29:15.223 - DEBUG - separator - Window applied to the chunk.
/usr/local/lib/python3.10/dist-packages/audio_separator/separator/separator.py:630: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:261.)
  mix_part = torch.tensor([mixpart], dtype=torch.float32).to(self.device)
2024-01-12 17:29:15.303 - DEBUG - separator - Mix part split into batches. Number of batches: 1
2024-01-12 17:29:15.303 - DEBUG - separator - Processing mix_wave batch 1/1
/usr/local/lib/python3.10/dist-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
```
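For reference, the derived values in the DEBUG log above all follow from the model's STFT parameters. Here is a small sketch reproducing them; the formulas are inferred from the log output, not taken from the library's source, so treat them as an illustration rather than the actual implementation:

```python
import math

# Input parameters reported in the "Separation settings" log line
n_fft = 7680
hop_length = 1024
segment_size = 256
overlap = 0.25

# Derived values, matching the "Model settings" log line
trim = n_fft // 2                             # 3840
chunk_size = hop_length * (segment_size - 1)  # 261120
gen_size = chunk_size - 2 * trim              # 253440
n_bins = n_fft // 2 + 1                       # 3841

# Chunking values, matching the "Step size" / "Total chunks" log lines
padded_samples = 21296640                     # "Mixture prepared with padding"
step = int(chunk_size * (1 - overlap))        # 195840
total_chunks = math.ceil(padded_samples / step)  # 109

print(trim, chunk_size, gen_size, n_bins, step, total_chunks)
```

Every printed value matches the corresponding number in the log, which suggests the chunk count scales linearly with track length at a given overlap.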
Also, when increasing segment_size I get: `WARNING - separator - Model converted from onnx to pytorch due to segment size not matching dim_t, processing may be slower.`
Gotcha; that all looks right to me - you're using ONNX Runtime 1.17.0
so I assume you compiled it from source to get CUDA 12 support?
How long is it actually taking? In my experience, on a machine with CUDA GPU, a 4 minute track takes about 20-30 seconds to process.
If you check the full logs without debug log level, there should be a couple of relevant messages, e.g. `Load model duration` and `Separation duration`.
It takes `duration: 60.22055721282959s` for an 8-minute track. I have quite a large GPU, but the model is only using 3-4 GB of it. I want to increase speed, and I have no issues with high GPU usage.
> Also, when increasing segment_size I get: `WARNING - separator - Model converted from onnx to pytorch due to segment size not matching dim_t, processing may be slower.`
Yeah that's expected if you change segment size, you can see where that comes from here:
Why are you changing the segment size? That will certainly make things slower in my experience.
Out of curiosity, have you benchmarked / compared against running the same separation using UVR GUI? Most of my separation code is either identical to that project or very closely aligned, there should be minimal differences.
> It takes `duration: 60.22055721282959s` for an 8-minute track. I have quite a large GPU, but the model is only using 3-4 GB of it. I want to increase speed, and I have no issues with high GPU usage.
Gotcha, that time is pretty normal.
If you want to make it faster, you'll need to dig into the code and work out some way to optimize it!
The work which @nnyj did in this fork may help! https://github.com/nnyj/python-audio-separator-live#benchmark-results
See https://github.com/karaokenerds/python-audio-separator/issues/3
PRs very welcome if you're able to improve performance :)
I was thinking about using ffmpeg to segment the file into 1-minute segments (variable length), then processing them and combining the results in a final stage. What do you think of this approach?
It's certainly worth trying! Is the theory that you could process those segments in parallel?
The codebase already has PyDub as a dependency: https://github.com/karaokenerds/python-audio-separator/blob/main/audio_separator/separator/separator.py#L16
Which is a wrapper around ffmpeg
and has an easy API for slicing audio into chunks:
https://github.com/jiaaro/pydub#quickstart
So you've got a bit of a headstart; though I'm not sure off the top of my head what the right approach to parallelizing would be.
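To make the split / process / rejoin idea concrete, here's a stdlib-only sketch of it. It uses the standard `wave` module and a thread pool instead of PyDub, and `fake_separate` is a placeholder standing in for a real per-chunk separation call, so this only illustrates the shape of the approach, not a working integration:

```python
import concurrent.futures
import io
import wave

def split_frames(total_frames, chunk_frames):
    """Yield (start, count) pairs covering total_frames in chunk_frames pieces."""
    start = 0
    while start < total_frames:
        yield start, min(chunk_frames, total_frames - start)
        start += chunk_frames

def fake_separate(chunk_bytes):
    # Placeholder for a per-chunk separation call; returns input unchanged.
    return chunk_bytes

# Build a 3-second silent mono 44.1 kHz WAV in memory for the demo.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 44100 * 3)
buf.seek(0)

# Slice the audio into 1-second chunks of raw frames.
with wave.open(buf, "rb") as w:
    frames = w.getnframes()
    chunks = []
    for start, count in split_frames(frames, 44100):
        w.setpos(start)
        chunks.append(w.readframes(count))

# Process chunks in parallel threads, then rejoin in original order
# (ThreadPoolExecutor.map preserves input ordering).
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as ex:
    processed = list(ex.map(fake_separate, chunks))
rejoined = b"".join(processed)
print(len(chunks), len(rejoined))
```

Note that naive hard cuts at chunk boundaries can produce audible seams after separation; overlapping the chunks slightly and crossfading at the joins is one common mitigation.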
Good luck, and feel free to email me if you want to schedule a pair programming call or any other knowledge transfer :)
Hi @beveradb,
Tried to do it in parallel! Segmented the file into 5 sections, but your app doesn't seem to handle multiprocessing, especially with one loaded model! It works with batching but not in parallel.
You'd need to modify the code... that's why this is open source, you can just fork it and work on it
To be honest, I am a newbie. If you would just let me know which file to dig into to fix this, I'll try! Thanks for your responsiveness, really appreciate it!
All of the separation logic is in the main Separator class:
Good luck! If you haven't written much Python code before you'll probably need to do quite a bit of learning in order to get to the point where you can contribute, but there's a lot of tutorials online :)
If you want to organize a pair programming call at some point, feel free to email me with a suitable date/time and I'm happy to try and help!
I was able to get it to work using `threading.local`! I updated `def separate(self, audio_file_path)` so that variables are stored locally for each thread, avoiding race conditions when running in parallel!
This cut processing time to a third for now; I'll try to optimize it and make it faster! Btw, for denoising, setting it to either true or false doesn't seem to have any impact. Is that only for me or for everyone? Or did I miss something in the call? Here is my code:

```python
separator = Separator(
    denoise_enabled=True,
    model_file_dir="./mdx",
    output_single_stem="vocals",
    output_dir="vocals",
    log_level=logging.INFO,
)
```
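The per-thread-state trick described above can be sketched like this. This is a minimal illustration of `threading.local` semantics; the `separate` function here is a hypothetical stand-in, not the library's actual method:

```python
import threading

# Each thread sees its own attributes on a threading.local instance,
# so concurrent separations don't overwrite each other's state.
_local = threading.local()

def separate(audio_file_path):
    # State stored on _local is private to the calling thread.
    _local.current_file = audio_file_path
    return f"processed:{_local.current_file}"

results = []
results_lock = threading.Lock()

def worker(path):
    out = separate(path)
    with results_lock:  # the shared results list still needs a lock
        results.append(out)

threads = [
    threading.Thread(target=worker, args=(f"part{i}.wav",)) for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

Note that thread-local storage only protects per-call variables; anything genuinely shared across threads (like a single loaded model's mutable buffers) still needs its own synchronization or duplication.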
Nice work! Good to know it works as a proof of concept!
If you could add the actual segmenting / thread management functionality to audio-separator and allow users to enable it via an initialization parameter on the class or the CLI (e.g. `--threads=4` or something), I'd welcome that PR 😄
Once everything is optimized I'll ping you for an update! Just wanted to thank you for the quick response and help :) Good work
FYI @yassinebelatar, you may want to check out the latest version of audio-separator (0.14.4 or above). There's now support for newer models and VR arch models, some of which are much faster on my machine (e.g. `2_HP-UVR.pth`), and there are more parameters exposed which let you control the speed of inferencing:
```
MDX Architecture Parameters:
  --mdx_segment_size MDX_SEGMENT_SIZE
        larger consumes more resources, but may give better results (default: 256).
        Example: --mdx_segment_size=256
  --mdx_overlap MDX_OVERLAP
        amount of overlap between prediction windows, 0.001-0.999. higher is better
        but slower (default: 0.25). Example: --mdx_overlap=0.25
  --mdx_batch_size MDX_BATCH_SIZE
        larger consumes more RAM but may process slightly faster (default: 1).
        Example: --mdx_batch_size=4
  --mdx_hop_length MDX_HOP_LENGTH
        usually called stride in neural networks, only change if you know what
        you're doing (default: 1024). Example: --mdx_hop_length=1024

VR Architecture Parameters:
  --vr_batch_size VR_BATCH_SIZE
        number of batches to process at a time. higher = more RAM, slightly faster
        processing (default: 4). Example: --vr_batch_size=16
  --vr_window_size VR_WINDOW_SIZE
        balance quality and speed. 1024 = fast but lower, 320 = slower but better
        quality (default: 512). Example: --vr_window_size=320
  --vr_aggression VR_AGGRESSION
        intensity of primary stem extraction, -100 - 100. typically 5 for vocals &
        instrumentals (default: 5). Example: --vr_aggression=2
  --vr_enable_tta
        enable Test-Time-Augmentation; slow but improves quality (default: False).
        Example: --vr_enable_tta
  --vr_high_end_process
        mirror the missing frequency range of the output (default: False).
        Example: --vr_high_end_process
  --vr_enable_post_process
        identify leftover artifacts within vocal output; may improve separation for
        some songs (default: False). Example: --vr_enable_post_process
  --vr_post_process_threshold VR_POST_PROCESS_THRESHOLD
        threshold for post_process feature: 0.1-0.3 (default: 0.2).
        Example: --vr_post_process_threshold=0.1
```
FYI @yassinebelatar there's some sample code for splitting input audio into shorter segments, separating each, and then rejoining the separated parts afterwards in this comment: https://github.com/karaokenerds/python-audio-separator/issues/44#issuecomment-1962483718
You could potentially adapt something like that to launch multiple audio-separator processes in separate threads (or perhaps even in separate Docker containers).
Also, I'd encourage you to try some of the VR arch (`.pth`) models, as I find they provide equally good results to some of the MDX models in about half the compute time. For example, `2_HP-UVR.pth` is my go-to for a simple vocal/instrumental split.
I'm going to close this issue now as I think there are several options you can explore to make more efficient use of your resources, but feel free to reply in here if you want to share your progress or get any more support with this!
I have a very large GPU (80 GB) and I want to increase speed; increasing the batch size doesn't help at all. Thanks