nomadkaraoke / python-audio-separator

Easy to use stem (e.g. instrumental/vocals) separation from CLI or as a python package, using a variety of amazing pre-trained models (primarily from UVR)
MIT License

Is it normal that 3 minute track takes 7 minutes to separate (Apple Silicon, no GPU)? #106

Closed caner-cetin closed 2 months ago

caner-cetin commented 2 months ago

First of all, thanks for this wonderful project. I cannot put into words how useful it is for me and how cleanly it extracts vocals. I do have a question, though: is it normal for a 3 minute 1 second track to take 7 minutes to separate?

(split) audio-splitter ➤ python main.py                                                                                                                                                                                
2024-08-29 14:56:43,147 - INFO - separator - Separator version 0.18.0 instantiating with output_dir: None, output_format: WAV
2024-08-29 14:56:43,147 - INFO - separator - Output directory not specified. Using current working directory.
2024-08-29 14:56:43,147 - INFO - separator - Operating System: Darwin Darwin Kernel Version 23.5.0: Wed May  1 20:19:05 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8112
2024-08-29 14:56:43,154 - INFO - separator - System: Darwin Node: caners-MacBook-Pro.local Release: 23.5.0 Machine: arm64 Proc: arm
2024-08-29 14:56:43,154 - INFO - separator - Python Version: 3.9.19
2024-08-29 14:56:43,154 - INFO - separator - PyTorch Version: 2.4.0
2024-08-29 14:56:43,236 - INFO - separator - FFmpeg installed: ffmpeg version 7.0.2 Copyright (c) 2000-2024 the FFmpeg developers
2024-08-29 14:56:43,238 - INFO - separator - ONNX Runtime CPU package installed with version: 1.19.0
2024-08-29 14:56:43,245 - INFO - separator - Apple Silicon MPS/CoreML is available in Torch and processor is ARM, setting Torch device to MPS
2024-08-29 14:56:43,245 - INFO - separator - ONNXruntime has CoreMLExecutionProvider available, enabling acceleration
2024-08-29 14:56:43,246 - INFO - separator - Loading model model_bs_roformer_ep_317_sdr_12.9755.ckpt...
2024-08-29 14:56:46,014 - INFO - mdxc_separator - MDXC Separator initialisation complete
2024-08-29 14:56:46,014 - INFO - separator - Load model duration: 00:00:02
2024-08-29 14:56:46,014 - INFO - separator - Starting separation process for audio_file_path: 2023 - Rat Wars/09. ASHAMED.mp3
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [07:15<00:00, 18.95s/it]
2024-08-29 15:04:02,702 - INFO - mdxc_separator - Saving Instrumental stem to 09. ASHAMED_(Instrumental)_model_bs_roformer_ep_317_sdr_12.wav...
2024-08-29 15:04:02,762 - INFO - mdxc_separator - Saving Vocals stem to 09. ASHAMED_(Vocals)_model_bs_roformer_ep_317_sdr_12.wav...
2024-08-29 15:04:03,005 - INFO - common_separator - Clearing input audio file paths, sources and stems...
(split) audio-splitter ➤  neofetch                                                                                                                                                                                                                               
                    'c.          canercetin@caners-MacBook-Pro.local 
                 ,xNMM.          ----------------------------------- 
               .OMMMMo           OS: macOS 14.5 23F79 arm64 
               OMMM0,            Host: Mac14,7 
     .;loddo:' loolloddol;.      Kernel: 23.5.0 
   cKMMMMMMMMMMNWMMMMMMMMMM0:    Uptime: 3 days, 21 hours, 15 mins 
 .KMMMMMMMMMMMMMMMMMMMMMMMWd.    Packages: 3 (port), 287 (brew) 
 XMMMMMMMMMMMMMMMMMMMMMMMX.      Shell: zsh 5.9 
;MMMMMMMMMMMMMMMMMMMMMMMM:       Resolution: 1680x1050 
:MMMMMMMMMMMMMMMMMMMMMMMM:       DE: Aqua 
.MMMMMMMMMMMMMMMMMMMMMMMMX.      WM: Quartz Compositor 
 kMMMMMMMMMMMMMMMMMMMMMMMMWd.    WM Theme: Blue (Dark) 
 .XMMMMMMMMMMMMMMMMMMMMMMMMMMk   Terminal: kitty 
  .XMMMMMMMMMMMMMMMMMMMMMMMMK.   CPU: Apple M2 
    kMMMMMMMMMMMMMMMMMMMMMMd     GPU: Apple M2 
     ;KMMMMMMMWXXWMMMMMMMk.      Memory: 3500MiB / 16384MiB 
       .cooc,.    .,coo:.
import os

from audio_separator.separator import Separator

# Initialize the Separator class (with optional configuration properties, below)
separator = Separator()
# Load a machine learning model (if unspecified, defaults to 'model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt')
separator.load_model("model_bs_roformer_ep_317_sdr_12.9755.ckpt")
# Perform the separation on specific audio files without reloading the model
# Walk the current directory tree and separate every file that is not a .jpg image
for root, dirs, files in os.walk("."):
    for file in files:
        if not file.endswith(".jpg"):
            separator.separate(os.path.join(root, file))

I don't know if this is related to Mac / macOS, but the process takes up my entire system's resources, which is a good thing since it means it can push the M2 to its limit. But still, is 7 minutes normal?
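To put the numbers from the log above in perspective, here is a small sketch that turns them into a real-time factor (processing time divided by track length). The durations and iteration count are copied from the log and the question; the helper name `parse_mmss` is just illustrative.

```python
def parse_mmss(text):
    """Convert an 'MM:SS' string into a number of seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


track_seconds = parse_mmss("03:01")    # the 3 minute 1 second track from the question
elapsed_seconds = parse_mmss("07:15")  # elapsed time shown by the progress bar
iterations = 23                        # iteration count shown by the progress bar

# How many seconds of processing each second of audio costs
real_time_factor = elapsed_seconds / track_seconds

# Should land close to the 18.95s/it in the log (the bar uses finer-grained timing)
seconds_per_iteration = elapsed_seconds / iterations

print(f"real-time factor: {real_time_factor:.2f}x")
print(f"seconds per iteration: {seconds_per_iteration:.2f}")
```

A real-time factor above 1 means separation is slower than playback; here it comes out at roughly 2.4.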

beveradb commented 2 months ago

Hey @caner-cetin, thanks for the kind words, glad it's useful for you!

No, that isn't normal 😅

My machine is similar to yours (MacBook Pro with M3 Max):

(audio-separator) ➜  ~ neofetch
                    'c.
                 ,xNMM.
               .OMMMMo
               OMMM0,
     .;loddo:' loolloddol;.
   cKMMMMMMMMMMNWMMMMMMMMMM0:    andrew@AndrewBeveridgeMBPM3.local
 .KMMMMMMMMMMMMMMMMMMMMMMMWd.    ---------------------------------
 XMMMMMMMMMMMMMMMMMMMMMMMX.      OS: macOS 14.5 23F79 arm64
;MMMMMMMMMMMMMMMMMMMMMMMM:       Host: Mac15,10
:MMMMMMMMMMMMMMMMMMMMMMMM:       Kernel: 23.5.0
.MMMMMMMMMMMMMMMMMMMMMMMMX.      Uptime: 2 days, 12 hours, 58 mins
 kMMMMMMMMMMMMMMMMMMMMMMMMWd.    Packages: 214 (brew)
 .XMMMMMMMMMMMMMMMMMMMMMMMMMMk   Shell: zsh 5.9
  .XMMMMMMMMMMMMMMMMMMMMMMMMK.   Resolution: 1512x982
    kMMMMMMMMMMMMMMMMMMMMMMd     DE: Aqua
     ;KMMMMMMMWXXWMMMMMMMk.      WM: Quartz Compositor
       .cooc,.    .,coo:.        WM Theme: Blue (Dark)
                                 Terminal: iTerm2
                                 Terminal Font: Monaco 12
                                 CPU: Apple M3 Max
                                 GPU: Apple M3 Max
                                 Memory: 4810MiB / 36864MiB

I made you a short screencast video demonstrating separation of a popular song (Duration: 00:03:06) on my machine: https://youtu.be/ZXZwXMDe5vM

This includes showing how I verify inference is using my GPU (using the Activity Monitor GPU History graph).
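Alongside the Activity Monitor graph, a quick check from Python can show whether the acceleration backends are visible at all. This is only a sketch: it reports what PyTorch and ONNX Runtime say is available, which is not proof that a given separation run actually used the GPU. The function name `describe_acceleration` is made up for this example.

```python
def describe_acceleration():
    """Report which acceleration backends PyTorch and ONNX Runtime can see.

    A value of None means the library is not installed (or is too old to
    expose the relevant check).
    """
    report = {"torch_mps_available": None, "onnx_providers": None}

    try:
        import torch
        # True when the Metal Performance Shaders (MPS) backend can be used
        report["torch_mps_available"] = torch.backends.mps.is_available()
    except (ImportError, AttributeError):
        pass

    try:
        import onnxruntime
        # On Apple Silicon this list may include "CoreMLExecutionProvider"
        report["onnx_providers"] = onnxruntime.get_available_providers()
    except (ImportError, AttributeError):
        pass

    return report


print(describe_acceleration())
```

If `torch_mps_available` is False or `CoreMLExecutionProvider` is missing from the provider list, the separator would presumably fall back to the CPU, which could explain slow runs.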

On my machine, separating that 3 minute track takes the following amounts of time, depending on which model I choose:

So as you can see, it's definitely not normal for it to take 7 minutes for a 3 minute track!

If you want to test with the same input file and commands as me, here's the file I used in the tests above: https://www.dropbox.com/scl/fi/k4tbc79ggzfcn509qwpji/sabrina-please-test.flac?rlkey=ufnkns7vjnhuqsic225rbzdbx&dl=0

My recommendation to you would be to:

Good luck! -Andrew

caner-cetin commented 2 months ago

Thanks for the quick response, Andrew. I bumped Python to 3.12.0, updated the library, and checked the CPU and GPU history; the GPU history graph is maxed out for the entire run.

[image]

Yet it still takes 5 more minutes to process the same FLAC file you provided, even with the same model (model_bs_roformer_ep_317_sdr_12.9755.ckpt). Maybe the 2022 M2 is significantly weaker than the M3 Max?

beveradb commented 2 months ago

That is still kinda surprising! Can you try the other models I tried, so we can compare runtimes for the other architectures? If they're all slower by a similar amount I'd be inclined to agree with you (but still surprised).
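One way to run that comparison is a small timing loop over several models. The sketch below assumes only the `Separator()`, `load_model(...)` and `separate(...)` calls already shown earlier in this thread; the helper `benchmark_models` and its `separator_factory` parameter are hypothetical names added for illustration, and the two model filenames are the ones mentioned above.

```python
import time


def benchmark_models(model_filenames, audio_path, separator_factory=None):
    """Return {model_filename: seconds spent in separate()} for one audio file."""
    if separator_factory is None:
        # Imported lazily so the helper can be tried out with a stand-in class
        from audio_separator.separator import Separator
        separator_factory = Separator

    timings = {}
    for model_filename in model_filenames:
        separator = separator_factory()
        separator.load_model(model_filename)  # load time is excluded from the timing
        start = time.perf_counter()
        separator.separate(audio_path)
        timings[model_filename] = time.perf_counter() - start
    return timings


# Example usage (needs audio-separator installed and the test file downloaded):
# models = [
#     "model_bs_roformer_ep_317_sdr_12.9755.ckpt",
#     "model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt",
# ]
# for name, seconds in benchmark_models(models, "sabrina-please-test.flac").items():
#     print(f"{name}: {seconds:.1f}s")
```

Running the same file through each model on both machines would show whether the slowdown is roughly constant across architectures or specific to one of them.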

caner-cetin commented 2 months ago

I tried them, by the way; forgot to mention that. They are significantly slower than your benchmarks: at least 1 minute slower on the fastest model, and something like 10 minutes slower for an entire album. I am surprised too. At least the awesome overall quality makes up for the speed lol.

caner-cetin commented 2 months ago

I researched this for a few days. I tried disabling Metal Performance Shaders (MPS), which was horrible (speed dropped to 18 seconds per iteration), and tried building Torch myself, but the best I could get was 12 seconds per iteration. Meanwhile, my ye olde shitbox 1650 Ti could do 5.50 seconds per iteration out of the box. I don't think there is much more to add; I will just assume Torch runs horribly on the M2 and close the issue here. Thanks again for the reply.
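For a rough sense of what those per-iteration figures mean for a whole track, the sketch below projects total time using the 23 iterations from the first log. That iteration count is an assumption carried over from one specific track and model, so treat the results as ballpark numbers only; the labels and the helper `projected_minutes` are illustrative.

```python
TRACK_ITERATIONS = 23  # assumption: iteration count from the first log in this thread

# Seconds per iteration as reported in this thread
seconds_per_iteration = {
    "M2, original run": 18.95,
    "M2, MPS disabled": 18.0,
    "M2, self-built Torch": 12.0,
    "1650 Ti": 5.5,
}


def projected_minutes(sec_per_it, iterations=TRACK_ITERATIONS):
    """Estimate total separation time in minutes for one track."""
    return sec_per_it * iterations / 60


for label, sec_per_it in seconds_per_iteration.items():
    print(f"{label}: about {projected_minutes(sec_per_it):.1f} min per track")
```

Under that assumption, 12 seconds per iteration works out to about 4.6 minutes per track, and 5.5 seconds per iteration to a little over 2 minutes.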

caner-cetin commented 1 month ago

I don't know what has changed (maybe it's due to the model?), but after the latest main update, and with the model mel_band_roformer_karaoke_aufr33_viperx_sdr_10, I am getting 5 seconds per iteration, which is a huge improvement over 18 seconds. No quality lost, everything sounds crystal clear. Again, I don't know what has changed, but I had to come here and thank you, friend.

beveradb commented 1 month ago

Glad to hear! 😄