Is it normal that 3 minute track takes 7 minutes to separate (Apple Silicon, no GPU)?

caner-cetin commented 2 months ago

First of all, thanks for this wonderful project. I cannot describe in words how useful it is for me and how cleanly it extracts the vocals. But I have a question: is it normal that a 3 minute 1 second track takes 7 minutes to separate?

(split) audio-splitter ➤ python main.py
2024-08-29 14:56:43,147 - INFO - separator - Separator version 0.18.0 instantiating with output_dir: None, output_format: WAV
2024-08-29 14:56:43,147 - INFO - separator - Output directory not specified. Using current working directory.
2024-08-29 14:56:43,147 - INFO - separator - Operating System: Darwin Darwin Kernel Version 23.5.0: Wed May  1 20:19:05 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8112
2024-08-29 14:56:43,154 - INFO - separator - System: Darwin Node: caners-MacBook-Pro.local Release: 23.5.0 Machine: arm64 Proc: arm
2024-08-29 14:56:43,154 - INFO - separator - Python Version: 3.9.19
2024-08-29 14:56:43,154 - INFO - separator - PyTorch Version: 2.4.0
2024-08-29 14:56:43,236 - INFO - separator - FFmpeg installed: ffmpeg version 7.0.2 Copyright (c) 2000-2024 the FFmpeg developers
2024-08-29 14:56:43,238 - INFO - separator - ONNX Runtime CPU package installed with version: 1.19.0
2024-08-29 14:56:43,245 - INFO - separator - Apple Silicon MPS/CoreML is available in Torch and processor is ARM, setting Torch device to MPS
2024-08-29 14:56:43,245 - INFO - separator - ONNXruntime has CoreMLExecutionProvider available, enabling acceleration
2024-08-29 14:56:43,246 - INFO - separator - Loading model model_bs_roformer_ep_317_sdr_12.9755.ckpt...
2024-08-29 14:56:46,014 - INFO - mdxc_separator - MDXC Separator initialisation complete
2024-08-29 14:56:46,014 - INFO - separator - Load model duration: 00:00:02
2024-08-29 14:56:46,014 - INFO - separator - Starting separation process for audio_file_path: 2023 - Rat Wars/09. ASHAMED.mp3
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [07:15<00:00, 18.95s/it]
2024-08-29 15:04:02,702 - INFO - mdxc_separator - Saving Instrumental stem to 09. ASHAMED_(Instrumental)_model_bs_roformer_ep_317_sdr_12.wav...
2024-08-29 15:04:02,762 - INFO - mdxc_separator - Saving Vocals stem to 09. ASHAMED_(Vocals)_model_bs_roformer_ep_317_sdr_12.wav...
2024-08-29 15:04:03,005 - INFO - common_separator - Clearing input audio file paths, sources and stems...

(split) audio-splitter ➤ neofetch
                    'c.          canercetin@caners-MacBook-Pro.local 
                 ,xNMM.          ----------------------------------- 
               .OMMMMo           OS: macOS 14.5 23F79 arm64 
               OMMM0,            Host: Mac14,7 
     .;loddo:' loolloddol;.      Kernel: 23.5.0 
   cKMMMMMMMMMMNWMMMMMMMMMM0:    Uptime: 3 days, 21 hours, 15 mins 
 .KMMMMMMMMMMMMMMMMMMMMMMMWd.    Packages: 3 (port), 287 (brew) 
 XMMMMMMMMMMMMMMMMMMMMMMMX.      Shell: zsh 5.9 
;MMMMMMMMMMMMMMMMMMMMMMMM:       Resolution: 1680x1050 
:MMMMMMMMMMMMMMMMMMMMMMMM:       DE: Aqua 
.MMMMMMMMMMMMMMMMMMMMMMMMX.      WM: Quartz Compositor 
 kMMMMMMMMMMMMMMMMMMMMMMMMWd.    WM Theme: Blue (Dark) 
 .XMMMMMMMMMMMMMMMMMMMMMMMMMMk   Terminal: kitty 
  .XMMMMMMMMMMMMMMMMMMMMMMMMK.   CPU: Apple M2 
    kMMMMMMMMMMMMMMMMMMMMMMd     GPU: Apple M2 
     ;KMMMMMMMWXXWMMMMMMMk.      Memory: 3500MiB / 16384MiB 
       .cooc,.    .,coo:.

import os

from audio_separator.separator import Separator

# Only pass audio files to the separator; a bare os.walk(".") also yields
# main.py, cover art and anything else in the tree. ".wav" is left out on
# purpose because the stems are written as WAV into the working directory.
AUDIO_EXTENSIONS = (".mp3", ".flac", ".m4a", ".ogg")

# Initialize the Separator class (with optional configuration properties, below)
separator = Separator()
# Load a machine learning model (if unspecified, defaults to 'model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt')
separator.load_model("model_bs_roformer_ep_317_sdr_12.9755.ckpt")
# Perform the separation on specific audio files without reloading the model
for root, _dirs, files in os.walk("."):
    for file in files:
        if file.lower().endswith(AUDIO_EXTENSIONS):
            separator.separate(os.path.join(root, file))

I don't know if this is related to Mac / OSX, but the process takes up my entire system's resources, which is a good thing, since it means it can use the M2 to its fullest. But still, is 7 minutes normal?
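As a quick sanity check on the numbers in the log, the progress line (23 iterations at 18.95 s/it) can be turned into a total time and compared with the track length. This is only arithmetic on the figures quoted above; the variable names are made up for illustration.

```python
# Figures copied from the tqdm progress line in the log above.
iterations = 23
seconds_per_iteration = 18.95

# Track length stated in the post: 3 minutes 1 second.
track_seconds = 3 * 60 + 1

total_seconds = iterations * seconds_per_iteration
minutes, seconds = divmod(int(total_seconds), 60)

# How many seconds of processing each second of audio costs.
slowdown = total_seconds / track_seconds

print(f"separation time: {minutes:02d}:{seconds:02d}")
print(f"processing time per second of audio: {slowdown:.2f} s")
```

The result matches the 07:15 shown by the progress bar, so the whole delay is spent in the per-iteration inference step rather than in loading the model or writing the stems.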

beveradb commented 2 months ago

Hey @caner-cetin, thanks for the kind words, glad it's useful for you!

No, that isn't normal 😅

My machine is similar to yours (MacBook Pro with M3 Max):

(audio-separator) ➜  ~ neofetch
                    'c.
                 ,xNMM.
               .OMMMMo
               OMMM0,
     .;loddo:' loolloddol;.
   cKMMMMMMMMMMNWMMMMMMMMMM0:    andrew@AndrewBeveridgeMBPM3.local
 .KMMMMMMMMMMMMMMMMMMMMMMMWd.    ---------------------------------
 XMMMMMMMMMMMMMMMMMMMMMMMX.      OS: macOS 14.5 23F79 arm64
;MMMMMMMMMMMMMMMMMMMMMMMM:       Host: Mac15,10
:MMMMMMMMMMMMMMMMMMMMMMMM:       Kernel: 23.5.0
.MMMMMMMMMMMMMMMMMMMMMMMMX.      Uptime: 2 days, 12 hours, 58 mins
 kMMMMMMMMMMMMMMMMMMMMMMMMWd.    Packages: 214 (brew)
 .XMMMMMMMMMMMMMMMMMMMMMMMMMMk   Shell: zsh 5.9
  .XMMMMMMMMMMMMMMMMMMMMMMMMK.   Resolution: 1512x982
    kMMMMMMMMMMMMMMMMMMMMMMd     DE: Aqua
     ;KMMMMMMMWXXWMMMMMMMk.      WM: Quartz Compositor
       .cooc,.    .,coo:.        WM Theme: Blue (Dark)
                                 Terminal: iTerm2
                                 Terminal Font: Monaco 12
                                 CPU: Apple M3 Max
                                 GPU: Apple M3 Max
                                 Memory: 4810MiB / 36864MiB

I made you a short screencast video demonstrating separation of a popular song (Duration: 00:03:06) on my machine: https://youtu.be/ZXZwXMDe5vM

This includes showing how I verify inference is using my GPU (using the Activity Monitor GPU History graph).

On my machine, separating that 3 minute track takes the following amounts of time, depending on which model I choose:

audio-separator -d -m 2_HP-UVR.pth test.flac: Separation duration: 00:00:19
audio-separator -d -m UVR_MDXNET_KARA_2.onnx test.flac: Separation duration: 00:00:28
audio-separator -d -m UVR-MDX-NET-Inst_HQ_4.onnx test.flac: Separation duration: 00:00:36
audio-separator -d -m model_bs_roformer_ep_317_sdr_12.9755.ckpt test.flac: Separation duration: 00:01:49
audio-separator -d -m MDX23C-8KFFT-InstVoc_HQ_2.ckpt test.flac: Separation duration: 00:02:37

So as you can see, it's definitely not normal for it to take 7 minutes for a 3 minute track!
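To make those durations easier to compare across tracks of other lengths, they can be expressed as a fraction of the track duration (00:03:06). The sketch below only re-uses the figures listed above; `to_seconds` is a hypothetical helper, not part of audio-separator.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


TRACK = to_seconds("00:03:06")  # duration of the test track

# Separation durations reported above on the M3 Max.
durations = {
    "2_HP-UVR.pth": "00:00:19",
    "UVR_MDXNET_KARA_2.onnx": "00:00:28",
    "UVR-MDX-NET-Inst_HQ_4.onnx": "00:00:36",
    "model_bs_roformer_ep_317_sdr_12.9755.ckpt": "00:01:49",
    "MDX23C-8KFFT-InstVoc_HQ_2.ckpt": "00:02:37",
}

# Ratio of separation time to track length; below 1.0 means faster than playback.
ratios = {model: to_seconds(d) / TRACK for model, d in durations.items()}
for model, ratio in ratios.items():
    print(f"{model}: {ratio:.2f}x of track length")
```

Every model in that list finishes in less time than the track takes to play, which is the baseline a 7 minute run should be judged against.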

If you want to test with the same input file and commands as me, here's the file I used in the tests above: https://www.dropbox.com/scl/fi/k4tbc79ggzfcn509qwpji/sabrina-please-test.flac?rlkey=ufnkns7vjnhuqsic225rbzdbx&dl=0

My recommendation to you would be to:

Upgrade to the latest version of audio-separator: 0.19.1
Use a newer version of Python, e.g. 3.11 (3.9 is very old now and not officially supported by this project)
Try different models with different architectures e.g. VR, MDX, MDXC, RoFormer (you can list all supported models with audio-separator -l)
While testing the different model architectures, verify whether your GPU is being used by checking Activity Monitor (see my video for an example)
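Alongside Activity Monitor, it can help to confirm from Python that PyTorch sees the Metal (MPS) backend at all. The sketch below is a standalone check using PyTorch's public `torch.backends.mps.is_available()` and `torch.cuda.is_available()` calls; `pick_torch_device` is a hypothetical helper and is not how audio-separator itself selects a device.

```python
def pick_torch_device() -> str:
    """Return the name of the accelerator PyTorch can see on this machine."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"

    # torch.backends.mps is only present in PyTorch builds with MPS support.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_torch_device())
```

On an Apple Silicon machine the expected answer is `mps`. An answer of `cpu` would suggest the installed Torch build cannot use the GPU, whatever Activity Monitor shows.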

Good luck! -Andrew

caner-cetin commented 2 months ago

Thanks for the quick response, Andrew. I bumped Python to 3.12.0, bumped the library, and checked the CPU and GPU history: the GPU history graph is maxed out for the entire run.

Yet it still takes 5 more minutes to process the same FLAC file you provided, even with the same model, model_bs_roformer_ep_317_sdr_12.9755.ckpt. Maybe the 2022 M2 is significantly weaker than the M3 Max?

beveradb commented 2 months ago

That is still kinda surprising! Can you try the other models I tried to compare runtimes for the other architectures? If they're all slower by a similar amount I'd be inclined to agree with you (but still surprised)

caner-cetin commented 2 months ago

I tried them, by the way, and forgot to mention it. They are significantly slower than your benchmarks: at least 1 minute slower on the fastest model, and something like 10 minutes slower for an entire album. I am surprised too. At least the awesome overall quality compensates for the speed lol.

caner-cetin commented 2 months ago

I researched this for a few days. I tried disabling Metal Performance Shaders (which was horrible: it slowed down to 18 seconds per iteration) and tried building Torch myself, but the best I could get was 12 seconds per iteration. Meanwhile, ye olde shitbox 1650 Ti could do 5.50 seconds per iteration out of the box. I don't think there is much more to add; I will just assume Torch runs horribly on the M2 and close this here. Thanks for the reply again.

caner-cetin commented 1 month ago

I don't know what has changed (maybe it is the model?), but after the latest main update and with the model mel_band_roformer_karaoke_aufr33_viperx_sdr_10 I am getting 5 seconds per iteration, which is a huge improvement over 18 seconds. No quality is lost and everything sounds crystal clear. Again, I don't know what has changed, but I had to come here and thank you, friend.

beveradb commented 1 month ago

Glad to hear! 😄

nomadkaraoke / python-audio-separator

Is it normal that 3 minute track takes 7 minutes to separate (Apple Silicon, no GPU)? #106