stemrollerapp / stemroller

Isolate vocals, drums, bass, and other instrumental stems from any song
https://www.stemroller.com

GPU Support #19

Closed. Arecsu closed this issue 1 year ago.

Arecsu commented 2 years ago

Hi! Great project :)

As far as I know, Demucs supports GPU acceleration. I've noticed StemRoller only uses my CPU. I have an NVIDIA RTX 2070. Would it be possible to have at least a switch within the app to enable GPU acceleration?

Thank you!

iffyloop commented 2 years ago

That's a great question. I could theoretically implement this feature, but I don't have a GPU that's recent enough to use with PyTorch (Demucs' ML backend), so I couldn't test it myself. If you'd like, I can try to do an experimental build with GPU support and you can let me know if it works for you or not! However, it might take me a few weeks since I'm currently quite busy with other projects and don't have as much time to focus on StemRoller.

Would you like me to make a GPU-compatible build sometime and send you a message when it's ready to test?

Arecsu commented 2 years ago

oh yeah sure! I'll be happy to test a GPU build when it's ready. Just message me and I'll be there.

I've been digging through the code thinking it was just a matter of changing some arguments when demucs is called, but it seems like it's more than just that.

iffyloop commented 2 years ago

Cool. Yeah, it's a little more complex than just changing StemRoller's code: the demucs-cxfreeze package needs to be refrozen with CUDA PyTorch instead of the CPU build. Shouldn't be too difficult, it just takes some time. I'll send you a new build when I get a chance! Thanks for looking into it.
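For context, once a CUDA-enabled PyTorch is frozen into the bundle, the switch between GPU and CPU mostly comes down to which device string is handed to Demucs. A minimal sketch of device selection, falling back to the CPU when no usable GPU is present (hypothetical helper for illustration, not StemRoller's actual code):

```python
def pick_device():
    """Choose Demucs' compute device: "cuda" when a CUDA-enabled PyTorch
    build with a usable GPU is present, otherwise fall back to "cpu".

    Hypothetical helper for illustration; not StemRoller's actual code.
    """
    try:
        # A CUDA-enabled PyTorch must be frozen into the bundle for this
        # to ever return "cuda"; the stock CPU build always falls through.
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"
```

The value returned here would then be passed as Demucs' `-d` argument.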

Arecsu commented 2 years ago

Hey, thank you for taking the time to be specific about it! Now I know what needs to be done. I may be able to refreeze it myself following your demucs-cxfreeze repo and bundling the proper CUDA libraries. Will take a look at it and get back soon.

Arecsu commented 2 years ago

[screenshot]

UPDATE: success!!! Super fast. Really fast. My CPU takes like 4-7 minutes to process a song, and that's an 8-core AMD 3700X @ 4.2 GHz.

And with the GPU (2070 super) it takes like 20 seconds 🐣🎉

demucs-cxfreeze

No changes to the app. Just replaced the demucs-cxfreeze 📁 folder. It's very heavy on file size though, adding over 3 GB uncompressed.

If a StemRoller release is going to be made compatible with NVIDIA GPUs, we should keep this in mind (from the demucs repo):

If you want to use GPU acceleration, you will need at least 3GB of RAM on your GPU for demucs. However, about 7GB of RAM will be required if you use the default arguments. Add --segment SEGMENT to change size of each split. If you only have 3GB memory, set SEGMENT to 8 (though quality may be worse if this argument is too small). Creating an environment variable PYTORCH_NO_CUDA_MEMORY_CACHING=1 can help users with even smaller RAM such as 2GB (I separated a track that is 4 minutes but only 1.5GB is used), but this would make the separation slower.

If you do not have enough memory on your GPU, simply add -d cpu to the command line to use the CPU. With Demucs, processing time should be roughly equal to 1.5 times the duration of the track.
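The memory guidance above maps directly onto command-line flags. A minimal sketch of assembling such an invocation, using the `-d`, `-o`, and `--segment` flags and the `PYTORCH_NO_CUDA_MEMORY_CACHING` variable quoted above (the wrapper functions themselves are hypothetical, not StemRoller's actual code):

```python
import os
import subprocess

def build_demucs_cmd(track, out_dir, device="cuda", segment=None):
    """Assemble a Demucs command line.

    Per the Demucs docs quoted above, --segment 8 keeps peak VRAM near
    3 GB at some cost in quality. Hypothetical wrapper for illustration.
    """
    cmd = ["demucs", "-d", device, "-o", out_dir]
    if segment is not None:
        cmd += ["--segment", str(segment)]
    return cmd + [track]

def run_low_vram(track, out_dir):
    """For ~2 GB GPUs: disable PyTorch's CUDA memory caching (slower)."""
    env = os.environ.copy()
    env["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
    subprocess.run(build_demucs_cmd(track, out_dir, segment=8),
                   env=env, check=True)
```

For example, `build_demucs_cmd("song.mp3", "out", device="cpu")` produces the CPU fallback invocation mentioned in the quote.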

iffyloop commented 2 years ago

Thank you so much for testing this! I merged your PR into demucs-cxfreeze and will try to upload a new build with CUDA support by the end of the weekend. Once that's done, I'll make a corresponding new version of StemRoller with the updated demucs-cxfreeze bundle.

iffyloop commented 2 years ago

Unfortunately demucs-cxfreeze with CUDA is 2.2 GB compressed to ZIP, and GitHub Releases only allows files up to 2 GB, so I'm not sure exactly how I'd be able to distribute this with StemRoller. I'll look into splitting the archive into chunks or finding another place to store it, but it may be quite a while before GPU support is available as this poses a slight logistical problem.
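One way around the 2 GB release-asset limit is to split the archive into fixed-size chunks and reassemble them at install time. A rough sketch of the idea (not the approach StemRoller ultimately shipped, which was a 7z SFX):

```python
def split_file(path, chunk_bytes):
    """Split `path` into numbered .partNNN files, each at most chunk_bytes."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            part_path = f"{path}.part{index:03d}"
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
            index += 1
    return parts

def join_files(parts, out_path):
    """Concatenate the chunks back into the original archive."""
    with open(out_path, "wb") as dst:
        for part_path in parts:
            with open(part_path, "rb") as src:
                dst.write(src.read())
```

Each part would stay under the 2 GB ceiling; an installer step would run the join before first launch.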

aleph23 commented 2 years ago

The HD hog is Torch. Maybe write a script for anyone wanting NVIDIA GPU support (and it IS worth it, it's way faster) to pull the version of Torch with CUDA support themselves. The CLI install is python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116 I know jack about Electron and how it deals with Python dependencies, but if you can offload Torch you'll save 2 GB compressed (4,502,012,008 bytes on disk).

rdeavila commented 2 years ago

Maybe you can write a how-to in the README.md for anyone interested in running StemRoller on NVIDIA (like me).

iffyloop commented 2 years ago

Thanks @aleph23 and @rdeavila for the suggestions. If you think users would be comfortable with manually installing it, I can definitely add a section to the README about how to configure it. I'll see if I can release just the GPU-specific components as sort of a "patch" that could just be unzipped into the main app directory.

iffyloop commented 1 year ago

If any of you have a PyTorch CUDA-compatible GPU and want to test out the latest update on the develop branch, please let me know if it works for you (and if not, what errors you see in the output) and how long it takes to split with GPU enabled. I don't have a recent enough GPU on my device to test it, so it'd be helpful to get confirmation from someone else. Maybe @Arecsu, since you seemed to have success doing it yourself earlier?

Arecsu commented 1 year ago

I can test it, sure. So far, I've cloned the develop branch and built it. Small bug here in ResultCard.svelte:

{#if status === 'processing'}
  <Button Icon={LoadingSpinnerIcon} text="Processing" disabled={true} />
{#if status === 'downloading'}
  <Button Icon={LoadingSpinnerIcon} text="Downloading" disabled={true} />
{:else if status === 'queued'}
  <Button Icon={CollectionIcon} text="Queued" disabled={true} />

Pull request → https://github.com/stemrollerapp/stemroller/pull/31

There are two #if. Second one should be :else if :p
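With the second #if changed to :else if, the chain would presumably read as follows (sketch of the fix; see the PR above for the actual change):

```svelte
{#if status === 'processing'}
  <Button Icon={LoadingSpinnerIcon} text="Processing" disabled={true} />
{:else if status === 'downloading'}
  <Button Icon={LoadingSpinnerIcon} text="Downloading" disabled={true} />
{:else if status === 'queued'}
  <Button Icon={CollectionIcon} text="Queued" disabled={true} />
```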

I see in your latest commits a GitHub Actions script that should build a CUDA version. At least for me locally, I don't get any automatic PyTorch download during the build process. How can I build one myself? Or is there a build available to download and test?

Edit: figured out I have to run npm run download-third-party-apps and then build it :)

Arecsu commented 1 year ago

Everything works great so far! Is it using the newest version of demucs? The whole package compressed using 7z level 9 LZMA2 is ~1.6 GB. I think it can be uploaded to GitHub Releases.

iffyloop commented 1 year ago

Glad to hear it worked! Thanks so much for testing. Yes, this uses the latest Demucs with the htdemucs_ft model, which should be the best one. You can check the console output (when running in dev mode) to make sure it's using the demucs-cxfreeze from this repo instead of a Demucs installed on your system (though that should be forced by the way the env vars are set for the child process). The goal of compressing to a 7z SFX was that it can now be uploaded to GitHub Releases, so the plan is for this to be the next release!

Arecsu commented 1 year ago

Yes. It is using htdemucs_ft indeed. Looking awesome already. Thank you so much!

iffyloop commented 1 year ago

New version is out now! Download from https://github.com/stemrollerapp/stemroller/releases