rsxdalv / tts-generation-webui

TTS Generation Web UI (Bark, MusicGen + AudioGen, Tortoise, RVC, Vocos, Demucs, SeamlessM4T, MAGNet, StyleTTS2, MMS)
https://rsxdalv.github.io/tts-generation-webui/
MIT License
1.46k stars, 160 forks

Add Stable Audio, please! #319

Open mykeehu opened 2 weeks ago

mykeehu commented 2 weeks ago

Please add Stable Audio to the options, if you please! Thank you very much in advance!

https://github.com/Stability-AI/stable-audio-tools

And model here: https://huggingface.co/stabilityai/stable-audio-open-1.0

rsxdalv commented 2 weeks ago

Hi, thanks for requesting this! I have been procrastinating with it actually. One question - such a model would require a huggingface account and a login to be used, since this https://huggingface.co/stabilityai/stable-audio-open-1.0 cannot be automatically downloaded. Would you be ok with that?

Please respond as this is a matter that could really determine whether or not people use it.

mykeehu commented 2 weeks ago

I don't have a problem downloading the model this way, maybe you could ask for the login to download it? So those who have it can use it, those who don't can't. I don't know why it's tied to a license, but I've seen a video of it making quite good sound effects, so after the login the model would be downloaded.
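The "ask for the login, then download" flow described here could look roughly like the sketch below. It is only an illustration under stated assumptions, not how the web UI does it: the helper names `resolve_token` and `download_gated_file` are hypothetical, the token is assumed to live in the `HF_TOKEN` environment variable, and the user is assumed to have already accepted the model license on the Hugging Face page. `hf_hub_download` is the real `huggingface_hub` function.

```python
import os
from typing import Mapping, Optional

# Gated repository mentioned in this thread.
REPO_ID = "stabilityai/stable-audio-open-1.0"


def resolve_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a Hugging Face token from the environment, or None if unset.

    Hypothetical helper: reading HF_TOKEN is an assumption of this sketch.
    """
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def download_gated_file(filename: str, token: Optional[str]) -> str:
    """Download one file from the gated repo and return its local path.

    Users without a token get a clear error instead of a failed download.
    """
    if token is None:
        raise RuntimeError(
            "No Hugging Face token found. Create an account, accept the "
            "model license, and set HF_TOKEN to use this model."
        )
    # Imported lazily so the rest of the app works without this dependency.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=filename, token=token)
```

Calling `download_gated_file("model_config.json", resolve_token())` would then fetch a single file; the filename here is an assumption about the repository layout. Those who have a token can use the model, and those who don't get an explanation rather than a crash.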

chlowden commented 1 week ago

I'd be interested in trying this out too, please.

ke1ne commented 1 week ago

> Hi, thanks for requesting this! I have been procrastinating with it actually. One question - such a model would require a huggingface account and a login to be used, since this https://huggingface.co/stabilityai/stable-audio-open-1.0 cannot be automatically downloaded. Would you be ok with that?
>
> Please respond as this is a matter that could really determine whether or not people use it.

For instance, I'm ok with it. Thanks!

dairydaddy commented 1 week ago

a hearty same from I


chlowden commented 1 week ago

I've already downloaded the checkpoint. I presume that those who are enjoying your interface are the sort of people who already have a huggingface account.

rsxdalv commented 3 days ago

Stable Audio has been added but is causing some problems, so it might be added and removed a few times until it's 'stable'.

rsxdalv commented 3 days ago

Also, I just want to clarify: after extensive research, Stable Audio is not a 'Stable Diffusion 1.5' moment. It has a restrictive, potentially dangerous license (which might be legally unenforceable or impossible to defend in court; it's the very same infamous SD3 license), and I saw comments saying that Facebook's AudioGen/MusicGen (notably also non-commercially licensed) perform similarly.

My biggest issue so far is that running the 'official' inference code results in ~14 GB of RAM usage, and due to memory management my system with 24 GB of RAM and 24 GB of VRAM would often just fail.

That being said, I really appreciate receiving information about what people want to try and see.

chlowden commented 3 days ago

I concur on the VRAM issue. I often saturate my RTX 3090 and its 24 GB of VRAM using MusicGen. I have not been able to test MultiBandDiffusion due to VRAM saturation. I have seen that Python will not release the VRAM it takes up, so it blocks the GPU. I have to restart the machine to free the VRAM. If Stable Audio is even worse than MusicGen, that makes it problematic for me to test.

rsxdalv commented 3 days ago

Restarting the webui should be enough. Additionally, after I fix the bugs arising from adding this new model, I can spend more time on 'unload model' buttons throughout the UI; however, there will always be some leftovers that aren't unloaded. As for Stable Audio, generating a 47-second or a 1-second clip seems to use the same amount of VRAM, unless they somehow fix that themselves. Honestly, there are multiple improvements to the model itself that are waiting to be done by somebody; perhaps they are hoping the community will do it.
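For context, an 'unload model' button usually comes down to something like the sketch below. This is a minimal illustration and not the web UI's actual code: the `ModelSlot` class and its `loader` callback are hypothetical names, and PyTorch is assumed to be the backend. Dropping the last Python reference lets the weights be garbage collected, and `torch.cuda.empty_cache()` returns PyTorch's cached blocks to the driver. The CUDA context itself stays allocated, which is one source of the leftovers mentioned above.

```python
import gc
from typing import Any, Callable, Optional


class ModelSlot:
    """Holds at most one loaded model and can drop it on request."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader  # hypothetical callback that builds the model
        self.model: Optional[Any] = None

    def get(self) -> Any:
        """Load the model on first use, then reuse the same instance."""
        if self.model is None:
            self.model = self._loader()
        return self.model

    def unload(self) -> None:
        """Drop our reference and ask PyTorch to release cached VRAM."""
        self.model = None
        gc.collect()  # collect reference cycles that may still hold tensors
        try:
            import torch  # optional here so the sketch runs without it
        except ImportError:
            return
        if torch.cuda.is_available():
            # Frees cached, unused blocks only; tensors that are still
            # referenced elsewhere and the CUDA context are not released.
            torch.cuda.empty_cache()
```

Any other reference to the model (a global, a closure in a UI callback) keeps the memory alive, so unloading only works if the slot is the single owner.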

chlowden commented 3 days ago

And as we are talking of other models ... maybe people are interested in ... Toucan TTS with 7000 languages https://github.com/DigitalPhonetics/IMS-Toucan

rsxdalv commented 2 days ago

> And as we are talking of other models ... maybe people are interested in ... Toucan TTS with 7000 languages https://github.com/DigitalPhonetics/IMS-Toucan

For this project it seems decent but could be hard to handle if it means everyone has to install espeak.