jmtatsch opened this issue 1 month ago
I am considering it; so far I've heard it's not as good as XTTS, but I haven't tried it myself yet.
IMHO it's far superior to XTTS: less robotic and more emotional. https://www.youtube.com/watch?v=Ghc8cJdQyKQ The only catch is its non-commercial license.
I see a major reason to implement support for Fish: it seems to support quantization.
I have an old GPU with 8 GB of VRAM, so every byte matters to me, and I really struggled to find any good information on how to quantize XTTS. I concluded that it's not something that can be relied upon, so seeing this PR that adds quantization support for Fish Speech makes me very interested!
PS: what's up with DeepSpeed for XTTS, by the way? I see that it takes a `pip install deepspeed`. If you can't support it in the official image, could you give me some pointers to use it on my side? XTTS is pretty slow for me, too slow for interactive use.
> I see a major reason to implement support for Fish: it seems to support quantization.
> I have an old GPU with 8 GB of VRAM, so every byte matters to me, and I really struggled to find any good information on how to quantize XTTS. I concluded that it's not something that can be relied upon, so seeing this PR that adds quantization support for Fish Speech makes me very interested!
That's a great point, thanks for that.
Re: DeepSpeed, can you start a new issue or discussion? It's worth its own space. I know it would help low-VRAM folks a lot, but it's a bit complex to set up, especially on Windows.
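In the meantime, a rough sketch of what enabling it looks like: as far as I can tell, Coqui's XTTS model class takes a `use_deepspeed` flag when loading the checkpoint. The paths and config object below are placeholders, and this needs `pip install TTS deepspeed` plus a CUDA GPU to actually run:

```python
# Hedged sketch: build the keyword arguments we'd pass to Xtts.load_checkpoint.
# Only the `use_deepspeed` flag is the point here; everything else is standard.
def xtts_load_kwargs(use_deepspeed: bool = True) -> dict:
    return {
        "eval": True,                 # inference mode, no gradients
        "use_deepspeed": use_deepspeed,  # enable the DeepSpeed inference engine
    }

# Actual usage (commented out; requires TTS + deepspeed + a GPU):
# from TTS.tts.configs.xtts_config import XttsConfig
# from TTS.tts.models.xtts import Xtts
# config = XttsConfig()
# config.load_json("/path/to/xtts/config.json")   # placeholder path
# model = Xtts.init_from_config(config)
# model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/",  # placeholder
#                       **xtts_load_kwargs())
# model.cuda()
```

No promises on speedup; that's exactly the kind of data a dedicated issue could collect.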
Hi, I took a quick look at Fish Audio again. I'm sharing this to make it easier to give it a try!
Their reference docs are at https://speech.fish.audio/, but I ended up doing my own thing:
```shell
git clone https://github.com/fishaudio/fish-speech/
cd fish-speech
```
Then create a `docker-compose.yml` with this content:
```yaml
services:
  fish-speech:
    image: fishaudio/fish-speech:latest-dev  # avoid building it
    volumes:
      - ./:/exp
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    network_mode: host  # to access their gradio
```
Then run:

```shell
docker compose up
```

and go to localhost:7860 to check out their Gradio UI.
My takeaway is that it's of very high quality, and quite fast. Hard to quantify, but I never saw it take more than 2.2 GB of VRAM, whereas XTTS often took all of my 8 GB (which might actually be a bug, come to think of it?!). Fish on my old GPU seems to take 60 s to generate 30 s of audio, but I have done zero optimization. I don't really understand how to enable quantization; there seem to be some args like `--compile` and `--half` to set up, but I don't have the time right now.
I think to go further I would need to build it from the repo and modify the entry point to run their other Python Gradio scripts; there are some related to quantization directly.
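For intuition on why a `--half` flag or int8 quantization matters so much on an 8 GB card, here's some back-of-the-envelope VRAM math for the weights alone (the parameter count is made up for illustration, not Fish Speech's actual size, and activations/KV cache add more on top):

```python
# Rough weight-memory estimate: parameters × bytes per parameter.
def weight_vram_gib(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 2**30  # bytes -> GiB

n = 500e6  # hypothetical ~0.5B-parameter model, NOT Fish Speech's real size
fp32 = weight_vram_gib(n, 4)  # full precision
fp16 = weight_vram_gib(n, 2)  # what a --half style flag would load
int8 = weight_vram_gib(n, 1)  # what int8 quantization would load
```

Each halving of precision halves the weight footprint, which is why quantization support is such a big deal for low-VRAM GPUs.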
Have you seen the new Fish Speech model (https://github.com/fishaudio/fish-speech)? Wonderful voice-cloning and intonation performance. Would you consider supporting it?