matatonic / openedai-speech

An OpenAI API compatible text to speech server using Coqui AI's xtts_v2 and/or piper tts as the backend.
GNU Affero General Public License v3.0

New Fish model #58

Open jmtatsch opened 1 month ago

jmtatsch commented 1 month ago

Have you seen the new fish speech model https://github.com/fishaudio/fish-speech ? Wonderful voice cloning and intonation performance. Would you consider supporting it?

matatonic commented 1 month ago

I am considering it, so far I've heard it's not as good as xtts, but haven't tried it myself yet.

jmtatsch commented 1 month ago

Imho it's far superior to xtts: less robotic and more emotional. https://www.youtube.com/watch?v=Ghc8cJdQyKQ The only catch is its non-commercial license.

thiswillbeyourgithub commented 1 month ago

I see a major reason to implement support for Fish: it seems to support quantization.

I have an old GPU with 8 GB of VRAM, so every byte matters to me, and I really struggled to find any good information on how to quantize XTTS. I concluded that it's not something that can be relied on, so seeing this PR that adds quantization support to Fish Speech makes me very interested!
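As a back-of-the-envelope on why quantization matters on an 8 GB card: weight memory scales linearly with bytes per parameter. A stdlib-only sketch (the parameter count below is purely illustrative, not Fish Speech's or XTTS's actual size):

```python
# Back-of-the-envelope weight memory at different precisions.
# The parameter count is purely illustrative, NOT any real model's size.
def model_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate VRAM for the weights alone (activations and buffers excluded)."""
    return n_params * bytes_per_param / 1024**3

n = 0.5e9                      # hypothetical ~0.5B-parameter TTS model
fp16 = model_size_gb(n, 2.0)   # fp16/bf16: 2 bytes per weight (~0.93 GB)
int8 = model_size_gb(n, 1.0)   # int8 quantized: half of fp16
int4 = model_size_gb(n, 0.5)   # 4-bit quantized: a quarter of fp16
```

So int8 roughly halves the weight footprint versus fp16, which is exactly the kind of headroom an 8 GB card needs.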

PS: what's up with DeepSpeed for XTTS, btw? I see that it takes a `pip install deepspeed`. If you can't support it in the official image, could you give me some pointers to use it on my side? XTTS is pretty slow for me, too slow for interactivity.
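For reference, Coqui's XTTS model API exposes a `use_deepspeed` flag when loading a checkpoint (after a `pip install deepspeed`). A minimal sketch, assuming the Coqui `TTS` package and a local checkpoint directory (the paths are placeholders, and the import is guarded so the file parses without `TTS` installed):

```python
# Sketch: loading XTTS with DeepSpeed inference kernels enabled.
# Assumes the Coqui `TTS` package; the checkpoint directory is a placeholder.
try:
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts
except ImportError:
    XttsConfig = Xtts = None  # TTS not installed; this stays a sketch

def load_xtts(checkpoint_dir: str, use_deepspeed: bool = True):
    """Load an XTTS checkpoint, optionally with DeepSpeed inference kernels."""
    if Xtts is None:
        raise RuntimeError("pip install TTS (and deepspeed) first")
    config = XttsConfig()
    config.load_json(f"{checkpoint_dir}/config.json")
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir=checkpoint_dir,
                          use_deepspeed=use_deepspeed)
    return model.cuda()
```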

matatonic commented 1 month ago

> I see a major reason to implement support for Fish: it seems to support quantization.
>
> I have an old GPU with 8G of RAM so every byte matters to me and I really struggled to find any good information on how to quantize XTTS. I conclude that it's not something that can be relied upon so seeing this PR that adds quantization support for Fish Speech makes me very interested!

That's a great point, thanks for that.

Re: DeepSpeed, can you start a new issue or discussion? It's worth its own space. I know it would help low-VRAM folks a lot, but it's a bit complex, especially on Windows.

thiswillbeyourgithub commented 1 month ago

Also to add: they apparently support `--compile` for operator fusion.
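For context, operator fusion in PyTorch projects usually goes through `torch.compile`, which captures the model graph and fuses elementwise ops into single kernels on first call. A generic illustration of the pattern, not fish-speech's actual code path (the import is guarded so the sketch parses without torch):

```python
# Generic torch.compile wrapper (illustration only, not fish-speech's code).
try:
    import torch
except ImportError:
    torch = None  # torch not installed; the wrapper becomes a no-op

def maybe_compile(model):
    """Wrap a module/function with torch.compile when available, else return it unchanged."""
    if torch is not None and hasattr(torch, "compile"):
        # First call triggers graph capture and kernel fusion; later calls reuse it.
        return torch.compile(model)
    return model
```

The first invocation pays the compilation cost, so the speedup only shows up on repeated generation.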

thiswillbeyourgithub commented 3 weeks ago

Hi, I took a quick look at Fish Audio again. I'm sharing this to make it easier to give it a try!

Their reference setup is at https://speech.fish.audio/ but I ended up doing my own thing:

git clone https://github.com/fishaudio/fish-speech/
cd fish-speech

Then create a `docker-compose.yml` with this content:

    services:
      fish-speech:
        image: fishaudio/fish-speech:latest-dev  # avoid building it
        volumes:
          - ./:/exp
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]
        network_mode: host  # to access their gradio

Run `docker compose up`, then go to `localhost:7860` to check out their Gradio UI.

My takeaway is that it's of very high quality, and quite fast. It's hard to quantify, but I never saw it take more than 2.2 GB of VRAM, whereas XTTS often took all of my 8 GB (which might actually be a bug, come to think of it?!). On my old GPU, Fish seems to take 60 s to generate 30 s of audio, but I have done zero optimization. I don't really understand how to enable quantization; there seem to be some args to set up (`--compile` and `--half`) but I don't have the time right now.
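For what it's worth, those rough numbers work out to a real-time factor of about 2x (seconds of compute per second of generated audio), i.e. not yet interactive without the optimizations mentioned above. A trivial check:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time per second of generated audio; below 1.0 is faster than real time."""
    return generation_seconds / audio_seconds

# The rough timing observed above: 60 s of compute for 30 s of audio.
rtf = real_time_factor(60, 30)  # -> 2.0
```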

I think to go further I would need to build it from the repo and modify the entry point to run their other Python Gradio scripts; some of them relate directly to quantization.