matatonic / openedai-speech

An OpenAI API compatible text to speech server using Coqui AI's xtts_v2 and/or piper tts as the backend.
GNU Affero General Public License v3.0

OpenedAI Speech

An OpenAI API compatible text to speech server.

Full Compatibility:

Details:

If you find a better voice match for tts-1 or tts-1-hd, please let me know so I can update the defaults.

Recent Changes

Version 0.18.2, 2024-08-16

Version 0.18.1, 2024-08-15

Version 0.18.0, 2024-08-15

Version 0.17.2, 2024-07-01

Version 0.17.1, 2024-07-01

Version 0.17.0, 2024-07-01

Version 0.16.0, 2024-06-29

Version 0.15.1, 2024-06-27

Version 0.15.0, 2024-06-26

Version 0.14.1, 2024-06-26

Version 0.14.0, 2024-06-26

Version 0.13.0, 2024-06-25

Version 0.12.3, 2024-06-17

Version 0.12.2, 2024-06-16

Version 0.12.0, 2024-06-16

Version 0.11.0, 2024-05-29

Version 0.10.1, 2024-05-05

Version 0.10.0, 2024-04-27

Version 0.9.0, 2024-04-23

...

Version 0.7.3, 2024-03-20

Installation instructions

Create a speech.env environment file

Copy the sample.env to speech.env (customize if needed)

cp sample.env speech.env

Defaults

TTS_HOME=voices
HF_HOME=voices
#PRELOAD_MODEL=xtts
#PRELOAD_MODEL=xtts_v2.0.2
#EXTRA_ARGS=--log-level DEBUG --unload-timer 300
#USE_ROCM=1

Option A: Manual installation

# install curl and ffmpeg
sudo apt install curl ffmpeg
# Create & activate a new virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate
# Install the Python requirements
# - use requirements-rocm.txt for AMD GPU (ROCm support)
# - use requirements-min.txt for piper only (CPU only)
pip install -U -r requirements.txt
# run the server
bash startup.sh

On first run, the voice models will be downloaded automatically. This might take a while depending on your network connection.

Option B: Docker Image (recommended)

Nvidia GPU (cuda)

docker compose up

AMD GPU (ROCm support)

docker compose -f docker-compose.rocm.yml up

ARM64 (Apple M-series, Raspberry Pi)

XTTS only has CPU support on ARM64 and will be very slow. You can either use the Nvidia image to run XTTS on CPU (slow), or use the piper only image (recommended).

CPU only, No GPU (piper only)

Use this for a minimal Docker image with only piper support (<1GB vs. 8GB):

docker compose -f docker-compose.min.yml up

Server Options

usage: speech.py [-h] [--xtts_device XTTS_DEVICE] [--preload PRELOAD] [--unload-timer UNLOAD_TIMER] [--use-deepspeed] [--no-cache-speaker] [-P PORT] [-H HOST]
                 [-L {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

OpenedAI Speech API Server

options:
  -h, --help            show this help message and exit
  --xtts_device XTTS_DEVICE
                        Set the device for the xtts model. The special value of 'none' will use piper for all models. (default: cuda)
  --preload PRELOAD     Preload a model (Ex. 'xtts' or 'xtts_v2.0.2'). By default it's loaded on first use. (default: None)
  --unload-timer UNLOAD_TIMER
                        Idle unload timer for the XTTS model in seconds, Ex. 900 for 15 minutes (default: None)
  --use-deepspeed       Use deepspeed with xtts (this option is unsupported) (default: False)
  --no-cache-speaker    Don't use the speaker wav embeddings cache (default: False)
  -P PORT, --port PORT  Server tcp port (default: 8000)
  -H HOST, --host HOST  Host to listen on, Ex. 0.0.0.0 (default: 0.0.0.0)
  -L {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the log level (default: INFO)

Sample Usage

You can use it like this:

curl http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": 1.0
  }' > speech.mp3

Or just like this:

curl -s http://localhost:8000/v1/audio/speech -H "Content-Type: application/json" -d '{
    "input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
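The same request can be sketched in Python using only the standard library. This is a minimal illustration, not part of the project: the helper names build_speech_request and save_speech are hypothetical, and the defaults simply mirror the fields of the first curl example above.

```python
import json
import urllib.request

# Default host and port from the server options; adjust if you changed them.
BASE_URL = "http://localhost:8000/v1"


def build_speech_request(text, model="tts-1", voice="alloy",
                         response_format="mp3", speed=1.0, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /audio/speech endpoint."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(text, path="speech.mp3", **kwargs):
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        with open(path, "wb") as f:
            f.write(resp.read())


# Example (requires a running server):
# save_speech("The quick brown fox jumped over the lazy dog.")
```

Splitting request construction from sending keeps the payload easy to inspect without a running server.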

Or like this example from the OpenAI Text to speech guide:

import openai

client = openai.OpenAI(
  # This part is not needed if you set these environment variables before importing openai:
  # export OPENAI_API_KEY=sk-11111111111
  # export OPENAI_BASE_URL=http://localhost:8000/v1
  api_key = "sk-111111111",
  base_url = "http://localhost:8000/v1",
)

with client.audio.speech.with_streaming_response.create(
  model="tts-1",
  voice="alloy",
  input="Today is a wonderful day to build something people love!"
) as response:
  response.stream_to_file("speech.mp3")

Also see the say.py sample application for an example of how to use the openai-python API.

# play the audio, requires 'pip install playsound'
python say.py -t "The quick brown fox jumped over the lazy dog." -p
# save to a file in flac format
python say.py -t "The quick brown fox jumped over the lazy dog." -m tts-1-hd -v onyx -f flac -o fox.flac

You can also try the included audio_reader.py for listening to longer text and streamed input.

Example usage:

python audio_reader.py -s 2 < LICENSE # read the software license - fast

OpenAI API Documentation and Guide

Custom Voices Howto

Piper

  1. Select the piper voice and model from the piper samples
  2. Update config/voice_to_speaker.yaml with a new section for the voice, for example:
    ...
    tts-1:
      ryan:
        model: voices/en_US-ryan-high.onnx
        speaker: # default speaker
  3. New models will be downloaded as needed, or you can download them in advance with download_voices_tts-1.sh. For example:
    bash download_voices_tts-1.sh en_US-ryan-high

Coqui XTTS v2

Coqui XTTS v2 voice cloning can work with as little as 6 seconds of clear audio. To create a custom voice clone, you must prepare a WAV file sample of the voice.

Guidelines for preparing good sample files for Coqui XTTS v2

You can use FFmpeg to prepare your audio files. Here are some examples:

# convert a multi-channel audio file to mono, set the sample rate to 22050 Hz, trim to 6 seconds, and output as a WAV file.
ffmpeg -i input.mp3 -ac 1 -ar 22050 -t 6 -y me.wav
# use a simple noise filter to clean up the audio, and select a start time for sampling.
ffmpeg -i input.wav -af "highpass=f=200, lowpass=f=3000" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav
# A more complex noise reduction setup, including volume adjustment
ffmpeg -i input.mkv -af "highpass=f=200, lowpass=f=3000, volume=5, afftdn=nf=25" -ac 1 -ar 22050 -ss 00:13:26.2 -t 6 -y me.wav
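If you prepare many samples, the commands above can be assembled from Python. The following is a rough sketch: build_ffmpeg_cmd and make_sample are hypothetical helper names, and the only ffmpeg flags used are the ones already shown in the examples above.

```python
import subprocess


def build_ffmpeg_cmd(src, dst="me.wav", start=None, duration=6,
                     filters=None, sample_rate=22050):
    """Return the ffmpeg argument list for a mono WAV voice sample."""
    cmd = ["ffmpeg", "-i", src]
    if filters:
        # Audio filters are joined into one -af chain, as in the examples.
        cmd += ["-af", ", ".join(filters)]
    cmd += ["-ac", "1", "-ar", str(sample_rate)]
    if start is not None:
        cmd += ["-ss", start]
    cmd += ["-t", str(duration), "-y", dst]
    return cmd


def make_sample(src, **kwargs):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(build_ffmpeg_cmd(src, **kwargs), check=True)


# Example (requires ffmpeg on PATH):
# make_sample("input.wav", start="00:13:26.2",
#             filters=["highpass=f=200", "lowpass=f=3000"])
```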

Once your WAV file is prepared, save it in the /voices/ directory and update the config/voice_to_speaker.yaml file with the new file name.

For example:

...
tts-1-hd:
  me:
    model: xtts
    speaker: voices/me.wav # this could be you

You can also use a sub folder for multiple audio samples to combine small samples or to mix different samples together.

For example:

...
tts-1-hd:
  mixed:
    model: xtts
    speaker: voices/mixed

Where the voices/mixed/ folder contains multiple wav files. The total audio length is still limited to 30 seconds.
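To check a sample folder against the 30 second limit before using it, you could total the WAV durations with Python's standard wave module. This is only a sketch: the function names are hypothetical, and how the server itself handles samples over the limit is not covered here.

```python
import wave
from pathlib import Path

MAX_SECONDS = 30.0  # total sample length limit stated above


def wav_seconds(path):
    """Duration of one WAV file in seconds (frame count / frame rate)."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def total_sample_seconds(folder):
    """Sum the durations of all *.wav files directly inside `folder`."""
    return sum(wav_seconds(p) for p in sorted(Path(folder).glob("*.wav")))


def within_limit(folder, limit=MAX_SECONDS):
    """True if the combined sample length does not exceed `limit` seconds."""
    return total_sample_seconds(folder) <= limit


# Example:
# print(total_sample_seconds("voices/mixed"), within_limit("voices/mixed"))
```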

Multilingual

Multilingual cloning support was added in version 0.11.0 and is available only with the XTTS v2 model. To use multilingual voices with piper, simply download a language-specific voice.

Coqui XTTSv2 has support for multiple languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Hungarian (hu), Korean (ko), Japanese (ja), and Hindi (hi). When not set, an attempt will be made to automatically detect the language, falling back to English (en).

Unfortunately the OpenAI API does not support language, but you can create your own custom speaker voice and set the language for that.

1) Create the WAV file for your speaker, as in Custom Voices Howto

2) Add the voice to config/voice_to_speaker.yaml and include the correct Coqui language code for the speaker. For example:

  xunjiang:
    model: xtts
    speaker: voices/xunjiang.wav
    language: zh-cn

3) Make sure your config/pre_process_map.yaml does not strip high Unicode characters! If your file contains lines like the following, you will need to remove them. For example:

Remove:

- - '[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U000024C2-\U0001F251]+'
  - ''

The lines shown above were included in the config/pre_process_map.yaml config file by default before version 0.11.0.

4) Your new multi-lingual speaker voice is ready to use!
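To see why the old pre_process_map.yaml rule from step 3 interferes with multilingual input, you can try the pattern directly in Python. This sketch assumes the rule is applied as a regular expression substitution with an empty replacement (the second item of the pair); the helper name strip_high_unicode is hypothetical. The last range in the pattern, \U000024C2-\U0001F251, spans the CJK blocks, so Chinese, Japanese, and Korean text is deleted along with emoji.

```python
import re

# Pattern copied from the pre-0.11.0 default rule shown in step 3.
HIGH_UNICODE = re.compile(
    r"[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF"
    r"\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF"
    r"\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF"
    r"\U00002702-\U000027B0\U000024C2-\U0001F251]+"
)


def strip_high_unicode(text):
    """Apply the old rule: delete every run of matching characters."""
    return HIGH_UNICODE.sub("", text)


print(repr(strip_high_unicode("Hello \U0001F600")))  # 'Hello '
print(repr(strip_high_unicode("你好, world")))        # ', world'
print(repr(strip_high_unicode("Grüße")))             # 'Grüße'
```

Latin text with accents survives because those code points sit below the pattern's ranges, which is why the problem only shows up with some languages.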

Custom Fine-Tuned Model Support

Adding a custom xtts model is simple. Here is an example of how to add a custom fine-tuned 'halo' XTTS model.

1) Save the model folder under voices/ (all 4 files are required, including the vocab.json from the model)

openedai-speech$ ls voices/halo/
config.json  vocab.json  model.pth  sample.wav

2) Add the custom voice entry under the tts-1-hd section of config/voice_to_speaker.yaml:

tts-1-hd:
...
  halo:
    model: halo # This name is required to be unique
    speaker: voices/halo/sample.wav # voice sample is required
    model_path: voices/halo

3) The model will be loaded when you access the voice for the first time (--preload doesn't work with custom models yet)

Generation Parameters

The generation of XTTSv2 voices can be fine-tuned with the following options (defaults shown below):

tts-1-hd:
  alloy:
    model: xtts
    speaker: voices/alloy.wav
    enable_text_splitting: True
    length_penalty: 1.0
    repetition_penalty: 10
    speed: 1.0
    temperature: 0.75
    top_k: 50
    top_p: 0.85
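If you maintain several voices with slightly different settings, a small helper can merge overrides onto these defaults and catch misspelled option names before they reach the config file. This is a hypothetical sketch, not part of the project: xtts_voice_entry and to_yaml_lines are illustrative names, and the output is plain text meant to be pasted under a tts-1-hd section by hand.

```python
# Defaults copied from the example above.
XTTS_DEFAULTS = {
    "enable_text_splitting": True,
    "length_penalty": 1.0,
    "repetition_penalty": 10,
    "speed": 1.0,
    "temperature": 0.75,
    "top_k": 50,
    "top_p": 0.85,
}


def xtts_voice_entry(speaker, model="xtts", **overrides):
    """Return a voice entry dict: defaults first, then any overrides."""
    unknown = set(overrides) - set(XTTS_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown generation parameter(s): {sorted(unknown)}")
    return {"model": model, "speaker": speaker, **XTTS_DEFAULTS, **overrides}


def to_yaml_lines(name, entry, indent=2):
    """Render one voice entry as indented `key: value` lines."""
    pad = " " * indent
    lines = [f"{pad}{name}:"]
    lines += [f"{pad * 2}{key}: {value}" for key, value in entry.items()]
    return lines


entry = xtts_voice_entry("voices/alloy.wav", temperature=0.6)
print("tts-1-hd:")
print("\n".join(to_yaml_lines("alloy", entry)))
```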