metavoiceio / metavoice-src

Foundational model for human-like, expressive TTS
https://themetavoice.xyz/
Apache License 2.0

MetaVoice-1B


MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:

  * Emotional speech rhythm and tone in English, with no hallucinations.
  * Zero-shot cloning for American & British voices from a short reference clip.
  * Support for (cross-lingual) voice cloning with finetuning.
  * Synthesis of arbitrary-length text.

We're releasing MetaVoice-1B under the Apache 2.0 license; it can be used without restrictions.

Quickstart - tl;dr

Web UI

```bash
docker-compose up -d ui && docker-compose ps && docker-compose logs -f
```

Server

```bash
# navigate to <URL>/docs for API definitions
docker-compose up -d server && docker-compose ps && docker-compose logs -f
```

Installation

Pre-requisites:

Environment setup

```bash
# install ffmpeg
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz.md5
md5sum -c ffmpeg-git-amd64-static.tar.xz.md5
tar xvf ffmpeg-git-amd64-static.tar.xz
sudo mv ffmpeg-git-*-static/ffprobe ffmpeg-git-*-static/ffmpeg /usr/local/bin/
rm -rf ffmpeg-git-*

# install rust if not installed (ensure you've restarted your terminal after installation)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Project dependencies installation

  1. Using poetry
  2. Using pip/conda

Using poetry (recommended)

```bash
# install poetry if not installed (ensure you've restarted your terminal after installation)
pipx install poetry

# disable any conda envs that might interfere with poetry's venv
conda deactivate

# if running from Linux, the keyring backend can hang on `poetry install`. This prevents that.
export PYTHON_KEYRING_BACKEND=keyring.backends.fail.Keyring

# pip's dependency resolver will complain; this is temporary expected behaviour
# full inference & finetuning functionality will still be available
poetry install && poetry run pip install torch==2.2.1 torchaudio==2.2.1
```

Using pip/conda

NOTE 1: When raising issues, we'll ask you to try with poetry first. NOTE 2: All commands in this README use poetry by default, so if you use pip/conda, just remove the `poetry run` prefix from commands.

```bash
pip install -r requirements.txt
pip install torch==2.2.1 torchaudio==2.2.1
pip install -e .
```

Usage

  1. Download it and use it anywhere (including locally) with our reference implementation
    
```bash
# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental faster inference. This will degrade the quality of the audio.
# Note: int8 is slower than bf16/fp16 for undebugged reasons. If you want fast, try int4, which is roughly 2x faster than bf16/fp16.
poetry run python -i fam/llm/fast_inference.py
```

Then, within the interactive Python session, call the API, e.g.:

```python
tts.synthesise(text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.", spk_ref_path="assets/bria.mp3")
```

> Note: The script takes 30-90s to start up (depending on hardware) because we `torch.compile` the model for fast inference.

> On Ampere, Ada-Lovelace, and Hopper architecture GPUs, once compiled, the `synthesise()` API runs faster than real time, with a Real-Time Factor (RTF) < 1.0.
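The RTF quoted above is simply synthesis wall-clock time divided by the duration of the audio produced; below 1.0 means faster than real time. A minimal sketch of the arithmetic (the helper is generic, not part of the repo; the commented measurement mirrors the interactive session above):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis wall-clock time / duration of audio produced.
    RTF < 1.0 means the model runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Example: 5 s of wall-clock time to produce 10 s of audio.
print(real_time_factor(5.0, 10.0))  # 0.5 -> faster than real time

# Measuring the actual model (requires the setup above):
#   import time
#   from fam.llm.fast_inference import TTS
#   tts = TTS()
#   start = time.time()
#   wav_path = tts.synthesise(text="...", spk_ref_path="assets/bria.mp3")
#   elapsed = time.time() - start  # divide by the wav's duration for the RTF
```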

2. Deploy it on any cloud (AWS/GCP/Azure), using our [inference server](serving.py) or [web UI](app.py)

```bash
# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental faster inference. This will degrade the quality of the audio.
# Note: int8 is slower than bf16/fp16 for undebugged reasons. If you want fast, try int4, which is roughly 2x faster than bf16/fp16.

# navigate to <URL>/docs for API definitions
poetry run python serving.py

# or launch the web UI
poetry run python app.py
```

3. Use it via Hugging Face
4. Google Colab Demo
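The server's routes are documented at `<URL>/docs` once it is running. As a sketch only, the client below assumes a hypothetical `/tts` POST endpoint with `text` and `speaker_ref_path` JSON fields and an assumed local port; check the live `/docs` page for the actual route names and schema before relying on it.

```python
import json
from urllib import request

def build_tts_request(base_url: str, text: str, speaker_ref_path: str) -> request.Request:
    """Build a POST request for a *hypothetical* /tts endpoint.
    The real route and field names are listed at <base_url>/docs."""
    payload = json.dumps({"text": text, "speaker_ref_path": speaker_ref_path}).encode()
    return request.Request(
        url=f"{base_url}/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Port is an assumption; use whichever port your server/docker-compose exposes.
req = build_tts_request("http://localhost:58003", "Hello from MetaVoice-1B.", "assets/bria.mp3")
print(req.full_url)  # http://localhost:58003/tts

# To actually call a running server (uncomment once the route is confirmed):
#   with request.urlopen(req) as resp:
#       audio_bytes = resp.read()
```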

Finetuning

We support finetuning the first stage LLM (see Architecture section).

To finetune, we expect a "|"-delimited CSV dataset in the following format:

```
audio_files|captions
./data/audio.wav|./data/caption.txt
```

Note that we don't perform any dataset overlap checks, so ensure that your train and val datasets are disjoint.
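Since the finetuner performs no overlap checks itself, a small pre-flight script can build the pipe-delimited CSVs and assert the splits are disjoint. Everything below (file names, helper functions) is illustrative, not part of the repo:

```python
import csv

def write_dataset(csv_path: str, rows: list[tuple[str, str]]) -> None:
    """Write a '|'-delimited CSV with the audio_files|captions header."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["audio_files", "captions"])
        writer.writerows(rows)

def assert_disjoint(train_csv: str, val_csv: str) -> None:
    """Fail loudly if any audio file appears in both splits."""
    def audio_files(path: str) -> set[str]:
        with open(path, newline="") as f:
            reader = csv.DictReader(f, delimiter="|")
            return {row["audio_files"] for row in reader}
    overlap = audio_files(train_csv) & audio_files(val_csv)
    if overlap:
        raise ValueError(f"train/val overlap: {sorted(overlap)}")

write_dataset("train.csv", [("./data/a.wav", "./data/a.txt")])
write_dataset("val.csv", [("./data/b.wav", "./data/b.txt")])
assert_disjoint("train.csv", "val.csv")  # raises nothing: splits are disjoint
print("splits are disjoint")
```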

Try it out using our sample datasets via:

```bash
poetry run finetune --train ./datasets/sample_dataset.csv --val ./datasets/sample_val_dataset.csv
```

Once you've trained your model, you can use it for inference via:

```bash
poetry run python -i fam/llm/fast_inference.py --first_stage_path ./my-finetuned_model.pt
```

Configuration

To set hyperparameters such as the learning rate or which parameters to freeze, edit the `finetune_params.py` file.

We have a light, optional integration with W&B that can be enabled by setting `wandb_log = True` and installing the extra dependencies:

```bash
poetry install -E observable
```

Upcoming

Architecture

We predict EnCodec tokens from text and speaker information. These are then diffused up to the waveform level, with post-processing applied to clean up the audio.
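Schematically, the data flow is: text + speaker reference → LLM predicts discrete EnCodec tokens → a diffusion/upsampling stage turns tokens into waveform samples. The toy sketch below only illustrates that flow; every function is a stand-in, not the repo's code, and the frame-rate constant is an illustrative assumption (EnCodec's 24 kHz model uses a fixed frame rate):

```python
import random

TOKENS_PER_SECOND = 75   # illustrative assumption for EnCodec's frame rate
SAMPLE_RATE = 24_000     # EnCodec operates on 24 kHz audio

def predict_encodec_tokens(text: str, speaker_ref: str, seconds: float) -> list[int]:
    """Stage 1 stand-in: the LLM autoregressively predicts discrete audio tokens."""
    random.seed(hash((text, speaker_ref)) & 0xFFFF)
    return [random.randrange(1024) for _ in range(int(seconds * TOKENS_PER_SECOND))]

def diffuse_to_waveform(tokens: list[int]) -> list[float]:
    """Stage 2 stand-in: upsample discrete tokens to waveform samples."""
    samples_per_token = SAMPLE_RATE // TOKENS_PER_SECOND
    return [t / 1024.0 for t in tokens for _ in range(samples_per_token)]

tokens = predict_encodec_tokens("hello", "assets/bria.mp3", seconds=2.0)
wave = diffuse_to_waveform(tokens)
print(len(tokens), len(wave))  # 150 tokens -> 48000 samples (2 s at 24 kHz)
```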

Optimizations

The model supports:

  1. KV-caching via Flash Decoding
  2. Batching (including texts of different lengths)
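KV-caching stores each decoding step's attention keys and values so that generating a new token only appends one entry and attends over the cached prefix, instead of re-encoding the whole sequence; Flash Decoding additionally parallelises attention across the cache's length. A toy, dict-free sketch of the caching idea (no relation to the repo's actual implementation):

```python
class KVCache:
    """Toy per-sequence key/value cache for autoregressive decoding."""
    def __init__(self) -> None:
        self.keys: list[list[float]] = []
        self.values: list[list[float]] = []

    def append(self, k: list[float], v: list[float]) -> None:
        self.keys.append(k)
        self.values.append(v)

    def __len__(self) -> int:
        return len(self.keys)

def decode_step(cache: KVCache, new_k: list[float], new_v: list[float]) -> int:
    """Each step appends one K/V pair instead of recomputing the whole
    prefix, so per-step cost stays flat as the sequence grows."""
    cache.append(new_k, new_v)
    return len(cache)  # attention now spans the full cached prefix

cache = KVCache()
for step in range(5):
    span = decode_step(cache, [float(step)], [float(step)])
print(span, len(cache))  # 5 5
```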

Contribute

Acknowledgements

We are grateful to Together.ai for their 24/7 help in marshalling our cluster. We thank the teams of AWS, GCP & Hugging Face for support with their cloud platforms.

Apologies in advance if we've missed anyone out. Please let us know if we have.