MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:

* **Emotional speech rhythm and tone** in English.
* **Zero-shot cloning for American & British voices**, with 30s reference audio.
* Support for (cross-lingual) **voice cloning with finetuning**; we have had success with as little as 1 minute of training data for Indian speakers.
* Support for **long-form synthesis**.
We’re releasing MetaVoice-1B under the Apache 2.0 license, so it can be used without restrictions.
### Web UI

```bash
docker-compose up -d ui && docker-compose ps && docker-compose logs -f
```
### Server

```bash
# navigate to <URL>/docs for API definitions
docker-compose up -d server && docker-compose ps && docker-compose logs -f
```
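Once the server is up, you can exercise the API from Python. The sketch below is illustrative only: the port, endpoint path, and payload field names are our assumptions, not confirmed by this README; check `<URL>/docs` on your running instance for the actual definitions.

```python
import requests

# Hypothetical request; endpoint path, port, and field names are assumptions.
# Check <URL>/docs on the running server for the real API schema.
resp = requests.post(
    "http://localhost:58003/tts",  # port is an assumption; see `docker-compose ps`
    json={
        "text": "This is a demo of MetaVoice-1B.",
        "speaker_ref_path": "assets/bria.mp3",  # assumed field name
    },
    timeout=300,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)  # assumes the server returns raw audio bytes
```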
Pre-requisites: Python >=3.10,<3.12, a GPU with >=12GB VRAM, and pipx (for poetry).
### Environment setup
```bash
# install ffmpeg
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz.md5
md5sum -c ffmpeg-git-amd64-static.tar.xz.md5
tar xvf ffmpeg-git-amd64-static.tar.xz
sudo mv ffmpeg-git-*-static/ffprobe ffmpeg-git-*-static/ffmpeg /usr/local/bin/
rm -rf ffmpeg-git-*

# install rust if not installed (ensure you've restarted your terminal after installation)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# install poetry if not installed (ensure you've restarted your terminal after installation)
pipx install poetry

# disable any conda envs that might interfere with poetry's venv
conda deactivate

# if running on Linux, the keyring backend can hang on `poetry install`. This prevents that.
export PYTHON_KEYRING_BACKEND=keyring.backends.fail.Keyring

# pip's dependency resolver will complain; this is temporary, expected behaviour.
# Full inference & finetuning functionality will still be available.
poetry install && poetry run pip install torch==2.2.1 torchaudio==2.2.1
```
> **Note 1:** When raising issues, we'll ask you to try with poetry first.
>
> **Note 2:** All commands in this README use poetry by default. If you prefer pip in a venv instead, install as follows and simply drop the `poetry run` prefix from any command:
```bash
pip install -r requirements.txt
pip install torch==2.2.1 torchaudio==2.2.1
pip install -e .
```
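After either install path, it's worth a quick sanity check that the pinned torch build can actually see your GPU (prefix the command with `poetry run` if you used poetry):

```python
import torch

# Confirm the pinned versions landed and CUDA is visible.
print(torch.__version__)          # expected: 2.2.1 (possibly with a +cu suffix)
print(torch.cuda.is_available())  # should print True on a correctly set-up GPU box
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```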
```bash
# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental, faster inference. This will degrade audio quality.
# Note: int8 is currently slower than bf16/fp16 for reasons we haven't yet debugged. If you want speed, try int4, which is roughly 2x faster than bf16/fp16.
poetry run python -i fam/llm/fast_inference.py

# then, inside the interactive Python session:
tts.synthesise(text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.", spk_ref_path="assets/bria.mp3")
```
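If you'd rather call the model from your own script than from the interactive session, a minimal sketch looks like the following. We're assuming here that `fam/llm/fast_inference.py` exposes the `TTS` class that the interactive session instantiates as `tts`; check the file in your checkout for the exact constructor arguments.

```python
from fam.llm.fast_inference import TTS  # assumed export; see fam/llm/fast_inference.py

tts = TTS()  # loads and torch.compiles the model; expect a 30-90s startup
wav_path = tts.synthesise(
    text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.",
    spk_ref_path="assets/bria.mp3",  # ~30s clip of the voice to clone
)
print(f"audio written to {wav_path}")  # assumes synthesise() returns an output path
```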
> Note: The `fast_inference.py` script takes 30-90s to start up (depending on hardware) because we `torch.compile` the model for fast inference.
>
> On Ampere, Ada-Lovelace, and Hopper architecture GPUs, once compiled, the `synthesise()` API runs faster than real time, with a Real-Time Factor (RTF) < 1.0.
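To check the RTF claim on your own hardware, you can time a synthesis call after a warm-up run (so compile time isn't counted). This uses the same assumed `TTS` entry point as the sketch above, plus `torchaudio.info` to get the output duration.

```python
import time

import torchaudio
from fam.llm.fast_inference import TTS  # assumed export, as above

tts = TTS()
tts.synthesise(text="Warm-up call so compile time is excluded.", spk_ref_path="assets/bria.mp3")

start = time.perf_counter()
wav_path = tts.synthesise(
    text="A longer sentence gives a more stable real-time factor estimate.",
    spk_ref_path="assets/bria.mp3",
)
elapsed = time.perf_counter() - start

meta = torchaudio.info(wav_path)
audio_secs = meta.num_frames / meta.sample_rate
print(f"RTF = {elapsed / audio_secs:.2f} (values < 1.0 are faster than real time)")
```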
You can also deploy the model on any cloud (AWS/GCP/Azure) using our [inference server](serving.py) or [web UI](app.py):
```bash
# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental, faster inference. This will degrade audio quality.
# Note: int8 is currently slower than bf16/fp16 for reasons we haven't yet debugged. If you want speed, try int4, which is roughly 2x faster than bf16/fp16.

# inference server; navigate to <URL>/docs for API definitions
poetry run python serving.py

# web UI
poetry run python app.py
```
We support finetuning the first-stage LLM (see the Architecture section).
In order to finetune, we expect a "|"-delimited CSV dataset of the following format:
```
audio_files|captions
./data/audio.wav|./data/caption.txt
```
Note that we don't perform any dataset overlap checks, so ensure that your train and val datasets are disjoint.
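Since no overlap checks are performed, a small helper like the one below (ours, not part of the repo) can both write the manifest in the expected format and assert that the splits are disjoint:

```python
import csv

def write_manifest(rows, out_path):
    """Write (audio_path, caption_path) pairs as a '|'-delimited CSV."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["audio_files", "captions"])
        writer.writerows(rows)

train = [("./data/train/a.wav", "./data/train/a.txt")]
val = [("./data/val/b.wav", "./data/val/b.txt")]

# The finetuning code does no overlap checking, so enforce disjoint splits here.
overlap = {audio for audio, _ in train} & {audio for audio, _ in val}
assert not overlap, f"train/val share audio files: {overlap}"

write_manifest(train, "./datasets/train.csv")
write_manifest(val, "./datasets/val.csv")
```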
Try it out using our sample datasets via:
```bash
poetry run finetune --train ./datasets/sample_dataset.csv --val ./datasets/sample_val_dataset.csv
```
Once you've trained your model, you can use it for inference via:
```bash
poetry run python -i fam/llm/fast_inference.py --first_stage_path ./my-finetuned_model.pt
```
To set hyperparameters such as the learning rate and which parts of the model to freeze, edit the `finetune_params.py` file.
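For orientation, the kind of knobs you'd expect to find there look roughly like the sketch below; the names and defaults are illustrative assumptions on our part, so open `finetune_params.py` in your checkout for the real ones.

```python
# Illustrative only: parameter names and values below are assumptions,
# not the actual contents of finetune_params.py.
learning_rate = 3e-5       # assumed name: optimiser step size
batch_size = 2             # assumed name: per-device batch size
freeze_embeddings = True   # assumed name: freeze token embeddings, tune the rest
wandb_log = False          # real flag mentioned below; enables W&B logging
```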
We've got a light, optional integration with W&B. Enable it by setting `wandb_log = True` and installing the appropriate dependencies:

```bash
poetry install -E observable
```
We predict EnCodec tokens from text and speaker information. These are then diffused up to the waveform level, with post-processing applied to clean up the audio.
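As a mental model of the pipeline just described, pseudocode for a full synthesis pass might read as follows. Every function name here is ours, for illustration; none of them are actual repo APIs.

```python
# Pseudocode sketch of the two-stage pipeline; all names are illustrative.
def synthesise(text, speaker_ref_audio):
    # Speaker conditioning is derived from the reference clip.
    speaker_emb = speaker_embedding_network(speaker_ref_audio)

    # Stage 1: the (finetunable) first-stage LLM predicts EnCodec tokens,
    # conditioned on the text and the speaker embedding.
    encodec_tokens = first_stage_llm(text, speaker_emb)

    # Stage 2: diffuse the tokens up to the waveform level.
    waveform = diffuse_to_waveform(encodec_tokens)

    # Post-processing cleans up artifacts introduced by diffusion.
    return post_process(waveform)
```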
The model supports:

1. KV-caching via Flash Decoding
2. Batching (including texts of different lengths)
We are grateful to Together.ai for their 24/7 help in marshalling our cluster. We thank the teams at AWS, GCP, and Hugging Face for support with their cloud platforms.
Apologies in advance if we've missed anyone out. Please let us know if we have.