open-mmlab / Amphion

https://openhlt.github.io/amphion/
MIT License

# Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit


Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.

The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to text to speech (TTS), singing voice conversion (SVC), and text to audio (TTA).

In addition to the specific generation tasks, Amphion includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent comparisons across generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building large-scale datasets for speech synthesis.
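To make the role of an objective metric concrete, here is a minimal sketch of F0 RMSE, a metric commonly used to compare the pitch contour of generated singing or speech against a reference. It is not Amphion's implementation: the function name `f0_rmse` and the convention that unvoiced frames carry an F0 of 0 are assumptions for illustration.

```python
import math


def f0_rmse(reference_f0, generated_f0):
    """RMSE in Hz over frames that both contours mark as voiced (F0 > 0).

    Assumption: each contour is a sequence of per-frame F0 values in Hz,
    with 0 marking an unvoiced frame. Extra frames in the longer contour
    are ignored.
    """
    pairs = [
        (r, g)
        for r, g in zip(reference_f0, generated_f0)
        if r > 0 and g > 0
    ]
    if not pairs:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum((r - g) ** 2 for r, g in pairs) / len(pairs))


if __name__ == "__main__":
    # Toy contours; real ones would come from a pitch extractor.
    reference = [0.0, 220.0, 221.0, 223.0, 0.0]
    generated = [0.0, 218.0, 221.0, 227.0, 110.0]
    print(f"{f0_rmse(reference, generated):.2f}")  # 2.58
```

Only frames voiced in both contours are scored, so a voicing mistake (the last frame above) does not inflate the pitch error; it would need a separate voicing metric.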

## 🚀 News

## ⭐ Key Features

### TTS: Text to Speech

### SVC: Singing Voice Conversion

### TTA: Text to Audio

### Vocoder

### Evaluation

Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics contain:

### Datasets

### Visualization

Amphion provides visualization tools that interactively illustrate the internal processing mechanisms of classic models. These tools are a valuable resource for teaching and for making research easier to understand.

Currently, Amphion supports SingVisio, a visualization tool for the diffusion model used in singing voice conversion.

## 📀 Installation

Amphion can be installed through either the Setup Installer or the Docker Image.

### Setup Installer

```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion

# Install Python Packages Dependencies
sh env.sh
```
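After the steps above, a quick sanity check of the environment can save debugging time later. The sketch below is one possible check, not part of Amphion: the helper name `check_environment` and the default package names `torch` and `torchaudio` are assumptions, so replace them with whatever `env.sh` installs in your checkout.

```python
import importlib.util
import sys


def check_environment(required_python=(3, 9), packages=("torch", "torchaudio")):
    """Return a list of problems; an empty list means every check passed.

    Assumption: `packages` holds top-level import names. The defaults are
    illustrative and are not read from env.sh.
    """
    problems = []
    found = tuple(sys.version_info[:2])
    if found != tuple(required_python):
        problems.append(
            f"Python {required_python[0]}.{required_python[1]} expected, "
            f"found {found[0]}.{found[1]}"
        )
    for name in packages:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not importable: {name}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print(issue)
    print("environment looks OK" if not issues else f"{len(issues)} problem(s) found")
```

Run it inside the activated `amphion` conda environment so that the interpreter being checked is the one the recipes will use.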

### Docker Image

1. Install Docker, NVIDIA Driver, NVIDIA Container Toolkit, and CUDA.

2. Run the following commands:

```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

docker pull realamphion/amphion
docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion
```

Mounting the dataset with the `-v` argument is necessary when using Docker. Please refer to [Mount dataset in Docker container](egs/datasets/docker.md) and [Docker Docs](https://docs.docker.com/engine/reference/commandline/container_run/#volume) for more details.
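As a hedged sketch of the mount step, the Python helper below assembles a `docker run` command that adds a second `-v` mount for a dataset directory alongside the `.:/app` mount used above. The helper name, the container path `/mnt/data`, and the host path in the usage line are assumptions for illustration; the linked recipe remains the authoritative reference.

```python
import os
import shlex


def docker_run_command(dataset_dir, container_dir="/mnt/data",
                       image="realamphion/amphion"):
    """Build a `docker run` argument list with an extra dataset mount.

    Assumption: `container_dir` is a hypothetical mount point; choose the
    path that your recipe configuration expects inside the container.
    """
    host_dir = os.path.abspath(os.path.expanduser(dataset_dir))
    return [
        "docker", "run", "--runtime=nvidia", "--gpus", "all", "-it",
        "-v", ".:/app",                       # repository, as in the command above
        "-v", f"{host_dir}:{container_dir}",  # dataset, host_path:container_path
        image,
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(docker_run_command("~/datasets/my_corpus")))
```

Each `-v` value has the form `host_path:container_path`, so the dataset appears inside the container at the second path regardless of where it lives on the host.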

## 🐍 Usage in Python

Instructions for each task are detailed in the following recipes:

- [Text to Speech (TTS)](egs/tts/README.md)
- [Singing Voice Conversion (SVC)](egs/svc/README.md)
- [Text to Audio (TTA)](egs/tta/README.md)
- [Vocoder](egs/vocoder/README.md)
- [Evaluation](egs/metrics/README.md)
- [Visualization](egs/visualization/README.md)

## 👨‍💻 Contributing
We appreciate all contributions to improve Amphion. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guidelines.

## 🙏 Acknowledgement

- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) and [jaywalnut310's VITS](https://github.com/jaywalnut310/vits) for model architecture code.
- [lifeiteng's VALL-E](https://github.com/lifeiteng/vall-e) for training pipeline and model architecture design.
- [SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer) for semantic-distilled tokenizer design.
- [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), [ContentVec](https://github.com/auspicious3000/contentvec), and [RawNet3](https://github.com/Jungjee/RawNet) for pretrained models and inference code.
- [HiFi-GAN](https://github.com/jik876/hifi-gan) for GAN-based Vocoder's architecture design and training strategy.
- [Encodec](https://github.com/facebookresearch/encodec) for well-organized GAN Discriminator's architecture and basic blocks.
- [Latent Diffusion](https://github.com/CompVis/latent-diffusion) for model architecture design.
- [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) for preparing the MFA tools.

## ©️ License

Amphion is under the [MIT License](LICENSE). It is free for both research and commercial use cases.

## 📚 Citations

```bibtex
@inproceedings{amphion,
    author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
    title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
    booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
    year={2024}
}
```