Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
✨ Description
This PR introduces multi-speaker support for the current VITS model. It allows speech to be synthesized in multiple voices and lets users choose the specific speaker's voice that suits their preferences. To test this PR, follow the guidelines in the latest egs/tts/VITS/README.md.
🚧 Related Issues
None
👨💻 Changes Proposed
[1] Enabled multi-speaker VITS support:
Updated egs/tts/VITS/run.sh, exp_config.json, and README.md with the arguments and instructions needed to enable multi-speaker training and inference in VITS.
Added an intersperse function to utils/data_utils.py, which inserts blanks (0) between consecutive phone IDs to regulate speaking speed.
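For reference, a minimal sketch of what such an intersperse helper typically looks like in VITS-style pipelines (the exact implementation in utils/data_utils.py may differ):

```python
def intersperse(sequence, item=0):
    # Build a list of length 2 * len(sequence) + 1 filled with `item`,
    # then drop the original elements into the odd indices, so `item`
    # ends up before, between, and after every phone ID.
    result = [item] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result


# Example: [5, 12, 7] -> [0, 5, 0, 12, 0, 7, 0]
print(intersperse([5, 12, 7], 0))
```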
[2] Streamlined Hi-Fi TTS dataset preprocessing:
Documented the Hi-Fi TTS dataset structure in egs/datasets/README.md
Updated preprocessors/processor.py to accommodate the Hi-Fi TTS preprocessor
[3] Changes to the VITS dataset loader:
Added a metadata filter in models/tts/vits/vits_dataset.py to exclude very short segments, i.e., those where frame_len < self.cfg.preprocess.segment_size // self.cfg.preprocess.hop_size (see the sketch after this list).
Moved the declaration of processed_data_dir in class VITSTestDataset(TTSTestDataset) (models/tts/vits/vits_dataset.py) out of the if branch, since it was referenced inside elif cfg.preprocess.use_phone: (line 88, latest) without a prior declaration.
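For illustration, a minimal sketch of the filtering rule described in the first item above; the function name and metadata field names are assumptions, not the exact code in vits_dataset.py:

```python
def filter_short_utterances(metadata, segment_size, hop_size):
    """Drop utterances whose frame length is shorter than one training segment.

    `metadata` is assumed to be a list of dicts carrying a "frame_len" field;
    the field and function names here are illustrative only.
    """
    min_frames = segment_size // hop_size
    return [utt for utt in metadata if utt["frame_len"] >= min_frames]


# Example: with segment_size=8192 and hop_size=256, anything shorter than
# 32 frames is excluded, so the second utterance below is dropped.
metadata = [
    {"uid": "spk1_0001", "frame_len": 512},
    {"uid": "spk1_0002", "frame_len": 12},
]
print(filter_short_utterances(metadata, segment_size=8192, hop_size=256))
```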
[4] Enhanced model compatibility across accelerate versions
The Hi-Fi TTS VITS checkpoint was trained with accelerate v0.25, so the resulting model file is model.safetensors rather than pytorch_model.bin. To let users load this checkpoint successfully, models/tts/base/tts_inferece.py is modified to add an alternative way of loading the model when the user's accelerate version is < 0.25.
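A minimal sketch of the kind of loading fallback described above, using a hypothetical helper name; the actual logic in models/tts/base/tts_inferece.py may differ:

```python
import os

import torch


def load_checkpoint_weights(model, checkpoint_dir):
    """Load weights from model.safetensors if present, else pytorch_model.bin.

    Illustrative sketch only; not the exact code path in tts_inferece.py.
    """
    safetensors_path = os.path.join(checkpoint_dir, "model.safetensors")
    if os.path.isfile(safetensors_path):
        # Checkpoints saved with newer accelerate versions default to safetensors.
        from safetensors.torch import load_file

        state_dict = load_file(safetensors_path)
    else:
        # Fall back to the classic PyTorch binary checkpoint.
        bin_path = os.path.join(checkpoint_dir, "pytorch_model.bin")
        state_dict = torch.load(bin_path, map_location="cpu")

    model.load_state_dict(state_dict)
    return model
```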
[5] Applied Black formatting
🧑🤝🧑 Who Can Review?
@lmxue @RMSnow
🛠 TODO
Test multi-speaker VITS pipeline (preprocessing->feature extraction->training->resume training->inference for single and batch) on Hi-Fi TTS (Done)
Test single-speaker VITS pipeline (preprocessing->feature extraction->training->resume training->inference for single and batch) on LJSpeech (Done)
✅ Checklist
[ ] Code has been reviewed
[ ] Code complies with the project's code standards and best practices
[ ] Code has passed all tests
[ ] Code does not affect the normal use of existing features
[ ] Code has been commented properly
[ ] Documentation has been updated (if applicable)
[ ] Demo/checkpoint has been attached (if applicable)