Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
✨ Description
This PR introduces multi-speaker support for the current VITS model. It allows speech to be synthesized in multiple voices and lets users choose the specific speaker's voice that suits their preferences. To test this PR, follow the guidelines in the latest egs/tts/VITS/README.md.
🚧 Related Issues
None
👨💻 Changes Proposed
[1] Enabled multi-speaker VITS support:
Updated egs/tts/VITS/run.sh, exp_config.json, and README.md with the arguments and instructions needed to enable multi-speaker training and inference in VITS.
Added an intersperse function to utils/data_utils.py, which inserts blanks (0) between consecutive phone IDs to regulate speaking speed.
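For reference, a minimal sketch of what such an intersperse helper typically looks like in VITS-style pipelines (the exact implementation in utils/data_utils.py may differ):

```python
def intersperse(sequence, item=0):
    # Build a list of length 2 * len(sequence) + 1 filled with `item`,
    # then drop the original elements into the odd indices, so `item`
    # ends up before, between, and after every phone ID.
    result = [item] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result


# Example: [5, 12, 7] -> [0, 5, 0, 12, 0, 7, 0]
print(intersperse([5, 12, 7], 0))
```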
[2] Streamlined Hi-Fi TTS dataset preprocessing:
Documented the Hi-Fi TTS dataset structure in egs/datasets/README.md
Updated preprocessors/processor.py to accommodate the Hi-Fi TTS preprocessor
[3] Changes to the VITS dataset loader:
Added a metadata filter in models/tts/vits/vits_dataset.py to exclude very short segments, i.e., those where frame_len < self.cfg.preprocess.segment_size // self.cfg.preprocess.hop_size (see the sketch after this list).
Moved the declaration of processed_data_dir in class VITSTestDataset(TTSTestDataset) (models/tts/vits/vits_dataset.py) out of the if branch, since it was referenced inside elif cfg.preprocess.use_phone: (line 88, latest) without a prior declaration.
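For illustration, a minimal sketch of the filtering rule described in the first item above; the function name and metadata field names are assumptions, not the exact code in vits_dataset.py:

```python
def filter_short_utterances(metadata, segment_size, hop_size):
    """Drop utterances whose frame length is shorter than one training segment.

    `metadata` is assumed to be a list of dicts carrying a "frame_len" field;
    the field and function names here are illustrative only.
    """
    min_frames = segment_size // hop_size
    return [utt for utt in metadata if utt["frame_len"] >= min_frames]


# Example: with segment_size=8192 and hop_size=256, anything shorter than
# 32 frames is excluded, so the second utterance below is dropped.
metadata = [
    {"uid": "spk1_0001", "frame_len": 512},
    {"uid": "spk1_0002", "frame_len": 12},
]
print(filter_short_utterances(metadata, segment_size=8192, hop_size=256))
```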
[4] Enhanced model compatibility across accelerate versions
The Hi-Fi TTS VITS checkpoint was trained with accelerate v0.25, so the resulting model file is model.safetensors rather than pytorch_model.bin. To let users load this checkpoint successfully, models/tts/base/tts_inferece.py is modified to add an alternative way of loading the model when the user's accelerate version is < 0.25.
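A minimal sketch of the kind of loading fallback described above, using a hypothetical helper name; the actual logic in models/tts/base/tts_inferece.py may differ:

```python
import os

import torch


def load_checkpoint_weights(model, checkpoint_dir):
    """Load weights from model.safetensors if present, else pytorch_model.bin.

    Illustrative sketch only; not the exact code path in tts_inferece.py.
    """
    safetensors_path = os.path.join(checkpoint_dir, "model.safetensors")
    if os.path.isfile(safetensors_path):
        # Checkpoints saved with newer accelerate versions default to safetensors.
        from safetensors.torch import load_file

        state_dict = load_file(safetensors_path)
    else:
        # Fall back to the classic PyTorch binary checkpoint.
        bin_path = os.path.join(checkpoint_dir, "pytorch_model.bin")
        state_dict = torch.load(bin_path, map_location="cpu")

    model.load_state_dict(state_dict)
    return model
```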
[5] Applied Black formatting
🧑🤝🧑 Who Can Review?
@lmxue @RMSnow
🛠 TODO
Test multi-speaker VITS pipeline (preprocessing->feature extraction->training->resume training->inference for single and batch) on Hi-Fi TTS (Done)
Test single-speaker VITS pipeline (preprocessing->feature extraction->training->resume training->inference for single and batch) on LJSpeech (Done)
✅ Checklist
[ ] Code has been reviewed
[ ] Code complies with the project's code standards and best practices
[ ] Code has passed all tests
[ ] Code does not affect the normal use of existing features
[ ] Code has been commented properly
[ ] Documentation has been updated (if applicable)
[ ] Demo/checkpoint has been attached (if applicable)