open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License
4.5k stars, 388 forks

When will the pre-trained weights of VALL-E be released? #21

Closed faceair closed 9 months ago

faceair commented 10 months ago

How much data was used for pre-training, and how much of it is in Chinese? Thank you very much.

lmxue commented 10 months ago

Thanks for your interest. The Vall-E checkpoint will be released soon, and the dataset information will be included with it.

zhizhengwu commented 10 months ago

@lmxue @HeCheng0625 Please post the links to checkpoints in this thread when they are ready.

lmxue commented 10 months ago

> How much data was involved in the pre-training, and how much of it is in Chinese? Thank you very much.

Thanks for your comments. The pre-trained Amphion Vall-E model, trained on LibriTTS, has been released here: https://huggingface.co/amphion/valle-libritts

You are welcome to test it and share any feedback.

dongngm commented 9 months ago

@lmxue I gave it a try and found that the quality of the generated audio is not very good. Is this level of quality expected from pre-training on a relatively small dataset like LibriTTS?

sh egs/tts/VALLE/run.sh --stage 3 --gpu "0" \
    --config "ckpts/tts/valle_libritts/args.json" \
    --infer_expt_dir Amphion/ckpts/tts/valle_libritts \
    --infer_output_dir Amphion/ckpts/tts/valle_libritts/result \
    --infer_mode "single" \
    --infer_text "This is a clip of generated speech with the given text from a text to speech model" \
    --infer_text_prompt "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition" \
    --infer_audio_prompt ./LJSpeech-1.1/wavs/LJ001-0001.wav

https://drive.google.com/file/d/1xTb6WURcckDbV20TpsgyVRKljM9hj8kK/view?usp=sharing
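
For anyone who wants to repeat this test from a script, the inference command above could be assembled as an argument list, which avoids shell quoting and line-continuation mistakes with the long prompts. This is only a sketch: the flags and paths are copied from the command in this comment, while the helper name `build_valle_infer_cmd` and its parameters are hypothetical and not part of Amphion.

```python
import shlex


def build_valle_infer_cmd(
    infer_text,
    text_prompt,
    audio_prompt,
    expt_dir="Amphion/ckpts/tts/valle_libritts",
    config="ckpts/tts/valle_libritts/args.json",
    gpu="0",
):
    """Build the run.sh inference command as a list of arguments.

    Hypothetical helper: it only mirrors the flags used in the command above.
    """
    return [
        "sh", "egs/tts/VALLE/run.sh",
        "--stage", "3",
        "--gpu", gpu,
        "--config", config,
        "--infer_expt_dir", expt_dir,
        "--infer_output_dir", f"{expt_dir}/result",
        "--infer_mode", "single",
        "--infer_text", infer_text,
        "--infer_text_prompt", text_prompt,
        "--infer_audio_prompt", audio_prompt,
    ]


cmd = build_valle_infer_cmd(
    infer_text=(
        "This is a clip of generated speech with the given text "
        "from a text to speech model"
    ),
    text_prompt=(
        "Printing, in the only sense with which we are at present concerned, "
        "differs from most if not from all the arts and crafts represented "
        "in the Exhibition"
    ),
    audio_prompt="./LJSpeech-1.1/wavs/LJ001-0001.wav",
)

# Print a shell-safe, single-line form of the command for inspection.
print(shlex.join(cmd))
```

To actually launch inference, the list could be passed to `subprocess.run(cmd, check=True)` from the directory where the original command was run.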

zhizhengwu commented 9 months ago

@dongngm At least 10x more data is needed to reach reasonable quality.