microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing

Using SpeechT5 Large for TTS #51

Open · imranmaj opened this issue 1 year ago

imranmaj commented 1 year ago

Hello, thank you so much for providing these models and code along with all the documentation. The HuggingFace integration is very helpful for people like me whose specialty is not ML :) I tried out the TTS model available on HuggingFace and the results are very good, but I'm curious how the results would differ with the larger SpeechT5 model.
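For reference, this is roughly what I ran to try the HuggingFace model, following the microsoft/speecht5_tts model card (the x-vector speaker embedding comes from the Matthijs/cmu-arctic-xvectors dataset):

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Load the fine-tuned TTS checkpoint and the matching HiFi-GAN vocoder.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")

# Load an x-vector carrying the target speaker's voice characteristics.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Generate a mel spectrogram and vocode it to a 16 kHz waveform.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)
```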

My goal is to prepare the SpeechT5 Large model (60k hrs Libri-Light + LibriSpeech LM Dataset) for TTS in the same way that the smaller model on HuggingFace was tuned for TTS. I'm a little confused, though, about how the training was done for the smaller model to prepare it for TTS. I looked at the manifest and it says: "speecht5_tts.pt are reimplemented Text-to-Speech fine-tuning on the released manifest but with a smaller batch size or max updates (Ensure the manifest is ok)." Does this mean that the SpeechT5 TTS model was completely retrained from scratch with a different batch size/max updates, or was it fine-tuned from the SpeechT5 Base model (960 hrs LibriSpeech + LibriSpeech LM Dataset)?

The manifest also says: "This manifest is an attempt to recreate the Text-to-Speech recipe used for training SpeechT5. This manifest was constructed using LibriTTS clean datasets, including train-clean-100 and train-clean-360 for training, dev-clean for validation, and test-clean for evaluation." Does this mean that it was trained from scratch using 100 + 360 = 460 hours of LibriTTS data, or was it fine-tuned on those 460 hours of data?
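To make that second question concrete, here is how I understand the two possibilities, sketched with the HuggingFace classes (just for illustration; I realize the released checkpoints were actually produced with the repo's fairseq recipes, and the checkpoint path below is a placeholder, not a real HF model ID):

```python
from transformers import SpeechT5Config, SpeechT5ForTextToSpeech

# Possibility A: fine-tuning -- initialize from existing pre-trained weights
# and continue training on the 460 hours of LibriTTS.
# ("path/to/speecht5_base" is a placeholder; the released base/large
# pre-trained checkpoints are fairseq checkpoints, not HF ones.)
finetuned = SpeechT5ForTextToSpeech.from_pretrained("path/to/speecht5_base")

# Possibility B: training from scratch -- the same architecture with randomly
# initialized weights, trained only on the 460 hours of TTS data.
from_scratch = SpeechT5ForTextToSpeech(SpeechT5Config())
```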

Thank you!