Open thewh1teagle opened 2 weeks ago
You can certainly do that. In the end, the models in this repo learn a token ids -> mel frames mapping, independent of the language. To train on some dataset, you will have to write a data loader that maps your text to token ids, as is done for Modern Standard Arabic in this repo. The Arabic Speech Corpus has around 2 hours, and I sampled 30-60 minutes per speaker for the multi-speaker model. In my experience it is usually better to have 10+ hours for the prosody, but that will also depend on the quality of the audio files. So far, I have only trained on diacritized text. I assume it is possible for these models to learn the diacritization, but I haven't tried so far, since I don't know any good-quality dataset for that. Of course, it is possible to train a model with diacritized text, sample audio files for diacritized text, remove the diacritics, and train on that.
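To make the two concrete steps above tangible — mapping cleaned text to token ids in a data loader, and stripping diacritics from transcripts — here is a minimal sketch. This is not the repo's actual code: the symbol inventory and function names are hypothetical, and a real loader must cover every character (including all diacritics) that appears in the transcripts.

```python
import unicodedata

# Hypothetical symbol inventory; the real one depends on your dataset.
symbols = ["<pad>", "<eos>", " ", ".", ","] + list("abcdefghijklmnopqrstuvwxyz")
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def text_to_token_ids(text):
    """Map cleaned text to token ids, appending <eos> (sketch only)."""
    ids = [symbol_to_id[ch] for ch in text.lower() if ch in symbol_to_id]
    ids.append(symbol_to_id["<eos>"])
    return ids

def strip_diacritics(text):
    """Remove combining marks; Arabic tashkeel are combining characters."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(ch))

print(text_to_token_ids("hi."))                # -> [12, 13, 3, 1]
print(strip_diacritics("\u0628\u064e"))        # ba + fatha -> bare ba
```

The same `strip_diacritics` pass could be used to derive the undiacritized training text mentioned above, keeping the audio files unchanged.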
Thanks a lot for the comment!
> the models in this repo will learn a token ids -> mel frames mapping; independent of the language.
Let me know if that process sounds good:
From here I'm not sure: how do I convert the cleaned, voweled text into token IDs? Do I start the training from a pretrained model? But the pretrained models are mostly English, no? How do I make sure that punctuation marks will be reflected in the audio, e.g. pauses between sentences?
Can I use this repo's source code for the training, or is it too different when using another language?
Regarding FastPitch and HiFi-GAN: do I need to change something, or should they be used exactly as in this repo?
Also, do you think Google Colab is suitable for such training?
Also, did you train it from scratch or use a pretrained English model?
Can I use this repo for training a new TTS model in another language? How many hours of audio + transcripts do I need? Does the text need to have diacritical signs?