nipponjo / tts-arabic-pytorch

TTS models for Arabic (Tacotron2, FastPitch)

Train new language #18

Open · thewh1teagle opened this issue 2 weeks ago

thewh1teagle commented 2 weeks ago

Can I use this repo to train a new TTS model in another language? How many hours of audio + transcripts do I need? Does the text need to have diacritical signs?

nipponjo commented 2 weeks ago

You can certainly do that. In the end, the models in this repo learn a token ids -> mel frames mapping, independent of the language. In order to train on some dataset, you will have to write a data loader that maps your text to token ids, as is done for Modern Standard Arabic in this repo.

The Arabic Speech Corpus has around 2 hours, and I sampled 30-60 minutes per speaker for the multi-speaker model. In my experience it is usually better to have 10+ hours for the prosody, but that will also depend on the quality of the audio files.

So far, I have only trained on diacritized text. I assume that it is possible for these models to learn the diacritization, but I haven't tried so far, since I don't know any good-quality dataset for that. Of course, it is possible to train a model with diacritized text, sample audio files for diacritized text, remove the diacritics, and train on that.
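For illustration, a minimal sketch of such a text -> token ids mapping (the symbol list here is a made-up placeholder, not the one actually used in this repo):

```python
# Minimal sketch of a text -> token ids mapping, analogous to what a
# data loader has to do before feeding the model. The symbol list is
# hypothetical -- replace it with the characters of your language.
PAD, EOS = '_', '~'
symbols = [PAD, EOS, ' '] + list("abcdefghijklmnopqrstuvwxyz'.,?!")
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def text_to_token_ids(text: str) -> list[int]:
    """Map a cleaned transcript to token ids; unknown characters are dropped."""
    ids = [symbol_to_id[ch] for ch in text.lower() if ch in symbol_to_id]
    ids.append(symbol_to_id[EOS])  # end-of-sequence marker
    return ids

print(text_to_token_ids("Hello, world!"))
```

The mapping itself is arbitrary; what matters is that it is consistent between training and inference, since the model only ever sees the ids.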

thewh1teagle commented 2 days ago

Thanks a lot for the comment!

> the models in this repo will learn a token ids -> mel frames mapping, independent of the language.

Let me know if that process sounds good:

  1. Find 10-20 hours of high-quality recordings of a single speaker
  2. Split them into multiple files of 5-20 seconds (I can use voice activity detection for good splitting; see the VAD sketch after this list)
  3. Fix the transcriptions by converting numbers (1, 2, 3) to their spoken names
  4. Fix the transcriptions by converting symbols (such as $) to their spoken names
  5. Remove punctuation marks (can I keep them? I think they are important for how the text is spoken)
  6. Remove any character that is not in the whitelist of allowed characters (used later for the token ids; see the cleaning sketch below)
  7. Add vowel points to the transcriptions
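For step 2, I was thinking of something like this (a sketch using Silero VAD via torch.hub; the file names and paths are placeholders):

```python
# Sketch of VAD-based splitting with Silero VAD; any other VAD works too.
# Assumes long mono recordings resampled to 16 kHz.
import torch

model, utils = torch.hub.load('snakers4/silero-vad', model='silero_vad')
get_speech_timestamps, save_audio, read_audio, _, _ = utils

wav = read_audio('long_recording.wav', sampling_rate=16000)
# List of {'start': sample_idx, 'end': sample_idx} for detected speech.
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# Raw VAD segments are often shorter than 5 s; in practice adjacent
# segments may need to be merged until a clip reaches the target range.
for i, ts in enumerate(timestamps):
    clip = wav[ts['start']:ts['end']]
    duration = len(clip) / 16000
    if 5.0 <= duration <= 20.0:  # keep only 5-20 s utterances
        save_audio(f'clips/utt_{i:04d}.wav', clip, sampling_rate=16000)
```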
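And a sketch of steps 3-6 together (num2words is a real package with a lang parameter; the symbol map and whitelist below are made-up placeholders to adapt per language):

```python
# Sketch of transcript cleaning: expand symbols and numbers to words,
# then drop everything outside a character whitelist.
import re
from num2words import num2words

SYMBOLS = {'$': ' dollars ', '%': ' percent ', '&': ' and '}
# Hypothetical whitelist; must match the tokenizer's symbol list.
NOT_WHITELISTED = re.compile(r"[^a-z '.,?!-]")

def clean_transcript(text: str) -> str:
    text = text.lower()
    for sym, name in SYMBOLS.items():
        text = text.replace(sym, name)
    # '42' -> 'forty-two'; pass lang='...' for other languages.
    text = re.sub(r'\d+', lambda m: num2words(int(m.group())), text)
    text = NOT_WHITELISTED.sub('', text)
    return re.sub(r'\s+', ' ', text).strip()

print(clean_transcript("He paid $42 for 3 books!"))
# -> 'he paid dollars forty-two for three books!'
```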

From here I'm not sure: how do I convert the cleaned, voweled text into token IDs? Do I start the training from a pretrained model? But the pretrained models are mostly English, no? How do I make sure that the punctuation marks will be audible, e.g. as pauses between sentences?

Can I use the repo's source code for the training, or is it too different when using another language?

Regarding FastPitch and HiFi-GAN: do I need to change anything, or can they be used exactly as in this repo? Also, do you think Google Colab is suitable for such training?

Also, did you train it from scratch or use a pretrained English model?