jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

How to train a new model with a dataset in a different language? #33

Open 41WhiteElephants opened 4 years ago

41WhiteElephants commented 4 years ago

I would like to know how to train a glow-tts 2 model for another language, using a dataset that has the same structure as the LJ Speech dataset. Could you give some hints about how to train it, or how to do transfer learning from your pretrained models?

I have successfully trained NVIDIA's tacotron2 on a Polish dataset (see https://github.com/NVIDIA/tacotron2/issues/321/). Are the steps similar to those for tacotron2 (add my language's symbols, use a smaller learning rate for transfer learning)?

rishubil commented 4 years ago

I am not a maintainer of this repository, but I recently trained Glow-TTS successfully in a different language (Korean).

As you said, you can modify it in almost the same way as NVIDIA/tacotron2.

You can check my commit at https://github.com/sce-tts/glow-tts/commit/e9c4701e217df1df7c669bb2445d57e1197e1014
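As a rough illustration of the "add my language's symbols" step, here is a minimal sketch of a character-level symbol table for Polish. Everything in it is an assumption made for illustration: the names `symbols` and `encode_text`, the punctuation set, and the choice to skip unknown characters are hypothetical and are not taken from this repository or from the linked commit. The general point is that the model's vocabulary size usually has to match the number of symbols, so changing the symbol set also changes the text embedding shape.

```python
# Hypothetical symbol table for a Polish dataset (illustrative only,
# not the repository's actual text/symbols.py).
_pad = "_"
_punctuation = "!'(),.:;? -"
_letters_ascii = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_letters_polish = "ĄĆĘŁŃÓŚŹŻąćęłńóśźż"  # letters missing from the ASCII set

# Order matters: the index of each symbol becomes its embedding ID.
symbols = [_pad] + list(_punctuation) + list(_letters_ascii) + list(_letters_polish)

_symbol_to_id = {s: i for i, s in enumerate(symbols)}


def encode_text(text):
    """Map each known character to its integer ID, skipping unknown characters."""
    return [_symbol_to_id[ch] for ch in text if ch in _symbol_to_id]


if __name__ == "__main__":
    # The vocabulary size the model would need for this symbol set.
    print("vocabulary size:", len(symbols))
    print(encode_text("Dzień dobry!"))
```

Because the embedding shape depends on the symbol count, a checkpoint trained with a different symbol set generally cannot be loaded unchanged for transfer learning.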

41WhiteElephants commented 4 years ago

Thanks for the advice. I don't see a warm start option. Did you train it from scratch or do transfer learning from a pretrained model? And what about your changes to the mel min and max values? What are they for?

rishubil commented 4 years ago

> I don't see a warm start option. Did you train it from scratch or do transfer learning from a pretrained model?

I am considering using TTS commercially, so I did not do transfer learning from a pretrained model that was published for research use. Instead, I recorded about 3 hours of my own voice and trained from scratch.

> And what about your changes to the mel min and max values? What are they for?

I am using Multi-band MelGAN, as implemented by TensorSpeech/TensorFlowTTS, as the vocoder instead of WaveGlow. The TensorFlowTTS preprocessing sets fmin and fmax to 80 and 7600, so I set the same values in Glow-TTS.
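The reason for matching these values is that a vocoder expects mel spectrograms computed the same way as the ones it was trained on. A minimal sketch of applying such an override to a JSON training config follows. The key names `mel_fmin` and `mel_fmax` under a `data` section, the helper `override_mel_range`, and the starting values in `base` are assumptions made for illustration; check them against the config file you actually use.

```python
import json


def override_mel_range(config, fmin, fmax):
    """Return a copy of config with the mel frequency range replaced."""
    nyquist = config["data"]["sampling_rate"] / 2
    if not 0 <= fmin < fmax <= nyquist:
        raise ValueError("expected 0 <= fmin < fmax <= sampling_rate / 2")
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    updated["data"]["mel_fmin"] = float(fmin)
    updated["data"]["mel_fmax"] = float(fmax)
    return updated


# Assumed starting values, for illustration only.
base = {
    "data": {
        "sampling_rate": 22050,
        "n_mel_channels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
    }
}

# Values reported in this thread for the TensorFlowTTS preprocessing.
matched = override_mel_range(base, 80, 7600)

if __name__ == "__main__":
    print(json.dumps(matched["data"], indent=2))
```

The other spectrogram settings (sampling rate, hop length, number of mel channels) usually need to agree with the vocoder's preprocessing as well.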

41WhiteElephants commented 4 years ago

I see. So only 3 hours of data was enough to train a good-quality voice for commercial use? Can you share any samples? Also, how long did you train it, and on how many GPUs?

rishubil commented 4 years ago

I'm considering using TTS for entertainment purposes (e.g. Twitch or YouTube live broadcasts), so I don't need very high-quality audio.

Anyway, I got good enough results with the roughly 3 hours of recordings I used, but I don't know whether 3 hours of data would be enough in other cases.

Synthesized speech samples for a few sentences can be found at https://sce-tts.github.io/ and the voice dataset, pretrained model, and source code I used are all publicly available at https://sce-tts.github.io/#/license if you are interested in trying it yourself. (There is no guide written in English, though 😢)

The pretrained Glow-TTS model that I released was trained for about 2.5 days on a single RTX 6000.

This is a screenshot of my TensorBoard.

[Two images: TensorBoard screenshots]