Closed G-Wang closed 6 years ago
Hi @G-Wang , I have just added some English speech samples here. I've trained the model as well as possible, but it seems the results are not that great. This is probably because I used only character-level features. I know that in the original papers from Google they say there is no need for G2P; however, that may hold for their training data, not for LJ. The Romanian samples sound quite good now, without any lexicon, but this is not the case for English. I have just added a G2P module, which I trained on CMUDICT, and I have restarted training. As for the statistics:
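For readers unfamiliar with G2P: the idea is to replace raw characters with phoneme sequences looked up in a lexicon such as CMUDICT, falling back to characters for out-of-vocabulary words. A minimal sketch (the tiny lexicon below is an illustrative subset, not real CMUDICT data, and the function names are my own, not the repo's API):

```python
# Illustrative subset of CMUDICT-style entries (word -> ARPAbet phonemes).
# A trained G2P model would predict pronunciations for words not listed here.
CMUDICT_SAMPLE = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
    "speech": ["S", "P", "IY1", "CH"],
}

def g2p(word, lexicon=CMUDICT_SAMPLE):
    """Return the phoneme sequence for a word, falling back to characters."""
    word = word.lower()
    if word in lexicon:
        return lexicon[word]
    # Fallback for out-of-vocabulary words: character-level features,
    # i.e. what the model used before the G2P module was added.
    return list(word)

print(g2p("hello"))   # lexicon hit -> phonemes
print(g2p("github"))  # OOV -> characters
```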
It takes about one month to train a good model. On a GTX 1080 Ti, one pass over the entire LJ dataset takes about half a day for the encoder and two days for the vocoder. This is with the parameters (batch size) I used in the docs.
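As a rough back-of-envelope check on those numbers (assuming the encoder and vocoder are trained for the quoted month each, which is my reading, not a confirmed detail):

```python
# Approximate epochs achievable in one month of training on a GTX 1080 Ti,
# using the per-pass times quoted above for the LJ dataset.
DAYS_PER_MONTH = 30
ENCODER_DAYS_PER_PASS = 0.5
VOCODER_DAYS_PER_PASS = 2.0

encoder_passes = DAYS_PER_MONTH / ENCODER_DAYS_PER_PASS  # ~60 passes
vocoder_passes = DAYS_PER_MONTH / VOCODER_DAYS_PER_PASS  # ~15 passes
print(encoder_passes, vocoder_passes)
```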
If you are having issues playing the files on GitHub, I suggest you clone the REPO and play them locally. I'm having this issue with my browser.
Also, another contributor is working on adding "unsupervised" text features. The idea is to extract dependencies between characters/words from a large text corpus (in our case, Wikipedia dumps) and feed them to the model as external features. This should improve prosody modeling, but we haven't tested it yet.
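One simple way to realize the "dependencies from a large corpus" idea is windowed word co-occurrence counts, which can then be turned into feature vectors. A hedged sketch under that assumption (the exact method the contributor uses is not specified here; everything below is illustrative):

```python
# Sketch: count word co-occurrences within a sliding window over a corpus
# (e.g. sentences from a Wikipedia dump). These counts could serve as raw
# material for external text features; the real approach may differ.
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Symmetric co-occurrence counts for word pairs within `window` tokens."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                counts[(tok, other)] += 1
                counts[(other, tok)] += 1
    return counts

corpus = ["the cat sat on the mat", "the cat ate"]
counts = cooccurrence_counts(corpus)
print(counts[("the", "cat")])  # "the" and "cat" are adjacent in both sentences
```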
Thanks for the info and great work again. I'll give the models a spin.
Hello,
Thank you for the wonderful repository.
I read that you're currently training on LJSpeech dataset for english TTS.
Do you have any updates on audio samples?
Also, would you be able to provide some rough training stats (number of GPUs used, hours needed per pass through the data, etc.)?
Thanks again for the awesome repository and open-source effort.