Recommendation for Training/

AliNGatGeeks commented 2 years ago

Hi Thorsten

I hope you are well, and thank you for your previous responses.

I am using ESPnet to train TTS, with Tacotron2 for text-to-mel and Parallel WaveGAN for decoding. I am facing some difficulties and would appreciate your comments, if it is not too much trouble.

First, at the moment I am only using character tokenization, so my token list is essentially the English alphabet plus a few extra characters. Special German characters are mapped to English characters. Would this be sufficient for generating good-quality speech, or will it significantly reduce the quality of the generated voice? I have difficulty using phonemes at the moment, so I am thinking about leaving the conversion to phonemes for a later training run.

Then, as you mentioned, you have three different quality levels of WAV files. I am training on all of the files. Is that OK, or do you recommend something else, like training on a subset?

Furthermore, I am training at a 22050 Hz sample rate (should I stick with this or change to 16k?). Finally, I occasionally saw samples with some silence at the beginning, end, or middle of the WAV files, but there are only a few of them. Do you recommend removing them? Best regards

thorstenMueller commented 2 years ago

Hi @AliNGatGeeks, for training a character-based model I'd recommend using https://github.com/coqui-ai/TTS/blob/main/TTS/bin/find_unique_chars.py, as it prints a list of characters that can be added to the training config. This should be sufficient. Training with all files is okay, as @domcross optimized all recordings. If you're not satisfied with the result, you can train without the lowest quality level. You can stay with the 22k sample rate.
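For readers who just want the idea behind collecting a character set, here is a minimal sketch. It is not the Coqui script linked above and does not reproduce its interface. It assumes an LJSpeech-style metadata file with pipe-separated rows (`id|text`); the function name `unique_chars`, the column index, and the sample rows are all hypothetical.

```python
from collections import Counter


def unique_chars(metadata_lines, text_column=1, delimiter="|"):
    """Collect the characters used in the transcript column.

    Assumes each row looks like "id|text" (LJSpeech-style); adjust
    text_column and delimiter if your metadata is laid out differently.
    Returns (sorted list of characters, Counter of occurrences).
    """
    counts = Counter()
    for line in metadata_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        if len(fields) <= text_column:
            continue  # skip malformed rows without a transcript
        counts.update(fields[text_column])
    return sorted(counts), counts


# Hypothetical sample rows, for illustration only.
sample = [
    "0001|Guten Morgen.",
    "0002|Schöne Grüße!",
]
chars, counts = unique_chars(sample)
print("".join(chars))
```

Looking at the counts as well as the set helps spot rare characters (stray symbols, for example) that may be better cleaned from the transcripts than added to the token list.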

I saw samples with some silences in the beginning/end/middle of the wav files. But there are only a few. Do you recommend removing them?

Yes.
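As a rough illustration of what removing silence can mean, here is a minimal sketch that trims only leading and trailing silence from a list of samples. It assumes float samples normalized to [-1, 1]; the function name `trim_silence` and the amplitude threshold of 0.01 are hypothetical choices, not values from this thread. Silence in the middle of a recording is left untouched, and real pipelines usually work on frame energy in dB rather than on single samples.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below threshold.

    Assumes float samples in [-1, 1]. Interior silence is kept as-is.
    Returns an empty list if every sample is below the threshold.
    """
    start = 0
    end = len(samples)
    # Advance past quiet samples at the start.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Step back over quiet samples at the end.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


# Hypothetical sample values, for illustration only.
audio = [0.0, 0.001, 0.2, -0.3, 0.0, 0.25, 0.002, 0.0]
trimmed = trim_silence(audio)
print(trimmed)
```

Since only a few files are affected, listening to the trimmed results before training is a cheap way to check that the threshold does not cut into quiet speech onsets.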

AliNGatGeeks commented 2 years ago

Thank you for your comments. Will apply.

thorstenMueller / Thorsten-Voice

Recommendation for Training/ #34