Closed: roholazandie closed this issue 4 years ago.
I do not know of a good-quality male English TTS corpus.
But recently I made GST-Taco2 using the VCTK dataset, which is a multi-speaker English model.
You can listen to the samples of GST-Taco2 + PWG in wav_pwg/.
https://drive.google.com/drive/folders/1MbjtiO9MjAv-ClFrgBqE2ddAhGoTioq2?usp=sharing
Not sure whether the quality is acceptable for you, but it is worthwhile to try.
You can train it via the ESPnet2 recipe espnet/egs2/vctk/tts1.
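If it helps, ESPnet2 recipes are usually launched through the recipe's run.sh script; a rough sketch is below (the stage numbers and available options may differ between ESPnet versions, so check the recipe's README first):

```shell
# Assumes an ESPnet checkout with its dependencies already installed
# (see the ESPnet installation docs).
cd espnet/egs2/vctk/tts1

# Run the recipe pipeline: data preparation, feature extraction,
# training, and decoding. --stage / --stop-stage let you resume
# from, or stop after, a particular step.
./run.sh --stage 1 --stop-stage 6
```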
I do not have much recording experience, but in general you need to take care of the following points:
For fine-tuning a pretrained model you can get by with much less data. Two hours is ideal, although I've gotten away with 30 minutes, 10 minutes, and even 70 seconds of data in the case of FastSpeech2.
Thanks for the answer. So wav_pwg contains the samples from the dataset and wav contains the outputs of the model?
No. Both are model outputs. wav is GST-Taco2 + Griffin-Lim and wav_pwg is GST-Taco2 + PWG.
Sorry to bug you again; the quality of wav_pwg seems reasonable to me. But as you mentioned, it is a multi-speaker model. How does the model decide to output a certain voice? Is there any control code for that?
@roholazandie The speaker characteristics can be controlled by the style embedding. The style embedding is a weighted sum of the global style tokens, and the weights are decided by a query audio. We give a text plus a query audio that is a speech sample of the target speaker. See the details in the original paper https://arxiv.org/abs/1803.09017.
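The weighted-sum mechanism above can be sketched in a few lines. This is a simplified, single-head dot-product version (the actual GST paper uses multi-head attention with learned key projections), with random values standing in for learned parameters:

```python
import numpy as np

def gst_style_embedding(ref_embedding, style_tokens):
    """Style embedding as an attention-weighted sum of global style tokens.

    ref_embedding: (d,) vector summarizing the query/reference audio
    style_tokens:  (n_tokens, d) learned token bank
    """
    # Attention scores: similarity between the reference and each token
    scores = style_tokens @ ref_embedding        # (n_tokens,)
    # Softmax turns the scores into combination weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The style embedding is the weighted sum of the tokens
    return weights @ style_tokens                # (d,)

# Toy example: 10 style tokens of dimension 4
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))
ref = rng.normal(size=4)  # stands in for the reference-encoder output
style = gst_style_embedding(ref, tokens)
print(style.shape)  # (4,)
```

At synthesis time the style embedding is concatenated with (or added to) the text-encoder outputs, so feeding a query utterance from the target speaker steers the model toward that speaker's voice.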
In GST-Taco2 + PWG, does PWG support multi-speaker synthesis, or does it need fine-tuning for each speaker? In my experiments, training PWG on a multi-speaker dataset decreased the similarity to the target speaker.
Please check the samples of VCTK / JNAS / LibriTTS: https://github.com/kan-bayashi/ParallelWaveGAN#results. Basically, it works with multi-speaker datasets. The necessity of fine-tuning depends on the target level of quality.
Hello, could you please share the checkpoint for the male text-to-speech GST-Taco2 + PWG used to get wav_pwg?
You can get it from https://github.com/espnet/espnet_model_zoo.
We are researchers at the University of Denver working on a text-to-speech task for a robot assistant, and we need a male voice for it. We have tried to find a male voice model or corpus to train or use, but we couldn't find any. We have the following questions:
1. Is there any good-quality text-to-speech model (in ESPnet or outside of it) or a male voice corpus that we can use to train both the model and the vocoder with good quality? (We want the generated voice to sound natural.)
2. If nothing with good quality exists, can you help us with how to collect and create a quality corpus from a male voice actor for text-to-speech tasks? What would be the ideal number of hours, sampling rate, and other parameters? Are there any guidelines on making such a corpus? We specifically want it to be good enough for training both the vocoder and the model.