kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with PyTorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

text2speech for male voice #181

Closed. roholazandie closed this issue 4 years ago.

roholazandie commented 4 years ago

We are researchers at the University of Denver working on a text-to-speech task for a robot assistant, and we need a male voice for it. We have tried to find a MALE voice model/corpus to train or use, but we couldn't find any. We have the following questions:

1. Is there any good-quality text-to-speech model (in ESPnet or outside of it) or a male voice corpus that we can use to train both the model and the vocoder (we want the generated voice to sound natural)?

2. If nothing with good quality exists, can you help us with how to collect and create a quality corpus from a male voice actor for text-to-speech tasks? What would be the ideal number of hours, sampling rate, and other parameters? Are there any guidelines on making such a corpus? We specifically want it to be good enough for training both the vocoder and the model.

kan-bayashi commented 4 years ago
  1. I do not know of a good-quality male English TTS corpus. But recently I made a GST-Taco2 model using the VCTK dataset, which is a multi-speaker English model. You can listen to samples of GST-Taco2 + PWG in wav_pwg/: https://drive.google.com/drive/folders/1MbjtiO9MjAv-ClFrgBqE2ddAhGoTioq2?usp=sharing I am not sure the quality is acceptable for you, but it is worthwhile to try. You can train it via the ESPnet2 recipe espnet/egs2/vctk/tts1.

  2. I do not have much recording experience, but in general, you need to take care of the following points:

ZDisket commented 4 years ago

For fine-tuning a pretrained model you can get away with much less data. Two hours is ideal, although I've gotten away with 30 minutes, 10 minutes, and even 70 seconds of data in the case of FastSpeech2.

roholazandie commented 4 years ago

Thanks for the answer. So wav_pwg contains samples from the dataset and wav contains the outputs of the model?

kan-bayashi commented 4 years ago

No. Both are model outputs. wav is GST-Taco2 + Griffin-Lim and wav_pwg is GST-Taco2 + PWG.
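For context, Griffin-Lim reconstructs a waveform from the predicted mel spectrogram by iterative phase estimation alone, with no neural vocoder, which is why it usually sounds buzzier than PWG. A minimal librosa sketch (parameters here are illustrative, not the exact settings used for these samples):

```python
import numpy as np
import librosa

# Placeholder audio; in practice the mel would come from the acoustic model.
wav = np.random.randn(22050).astype(np.float32)
mel = librosa.feature.melspectrogram(
    y=wav, sr=22050, n_fft=1024, hop_length=256, n_mels=80
)

# Invert magnitude-only features back to audio via Griffin-Lim phase estimation.
recon = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, n_iter=32
)
```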

roholazandie commented 4 years ago

Sorry to bug you again. The quality of wav_pwg seems reasonable to me, but as you mentioned, it is a multi-speaker model. How does the model decide to output a certain voice? Is there any control code for that?

kan-bayashi commented 4 years ago

@roholazandie The speaker characteristics can be controlled by the style embedding. The style embedding is a weighted sum of learned global style tokens, and the weights are basically decided by a query audio: we give the model a text plus query audio, i.e., a speech sample of the target speaker. See the original paper for details: https://arxiv.org/abs/1803.09017.
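For intuition, here is a minimal single-head PyTorch sketch of that token-weighting idea (the shapes, names, and dot-product attention are simplifications of the multi-head attention in the paper, not the actual ESPnet implementation):

```python
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    """Simplified GST layer: attend over learned global style tokens."""

    def __init__(self, num_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        # Learnable global style tokens shared across all utterances.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Projects the reference-encoder output (from query audio) to token space.
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, ref_dim), e.g. a reference encoder run on query audio.
        query = self.query_proj(ref_embedding)                    # (batch, token_dim)
        scores = query @ self.tokens.t() / self.tokens.size(1) ** 0.5
        weights = torch.softmax(scores, dim=-1)                   # (batch, num_tokens)
        # Style embedding = attention-weighted sum of (tanh-squashed) tokens.
        style = weights @ torch.tanh(self.tokens)                 # (batch, token_dim)
        return style, weights

layer = StyleTokenLayer()
ref = torch.randn(2, 128)  # stand-in for reference-encoder outputs
style, w = layer(ref)
print(style.shape, w.shape)  # torch.Size([2, 256]) torch.Size([2, 10])
```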

Approximetal commented 4 years ago

In GST-Taco2 + PWG, does PWG support multiple speakers, or does it need fine-tuning for each speaker? In my experiments, training PWG with a multi-speaker dataset decreases the similarity to the target speaker.

kan-bayashi commented 4 years ago

Please check the samples of VCTK / JNAS / LibriTTS: https://github.com/kan-bayashi/ParallelWaveGAN#results Basically, it works with a multi-speaker dataset. The necessity of fine-tuning depends on the target level of quality.
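For reference, decoding mel spectrograms with a trained (multi-speaker) PWG checkpoint follows the usage shown in this repo's README; the checkpoint path and mel input below are placeholders:

```python
import torch
from parallel_wavegan.utils import load_model

# Load a trained vocoder; the config is read from the checkpoint directory.
# The path is a placeholder for your own checkpoint.
vocoder = load_model("checkpoint-400000steps.pkl")
vocoder.remove_weight_norm()
vocoder = vocoder.eval()

# Mel spectrogram from the acoustic model: (num_frames, num_mels).
# Placeholder here; real features must match the vocoder's training setup
# (sampling rate, hop size, normalization statistics).
mel = torch.randn(240, 80)
with torch.no_grad():
    wav = vocoder.inference(mel).view(-1)  # 1-D waveform tensor
```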

AyaLahlou commented 4 years ago

Hello, could you please share the checkpoint for the male text2speech GST-Taco2 + PWG model used to generate wav_pwg?

kan-bayashi commented 4 years ago

You can get it from https://github.com/espnet/espnet_model_zoo.
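A minimal sketch of pulling the model from the zoo and synthesizing with ESPnet2 (the model tag and reference clip are placeholders; the exact tag and the returned keys depend on the zoo table and the ESPnet version):

```python
import soundfile as sf
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.tts_inference import Text2Speech

# Download by tag and build the inference wrapper.
# Placeholder tag; look up the actual VCTK GST-Taco2 entry in the zoo table.
d = ModelDownloader()
tts = Text2Speech(**d.download_and_unpack("kan-bayashi/vctk_gst_tacotron2"))

# A GST model is conditioned on query audio of the target (male) speaker.
speech, sr = sf.read("reference_male_speaker.wav")  # placeholder reference clip
result = tts("Hello from the robot assistant.", speech=speech)
wav = result["wav"]  # key names vary across ESPnet versions
```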