keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Question: Multiple speakers voice for training data #72

Closed. rild closed this issue 6 years ago.

rild commented 6 years ago

TL;DR

Is it better to use voice data from a single speaker as training data?

Question

I am a student studying speech synthesis at university in Japan.

I read the Tacotron paper and wanted to try it myself, so I am trying to train Tacotron on Japanese. (I was able to confirm that training on English works as intended when using the training data available in the repository.)

The problem is that the encoder/decoder alignment for Japanese is not learned well.

Here is the alignment plot at step 23000 (step-23000-align.png):

The training data is speech from three speakers, each reading 100 sentences.

I suppose the reason the alignment fails is as follows:

Thank you.

AzamRabiee commented 6 years ago

I usually work with a toy dataset of about the same size as yours for sanity checks, and the alignment still shows up quickly. Try two things: 1. inject a speaker ID, as in issue #18, and 2. use monotonic attention.
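For illustration, here is a minimal sketch of both suggestions against a keithito/tacotron-style TensorFlow 1.x model. The tensor names (`encoder_outputs`, `speaker_ids`), the embedding size, and the attention depth are assumptions for the sketch, not the repo's exact code:

```python
import tensorflow as tf

# Assumed inputs (placeholders stand in for the real encoder/feeder tensors).
encoder_outputs = tf.placeholder(tf.float32, [None, None, 256])  # [N, T_in, 256]
speaker_ids = tf.placeholder(tf.int32, [None])                   # [N]

num_speakers = 3        # three speakers in this thread's dataset
speaker_embed_dim = 16  # small learned embedding; size is a guess

# 1) Speaker-ID injection (issue #18 style): look up a learned speaker
#    embedding and concatenate it to every encoder timestep.
speaker_table = tf.get_variable(
    'speaker_embedding', [num_speakers, speaker_embed_dim])
speaker_embed = tf.nn.embedding_lookup(speaker_table, speaker_ids)  # [N, D]
speaker_embed = tf.tile(
    tf.expand_dims(speaker_embed, 1),             # [N, 1, D]
    [1, tf.shape(encoder_outputs)[1], 1])         # [N, T_in, D]
encoder_outputs = tf.concat([encoder_outputs, speaker_embed], axis=-1)

# 2) Monotonic attention: swap BahdanauAttention for its monotonic variant
#    where the attention mechanism is constructed.
attention_mechanism = tf.contrib.seq2seq.BahdanauMonotonicAttention(
    256, encoder_outputs)
```

The rest of the decoder (e.g. the `AttentionWrapper` around the decoder cell) can stay as-is; only the attention mechanism changes.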

rild commented 6 years ago

@AzamRabiee, thank you for your reply!

I didn't know about the monotonic attention paper.

I will read issue #18 and the monotonic attention paper and try those suggestions.

rild commented 6 years ago

Is #53 the same problem?

r9y9 commented 6 years ago

If you are building multi-speaker models (like DeepVoice2 or 3), it should be okay to use data from multiple speakers. However, I'm pretty sure the reason you got non-monotonic alignment is that you don't have sufficient data. https://sites.google.com/site/shinnosuketakamichi/publication/jsut is a freely available Japanese dataset that might be useful for you. It consists of 10 hours of audio recordings by a single female speaker. I just started exploring the dataset today with the DeepVoice3 architecture and can get nearly monotonic attention very quickly, as shown below:

[Attached: step000005000_layer_1_alignment, alignment plot at step 5000]

Code is available at https://github.com/r9y9/deepvoice3_pytorch. If you are interested, feel free to contact me.
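To use JSUT with an LJSpeech-style pipeline such as this repo's, the transcript needs to be converted into `wav_path|text` metadata lines. Here is a minimal sketch, assuming the JSUT `basic5000` layout of `basic5000/transcript_utf8.txt` (lines like `BASIC5000_0001:text`) and `basic5000/wav/BASIC5000_0001.wav`; verify the paths against your download:

```python
import os

def jsut_to_metadata(jsut_dir, out_path):
    # Convert JSUT's "NAME:text" transcript into "wav_path|text" lines.
    transcript = os.path.join(jsut_dir, 'basic5000', 'transcript_utf8.txt')
    with open(transcript, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            line = line.strip()
            if ':' not in line:
                continue
            name, text = line.split(':', 1)
            wav = os.path.join(jsut_dir, 'basic5000', 'wav', name + '.wav')
            fout.write('%s|%s\n' % (wav, text))

jsut_to_metadata('jsut_ver1.1', 'metadata.csv')
```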

rild commented 6 years ago

@r9y9 , Thank you!

I was just trying to build the simplest possible TTS system, so I have no intention of building a multi-speaker model for now.

I got the training dataset from here.

I feel the corpus utterances may be too long to train Tacotron on. Yesterday I ran train.py with 100 utterances (a single-speaker, non-emotional dataset), but I didn't get good results.

The JSUT dataset (arXiv) looks good for training!
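If overly long utterances are the concern, one option is to filter them out before training. Here is a minimal sketch, assuming an LJSpeech-style `metadata.csv` of `wav_path|text` lines and PCM `.wav` files; the 10-second cutoff is an arbitrary starting point, not a recommendation from the repo:

```python
import contextlib
import wave

MAX_SECONDS = 10.0  # rough cutoff; tune for your corpus

def duration_seconds(wav_path):
    # Read duration from the WAV header (assumes PCM .wav files).
    with contextlib.closing(wave.open(wav_path, 'rb')) as w:
        return w.getnframes() / float(w.getframerate())

with open('metadata.csv', encoding='utf-8') as fin, \
     open('metadata_short.csv', 'w', encoding='utf-8') as fout:
    for line in fin:
        wav_path = line.split('|')[0]
        if duration_seconds(wav_path) <= MAX_SECONDS:
            fout.write(line)
```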

rild commented 6 years ago

@r9y9

> If you are interested, feel free to contact me.

I feel very encouraged to hear that. Thank you so much. I would like to send you an email later.