keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Less dataset needed for training #170

Open begeekmyfriend opened 6 years ago

begeekmyfriend commented 6 years ago

I have tried a 10-hour dataset on Rayhane Mamah's version, which is derived from this project: https://github.com/Rayhane-mamah/Tacotron-2/issues/35. I wonder what improvements could reduce the amount of training data needed and allow convergence in fewer than 5K steps.

begeekmyfriend commented 6 years ago

Hi all, over the past few days I have been running a series of ablation studies to understand which components determine how much training data is needed. I can now train on a small dataset (~10h) and reach convergence in 10K steps, based on Keith Ito's tacotron2-work-in-progress branch, and I have released my minimal modifications in my repo. Most of the code is derived from Rayhane Mamah's version. (Alignment plot: step-10000-align.) The alignment is not very clear or consecutive yet, because this is only an ablation study on small-dataset convergence and the modifications are minimal. Later I will give some advice for improving alignments, such as trimming leading and trailing silence from the audio.

In my opinion, there are two indispensable factors: the decoder RNN architecture and the normalization of the mel spectrograms. I call them indispensable because training fails to converge on a small dataset without either of these two modifications.

First, the decoder RNN architecture. Keith's and Rayhane's designs differ. Keith uses AttentionWrapper with an extra attention RNN wrapped inside it, and the query of the attention mechanism is the hidden state of that attention RNN. In Rayhane's design there is no attention RNN; the query is the hidden state of the decoder LSTM. The modification is in rnn_wrapper.py, and this post illustrates both decoder RNN structures. The conclusion we can draw is that AttentionWrapper may be inappropriate for the Tacotron model because the attention RNN is redundant in the decoder: the 2-layer decoder LSTM does the same job and is closer to the output frames.
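To make the difference concrete, here is a rough sketch of the two designs (TF 1.x contrib API; the unit sizes are placeholders, not the exact code in rnn_wrapper.py):

```python
import tensorflow as tf

def attention_wrapper_decoder(memory):
    """Keith's branch: a separate attention RNN is wrapped by AttentionWrapper,
    so the attention query is the attention RNN's hidden state."""
    attention_rnn = tf.contrib.rnn.GRUCell(256)
    mechanism = tf.contrib.seq2seq.BahdanauAttention(num_units=256, memory=memory)
    attention_cell = tf.contrib.seq2seq.AttentionWrapper(attention_rnn, mechanism)
    # The 2-layer decoder LSTM sits after the wrapper, further from the attention.
    decoder_lstm = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.LSTMCell(1024), tf.contrib.rnn.LSTMCell(1024)])
    return attention_cell, decoder_lstm

def decoder_lstm_query(memory):
    """Rayhane's design: no attention RNN; the 2-layer decoder LSTM's own
    hidden state is used as the query inside a custom decoder cell."""
    decoder_lstm = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.LSTMCell(1024), tf.contrib.rnn.LSTMCell(1024)])
    mechanism = tf.contrib.seq2seq.BahdanauAttention(num_units=256, memory=memory)
    # At each step the custom cell computes roughly:
    #   alignments, _ = mechanism(query=decoder_lstm_state[-1].h,
    #                             state=previous_alignments)
    return decoder_lstm, mechanism
```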

Second, the normalization of the mel spectrograms. As Rayhane has illustrated, the spread of the spectrogram values in his model is larger than in Keith's, and the mean of the distribution is zero, i.e. the range is symmetric. I find this crucial for convergence on a small dataset: when I change the range from [-4, 4] to [0, 1], [0, 8] or [-1, 1], training fails to converge. The modification is in audio.py. The conclusion is that [0, 1] normalization may not be the best choice for small-dataset convergence.
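A sketch of the two schemes (the constants min_level_db and max_abs_value follow common Tacotron hyperparameter names and are assumptions, not necessarily the exact values in audio.py):

```python
import numpy as np

min_level_db = -100.0   # dB floor used before normalization (assumed value)
max_abs_value = 4.0     # target half-range for the symmetric scheme

def normalize_asymmetric(S):
    # Keith's scheme: squash the dB-scale spectrogram into [0, 1].
    return np.clip((S - min_level_db) / -min_level_db, 0.0, 1.0)

def normalize_symmetric(S):
    # Rayhane's scheme: zero-mean, symmetric range [-4, 4], which I found
    # crucial for convergence on a ~10h dataset.
    scaled = (S - min_level_db) / -min_level_db
    return np.clip(2.0 * max_abs_value * scaled - max_abs_value,
                   -max_abs_value, max_abs_value)
```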

These two components are the key to whether we can reach convergence with less data. There are also some optional techniques to improve or speed up convergence. For instance, we can trim silence from the audio more strictly to get better alignments. I am currently using librosa.effects.trim, but the alignment does not look good enough yet; I will improve it and show the results later.
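For reference, the trimming step looks roughly like this (the top_db threshold is an assumption; a lower value trims silence more aggressively):

```python
import librosa

def load_and_trim(path, sample_rate=22050, top_db=40):
    # Load the waveform and strip leading/trailing silence below top_db.
    wav, _ = librosa.load(path, sr=sample_rate)
    trimmed, _ = librosa.effects.trim(wav, top_db=top_db)
    return trimmed
```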

Last but not least, my repo is currently only for ablation studies. Please do NOT treat it as an improved Tacotron model or expect better results from it. You can use Rayhane's version for validation.

begeekmyfriend commented 6 years ago

Another finding: we can halve num_freq (and fft_size) to speed up convergence. With num_freq set to 513 (fft_size 1024), convergence appears at 6K steps. I have trained twice to confirm it. (Alignment plot: step-6000-align.)
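In hparams terms the change is simply (assuming the upstream default of num_freq = 1025, i.e. an FFT size of 2048):

```python
num_freq = 513                   # halved linear-spectrogram resolution
fft_size = (num_freq - 1) * 2    # = 1024, down from 2048
```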

begeekmyfriend commented 6 years ago

The alignment looks better than in the first comment now that cumulative attention states have been added to the location-sensitive attention mechanism. I have also increased the number of filters and the kernel size of the location convolution. Here is the commit: https://github.com/begeekmyfriend/tacotron/commit/ac23e1b019221d3ac00017410678b4610787b1f1. The alignment is still not consecutive, though, so I will keep improving it and watching the results. (Alignment plot: step-5000-align.)
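Roughly, the location features are built from the cumulative alignments like this (a sketch in TF 1.x style; the filter count and kernel size shown are the enlarged values I mention, but the exact numbers are in the commit):

```python
import tensorflow as tf

def location_features(cumulative_alignments, filters=32, kernel_size=31,
                      attention_dim=128):
    # cumulative_alignments: [batch, max_time], the sum of all previous
    # alignment distributions, fed back as extra attention input.
    expanded = tf.expand_dims(cumulative_alignments, axis=2)        # [B, T, 1]
    f = tf.layers.conv1d(expanded, filters=filters,
                         kernel_size=kernel_size, padding='same',
                         name='location_conv')                      # [B, T, filters]
    return tf.layers.dense(f, units=attention_dim, use_bias=False,
                           name='location_layer')                   # [B, T, attention_dim]
```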

begeekmyfriend commented 6 years ago

Now the alignment looks all right at 3K steps if we add a bias variable in _location_sensitive_score. See https://github.com/begeekmyfriend/tacotron/commit/ab8aacf3298900d49ab7656ad061a4a99f41430d. As we can see, the bias is indispensable!! (Alignment plot: step-3000-align.)
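The scoring function with the bias term looks roughly like this (a sketch; variable names are illustrative rather than copied from the commit). It follows the usual location-sensitive energy, energy = vᵀ tanh(W_q·query + W_f·location + keys + b):

```python
import tensorflow as tf

def _location_sensitive_score(processed_query, processed_location, keys):
    # processed_query:    [batch, 1, num_units]
    # processed_location: [batch, max_time, num_units]
    # keys:               [batch, max_time, num_units]
    num_units = keys.shape[-1].value
    v = tf.get_variable('attention_v', [num_units], dtype=tf.float32)
    b = tf.get_variable('attention_bias', [num_units], dtype=tf.float32,
                        initializer=tf.zeros_initializer())
    # Broadcast the query over time and reduce to one energy per encoder step.
    return tf.reduce_sum(
        v * tf.tanh(keys + processed_query + processed_location + b), axis=2)
```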

begeekmyfriend commented 6 years ago

Commit for L2 regularization: https://github.com/begeekmyfriend/tacotron/commit/e13c3b7d7baa3c239ce77c743b1deb262deb1a49. The alignment at 7K steps shown here uses L1 regularization with r == 2, and it is not consecutive enough. P.S. The one in the previous comment uses L2. (Alignment plot: step-7000-align.)
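The L2 term is added to the training loss roughly as follows (a sketch; the regularization weight and the excluded variable names are assumptions, the real values are in the commit):

```python
import tensorflow as tf

def add_l2_regularization(loss, weight=1e-6, exclude=('bias',)):
    # Sum squared weights of all trainable variables except biases,
    # then add the scaled penalty to the existing loss.
    l2 = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()
                   if not any(name in v.name for name in exclude)])
    return loss + weight * l2
```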

ryancwalsh commented 6 years ago

https://github.com/Kyubyong/speaker_adapted_tts mentions only needing 1 minute of sample data.

There are few instructions though.

I haven't even figured out the first step of synthesizing a new voice from my own audio recordings.