keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Alignment fine during training, but fails at test #269

Open rotorooter101 opened 5 years ago

rotorooter101 commented 5 years ago

I have been working on TTS for several months now, and my (20+ hour) dataset is driving me crazy. At training time, keithito/tacotron and Rayhane-mamah/Tacotron2 both align fine, but when I switch to pure inference (with, of course, no teacher forcing), the alignment of the final utterance becomes wishy-washy and the wav is completely unintelligible.

Does anyone else have this problem? Maybe only for certain models or datasets, or early in the process?

I have made half-hearted attempts to compensate for what I guess is a difficult dataset: a VAE, GST (global style tokens), and forced manual alignments at both train and test time. Nothing has worked yet, so I mostly just wanted to share my frustration here in case this sounds familiar to anyone.

The most obviously difficult part of my dataset is the prosody: there are silences of 0.4-4 s that cannot be anticipated from the text, hence the VAE or GST. Splitting utterances to eliminate these gaps did not clearly solve the problem, but perhaps I should revisit that.
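In case it's useful to anyone, this is roughly how I did the splitting: a minimal sketch using librosa's silence detection. The file name, `top_db` threshold, and pause cutoff are just placeholders.

```python
import librosa
import soundfile as sf

def split_on_long_pauses(wav_path, top_db=35, max_pause_sec=0.4, sr=22050):
    y, sr = librosa.load(wav_path, sr=sr)
    # (start, end) sample indices of non-silent spans.
    intervals = librosa.effects.split(y, top_db=top_db)
    chunks, current = [], [intervals[0]]
    for prev, nxt in zip(intervals[:-1], intervals[1:]):
        pause_sec = (nxt[0] - prev[1]) / sr
        if pause_sec > max_pause_sec:
            chunks.append(current)   # long pause: start a new sub-utterance
            current = [nxt]
        else:
            current.append(nxt)
    chunks.append(current)
    # Keep each sub-utterance contiguous from its first to its last non-silent
    # span; only the long pauses between chunks are dropped.
    return [y[c[0][0]:c[-1][1]] for c in chunks]

for i, piece in enumerate(split_on_long_pauses("utt_0001.wav")):
    sf.write(f"utt_0001_part{i}.wav", piece, 22050)
```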

Related question: does anyone actually use teacher forcing and let the ratio go down to zero? Is that a reasonable goal? I can get down to about 0.8 (from 1.00) by lowering the learning rate, but training fails if I go any lower.
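For concreteness, this is the kind of schedule I mean; the numbers and function names here are made up for illustration, not the actual hparams in either repo.

```python
import random

def teacher_forcing_ratio(step, start_decay=20_000, decay_steps=200_000,
                          init_ratio=1.0, final_ratio=0.0):
    # Linearly anneal the ratio by global step after a warm-up period.
    if step < start_decay:
        return init_ratio
    frac = min(1.0, (step - start_decay) / decay_steps)
    return init_ratio + frac * (final_ratio - init_ratio)

def choose_decoder_input(ground_truth_frame, predicted_frame, tf_ratio):
    # Scheduled sampling: at each decoder step, feed the ground-truth frame
    # with probability tf_ratio, otherwise feed the model's own previous output.
    return ground_truth_frame if random.random() < tf_ratio else predicted_frame
```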

rotorooter101 commented 5 years ago

Still working on this. There is a lot of varied prosody and emotion in my dataset that I am trying to capture. With a VAE, I can get the teacher_forcing ratio down to 0.25 or so; I think it's possible that with an expanded model and 500k-1M iterations, I could get down to zero.

At tf_ratio=0.25, here is the alignment at 230k steps (about 5 days of training): [attached alignment plot: istep-align]

And here it is at test time (no dropout, tf_ratio=0.0), where the audio is basically a feed-forward hum after the first 0.5 s: [attached alignment plot: istep-align-valid]

Surely my dataset must be similar to those of others who have tried, e.g., multi-speaker models? My current hunch is that the model's audio output is so bad (or so varied) at this stage of training that it is not a suitable query for the attention mechanism; the attention layer is then only able to give a fuzzy answer.
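To make that hunch concrete, here is a stripped-down numpy sketch of the feedback path (not the actual code in either repo): the attention query at step t is computed from the frame emitted at t-1, so once the output degrades, the query and therefore the alignment degrade with it.

```python
import numpy as np

def prenet(frame, W):
    # Stand-in for the decoder prenet that processes the previous frame.
    return np.tanh(W @ frame)

def attend(query, encoder_outputs):
    # Plain dot-product attention just for illustration; the point is only
    # that the weights depend on the query.
    scores = encoder_outputs @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ encoder_outputs

def decode(encoder_outputs, targets, W, V, teacher_forcing=True):
    prev = np.zeros(targets.shape[1])          # <GO> frame
    outputs = []
    for t in range(len(targets)):
        query = prenet(prev, W)                # query is built from prev frame
        context = attend(query, encoder_outputs)
        frame = V @ np.concatenate([query, context])
        outputs.append(frame)
        # Training: the ground-truth frame goes back in.
        # Inference: whatever the model just produced (possibly garbage) goes
        # back in, and the next query inherits that garbage.
        prev = targets[t] if teacher_forcing else frame
    return np.stack(outputs)
```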

Here are the options I am considering:

  1. Enlarge the VAE, eliminate the extra-long pauses in the dataset, and train it for a week. Maybe it'll work. [Also add a stop token, which is clearly needed.]
  2. Expand the attention network with extra layers, or add location-sensitive attention (see the sketch after this list). Maybe the attention network just needs to be less stressed about what audio it is receiving.
  3. Make a separate network to predict alignment + f0, as in Google's recent "Sample Efficient" paper. Since prosody seems to be the main stressor for attention, the spectrogram network can just be a stack of LSTMs fed f0 plus upsampled characters (sketched below, after the next paragraph). The current failure mode would then be impossible.
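For option 2, the scoring I have in mind is roughly the location-sensitive attention from Tacotron 2, where the cumulative alignment history is convolved and added to the energy. A simplified numpy sketch, with names and shapes of my own choosing:

```python
import numpy as np

def location_sensitive_alignment(query, keys, cum_align, W_q, W_k, W_f, filters, v):
    """query: (d_q,), keys: (T, d_k), cum_align: (T,) running sum of past weights."""
    # Location features: convolve the cumulative alignment so the scorer
    # knows where it has already attended.
    loc_feats = np.stack(
        [np.convolve(cum_align, f, mode="same") for f in filters], axis=1
    )                                                     # (T, n_filters)
    # Additive (Bahdanau-style) energy plus the location term.
    energies = np.tanh(W_q @ query + keys @ W_k.T + loc_feats @ W_f.T) @ v
    weights = np.exp(energies - energies.max())
    return weights / weights.sum()
```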

I'm most optimistic about (3), just because everything else that has relied on attention at test has failed on me. But it seems like major work.
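For reference, the core of option 3 is just duration-based upsampling of the character sequence; a tiny sketch with hypothetical names, not from the paper:

```python
import numpy as np

def upsample_by_duration(char_embeddings, durations):
    # char_embeddings: (n_chars, d); durations: (n_chars,) predicted frame counts.
    # Each character embedding is repeated for the frames it is predicted to span.
    return np.repeat(char_embeddings, durations, axis=0)  # (sum(durations), d)

# The spectrogram model then sees, for every output frame t,
# concat(upsampled_chars[t], f0[t]) and can be a plain LSTM stack;
# there is no attention left to collapse at inference time.
```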