rotorooter101 opened 5 years ago
Still working on this. There is a lot of varied prosody and emotion in my dataset that I am trying to capture. With a VAE, I can get the teacher_forcing ratio down to 0.25 or so; I think it's possible that with an expanded model and 500k-1M iterations, I could get down to zero.
At tf_ratio=0.25, here is the alignment after 230k steps (about 5 days of training):
And at test time (no dropout, tf_ratio=0.0), where the audio is basically a feed-forward hum after the first 0.5 s:
Surely my dataset must be similar to what others have used, e.g. for multi-speaker models? My current hunch is that the model's audio output is still so bad (or so varied) at this stage of training that it makes a poor query for the attention mechanism, so the attention layer can only give a fuzzy answer.
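To make the "query" part of that hunch concrete, here is a stripped-down sketch of a Tacotron-2-style decoder step (PyTorch, hypothetical names and simplified shapes, not either repo's actual code): the previous mel frame is what ultimately drives the attention query, so a garbage previous frame means a garbage query.

```python
import torch
import torch.nn as nn

class DecoderStepSketch(nn.Module):
    def __init__(self, n_mels=80, prenet_dim=256, attn_rnn_dim=1024, enc_dim=512):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, prenet_dim), nn.ReLU(),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU())
        self.attention_rnn = nn.LSTMCell(prenet_dim + enc_dim, attn_rnn_dim)

    def forward(self, prev_frame, prev_context, attn_hidden, attn_cell):
        # prev_frame is the ground-truth mel frame under teacher forcing,
        # or the model's own (possibly garbage) prediction at inference time.
        prenet_out = self.prenet(prev_frame)
        attn_hidden, attn_cell = self.attention_rnn(
            torch.cat([prenet_out, prev_context], dim=-1),
            (attn_hidden, attn_cell))
        # attn_hidden is the query handed to the location-sensitive attention;
        # if prev_frame is noise, the query (and hence the alignment) degrades.
        return attn_hidden, attn_cell
```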
Here are the options I am considering:
I'm most optimistic about (3), mostly because everything else I've tried that relies on attention at test time has failed me. But it looks like major work.
I have been working on TTS for several months now, and my (20+ hour) dataset is driving me crazy. At training time, keithito/tacotron and Rayhane-mamah/Tacotron2 both align fine, but when I switch to pure inference (with, of course, no teacher forcing), the alignment of the resulting utterance becomes wishy-washy and the wav is completely unintelligible.
Does anyone else have this problem? Maybe only for certain models or datasets, or early in the process?
I have made half-hearted attempts to compensate for what I guess is a difficult dataset: VAE, GST, and forced manual alignments at test and train. Nothing has worked yet, so I mostly just wanted to share my frustration here in case this sounds familiar to anyone.
The most obviously difficult part of my dataset is the prosody: there are silences of 0.4-4 s that cannot be anticipated, hence the VAE or GST. Splitting utterances to eliminate these gaps did not clearly solve the problem, but perhaps I should revisit it.
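For reference, this is roughly what I mean by splitting on silences, using librosa; the `top_db` value and the 0.4 s gap threshold are just illustrative guesses, not tuned values.

```python
import librosa
import soundfile as sf

def split_on_long_silences(path, max_gap_s=0.4, top_db=40):
    y, sr = librosa.load(path, sr=None)
    # Intervals (in samples) of non-silent audio.
    intervals = librosa.effects.split(y, top_db=top_db)
    chunks, cur_start, prev_end = [], intervals[0][0], intervals[0][1]
    for start, end in intervals[1:]:
        if (start - prev_end) / sr > max_gap_s:
            chunks.append(y[cur_start:prev_end])  # cut at the long pause
            cur_start = start
        prev_end = end
    chunks.append(y[cur_start:prev_end])
    for i, chunk in enumerate(chunks):
        sf.write(f"{path}.part{i}.wav", chunk, sr)
    return chunks
```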
Related question: does anyone actually use teacher forcing and anneal the ratio all the way down to zero? Is that a reasonable goal? I can get down to about 0.8 (from 1.00) by lowering the learning rate, but training fails if I push the ratio any lower.
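For context, this is how I think of the ratio: a per-step coin flip inside the decoder loop, with the probability annealed over training. A toy sketch with a made-up linear schedule, not the schedule from either repo:

```python
import random

def tf_ratio_at(step, start=1.0, end=0.0, decay_start=50_000, decay_steps=450_000):
    # Linear anneal from `start` to `end` after a warm-up period (made-up schedule).
    if step < decay_start:
        return start
    frac = min(1.0, (step - decay_start) / decay_steps)
    return start + frac * (end - start)

def choose_decoder_input(ground_truth_frame, predicted_frame, step):
    # With probability tf_ratio, feed the ground-truth frame (teacher forcing);
    # otherwise feed the model's own previous prediction (free running).
    if random.random() < tf_ratio_at(step):
        return ground_truth_frame
    return predicted_frame
```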