keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Bad alignment / good synthesis #141

Open jlin816 opened 6 years ago

jlin816 commented 6 years ago

What is the relationship between the alignment and the synthesis quality? In particular, my alignment looks bad, but the synthesized utterances sound quite good. My training set is relatively small -- should I interpret the bad alignment as it memorizing the training set, perhaps from the last few frames of input?

[Screenshot: training alignment plot, 2018-04-05]
rafaelvalle commented 6 years ago

Are the synthesized utterances good even when you're not running teacher-forced, that is, when you're not providing the mel-spectrograms to the Tacotron decoder? During inference, when we don't have access to mel-spectrograms, the models I've trained are not able to do proper inference unless they learn attention.
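To illustrate the distinction rafaelvalle is drawing, here is a minimal toy sketch (not the repo's code; `decoder_step` is a hypothetical stand-in for the attention RNN) showing how teacher forcing feeds ground-truth frames back into the decoder during training, while free-running inference must feed back the model's own predictions:

```python
# Toy sketch contrasting teacher-forced training with free-running
# inference in a Tacotron-style autoregressive decoder.

def decoder_step(prev_frame, state):
    # Hypothetical stand-in for the attention RNN + frame projection.
    new_state = 0.5 * state + 0.5 * prev_frame
    return new_state, new_state  # (output frame, new state)

def run_decoder(target_frames, teacher_forcing):
    state = 0.0
    prev = 0.0  # "go" frame
    outputs = []
    for target in target_frames:
        out, state = decoder_step(prev, state)
        outputs.append(out)
        # Teacher forcing feeds the ground-truth frame back in;
        # free-running inference feeds the model's own prediction.
        prev = target if teacher_forcing else out
    return outputs

targets = [1.0, 1.0, 1.0, 1.0]
forced = run_decoder(targets, teacher_forcing=True)
free = run_decoder(targets, teacher_forcing=False)
```

With teacher forcing the decoder is pulled back on track every step by the real frames, which is why a model can sound good teacher-forced even when its attention alignment never becomes diagonal; free-running, its errors compound.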

jlin816 commented 6 years ago

I've just been using the eval.py script and Synthesizer, and those synthesized examples sound fine. I'll look into the code later tonight, but is there a specific argument I have to set to disable teacher forcing?

rafaelvalle commented 6 years ago

Training should be teacher-forced by default. Do you see correct alignments during training?

jlin816 commented 6 years ago

Sorry, to clarify: the alignment above is from training (teacher-forced), and the synthesized samples sound good. The alignment below is from eval.py (my question above being: does Synthesizer use teacher forcing by default?). The samples also sound good, aside from the echoing mentioned in #133. [Image: eval-1000 alignment plot]

rafaelvalle commented 6 years ago

I see. The synthesized samples sound good because you're providing the real samples to the model, i.e. teacher forcing. During inference the model must rely on the samples it has generated itself, and without proper attention alignments it is pretty hard for it to synthesize the expected outputs.

LearnedVector commented 6 years ago

@jlin816 can you explain how you got the eval.py to also produce the alignments?

LearnedVector commented 6 years ago

Figured it out. For anyone wondering: in the `Synthesizer` class's `synthesize` function:

```python
wav, alignment = self.session.run([self.wav_output, self.model.alignments[0]], feed_dict=feed_dict)
```
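Once you have the alignment matrix from the `session.run` call above, you still need to visualize it. Here is a hypothetical helper (not part of the repo; `save_alignment` and the shape convention are assumptions) that renders the matrix with matplotlib the same way the training-time plots look:

```python
# Hypothetical helper to save an attention alignment matrix as an image.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def save_alignment(alignment, path):
    """alignment: 2-D array, rows = encoder steps, cols = decoder steps (assumed)."""
    fig, ax = plt.subplots()
    im = ax.imshow(alignment, aspect="auto", origin="lower", interpolation="none")
    fig.colorbar(im, ax=ax)
    ax.set_xlabel("Decoder timestep")
    ax.set_ylabel("Encoder timestep")
    fig.savefig(path)
    plt.close(fig)

# Example with a fake, roughly diagonal alignment (what a healthy model shows):
fake = np.eye(40, 120, dtype=np.float32)
save_alignment(fake, "alignment.png")
```

A well-trained model should produce a mostly diagonal band in this plot; the blurry or flat plots in this thread are the failure mode being discussed.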

lT-jk commented 6 years ago

Hello, everyone! Why does my alignment picture look like this? [Image: step-1200-align]

shartoo commented 6 years ago

My alignment picture looks like the one below, and I can't get good results using demo_server. [Image: step-20000-align]