jlin816 opened this issue 6 years ago
Are the synthesized utterances good even when you're not running teacher-forced, that is, when you're not providing the mel-spectrograms to the Tacotron decoder? At inference time, when we don't have access to mel-spectrograms, the models I've trained are not able to do proper inference unless they learn attention.
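For anyone less familiar with the distinction, here is a minimal sketch of the two decoding modes. The `decoder_step` function and the 80-channel mel frame are placeholders for illustration, not the repo's actual API:

```python
import numpy as np

def decode(decoder_step, encoder_outputs, mel_targets=None, max_steps=200):
    """Toy decoder loop contrasting teacher-forced and free-running decoding.
    `decoder_step` is a hypothetical callable (prev_frame, encoder_outputs) -> next_frame."""
    prev_frame = np.zeros(80)  # <GO> frame (80 mel channels assumed)
    outputs = []
    n_steps = len(mel_targets) if mel_targets is not None else max_steps
    for t in range(n_steps):
        frame = decoder_step(prev_frame, encoder_outputs)
        outputs.append(frame)
        if mel_targets is not None:
            prev_frame = mel_targets[t]  # teacher forcing: feed the ground-truth frame
        else:
            prev_frame = frame           # inference: feed back the model's own prediction
    return np.stack(outputs)
```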
I've just been using the eval.py script and Synthesizer, and those synthesized examples sound fine. I'll look into the code later tonight, but is there a specific argument I have to set to disable teacher forcing?
Training by default should be teacher-forced. Do you see correct alignments during training?
Sorry, to clarify: the alignment above is from training (teacher-forced), and synthesized samples sound good.
The alignment below is from eval.py (my question above being - does Synthesizer use teacher forcing by default?). The samples also sound good, aside from the echoing mentioned in #133.
I see. The synthesized samples sound good because you're providing the real samples to the model, i.e. teacher forcing. At inference time the model must rely on the samples it has generated itself, and without proper attention alignments it is pretty hard for it to synthesize the expected outputs.
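To make the dependence on attention concrete, here is a rough sketch of how the decoder's conditioning is built from the alignment weights; the names are illustrative, not the repo's actual code:

```python
import numpy as np

def attention_context(alignment_t, encoder_outputs):
    """alignment_t: (num_encoder_steps,) attention weights for decoder step t, summing to 1.
    encoder_outputs: (num_encoder_steps, hidden_dim).
    The context the decoder conditions on is a weighted sum of encoder states, so if the
    weights never advance over the input text, the decoder keeps getting the wrong (or a
    constant) conditioning and synthesis degrades."""
    return alignment_t @ encoder_outputs  # (hidden_dim,)

# During training, teacher forcing hides this problem: the decoder still receives correct
# previous frames even if the context is poor. At inference both the previous frame and the
# context can be wrong, so errors compound.
```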
@jlin816 can you explain how you got eval.py to also produce the alignments?
Figured it out. For anyone wondering, in the Synthesizer class's synthesize function:

`wav, alignment = self.session.run([self.wav_output, self.model.alignments[0]], feed_dict=feed_dict)`
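If you also want to dump the fetched alignment to an image, a minimal sketch (assuming matplotlib is available; the array orientation may need a transpose depending on how the model stores it) could look like:

```python
import matplotlib
matplotlib.use('Agg')  # works in headless environments
import matplotlib.pyplot as plt

def save_alignment(alignment, path):
    """alignment: 2-D array of attention weights (decoder steps x encoder steps, or the transpose)."""
    plt.figure(figsize=(6, 4))
    plt.imshow(alignment.T, aspect='auto', origin='lower', interpolation='none')
    plt.xlabel('Decoder timestep')
    plt.ylabel('Encoder timestep')
    plt.colorbar()
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
```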
Hello, everyone! Why does my alignment picture look like this?
My alignment picture looks like the one below, but I can't get good results using demo_server.
What is the relationship between the alignment and the synthesis quality? In particular, my alignment looks bad, but the synthesized utterances sound quite good. My training set is relatively small -- should I interpret the bad alignment as the model memorizing the training set, perhaps from just the last few frames of input?
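One rough way to tell a genuinely focused alignment from a diffuse or collapsed one is to summarize the attention matrix numerically; the heuristics below are my own, not something from the repo:

```python
import numpy as np

def alignment_stats(alignment):
    """alignment: (decoder_steps, encoder_steps) attention weights, rows summing to 1.
    Two rough diagnostics:
    - mean entropy per decoder step: high values suggest diffuse, unfocused attention
    - mean step of the attention peak: should advance steadily (small positive steps) for a
      clean left-to-right alignment; a constant peak or large jumps suggest the model is not
      really reading the input."""
    eps = 1e-8
    entropy = -(alignment * np.log(alignment + eps)).sum(axis=1).mean()
    peaks = alignment.argmax(axis=1)
    peak_step = np.diff(peaks).mean() if len(peaks) > 1 else 0.0
    return {'mean_entropy': float(entropy), 'mean_peak_step': float(peak_step)}
```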