keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Audio generated using eval is not the same as audio generated by demo_server for the same checkpoint #314

Open prateekgupta891 opened 4 years ago

prateekgupta891 commented 4 years ago

I have tried training on an emotion dataset which has multiple emotions for the same text. While training, at every checkpoint it generates an audio file using some text from the training data (I don't know how it samples the text), and the audio sounds good too. But if I take that model file and give it as a checkpoint to the demo_server.py code and generate audio for the same text, it does a terrible job. I have already trained it for 200K iterations, but still I am not able to generate anything except a muffled voice and some noise using the demo_server code. Is there a difference between the eval and the demo_server code? Please help!!

ghost commented 4 years ago

It has to do with "teacher forcing". Basically, eval is outputting what it is training on while using teacher forcing, whereas the demo server doesn't use teacher forcing. So yes, there is a difference in how they will sound, and eval will sound better. I'm not really an expert, so that's all I know.
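For intuition, here is a minimal sketch of the difference between the two decoding modes. This is a hypothetical `decode` helper written for illustration, not the actual code in this repo; the real Tacotron decoder predicts multiple mel frames per step and uses attention, but the feedback loop is the key point:

```python
import numpy as np

def decode(decoder_step, mel_targets=None, num_steps=100, frame_dim=80):
    """Illustrative decoder loop (hypothetical, not this repo's code).

    decoder_step: callable mapping the previous mel frame to the next one.
    mel_targets:  ground-truth mel frames; if given, teacher forcing is used.
    """
    prev_frame = np.zeros(frame_dim)  # <GO> frame
    outputs = []
    steps = len(mel_targets) if mel_targets is not None else num_steps
    for t in range(steps):
        frame = decoder_step(prev_frame)  # predict the next mel frame
        outputs.append(frame)
        if mel_targets is not None:
            # Teacher forcing (the audio saved during training): feed the
            # *ground-truth* frame back in, so prediction errors never compound.
            prev_frame = mel_targets[t]
        else:
            # Free running (demo_server.py): feed the model's *own* prediction
            # back in, so errors accumulate if the model hasn't converged.
            prev_frame = frame
    return np.stack(outputs)

# Toy usage: a "decoder" that just decays the previous frame.
step = lambda prev: 0.9 * prev + 0.1
targets = np.random.rand(50, 80)
teacher_forced = decode(step, mel_targets=targets)  # checkpoint-eval style
free_running = decode(step, num_steps=50)           # demo_server style
```

So the checkpoint audio is generated with the ground truth propping the decoder up at every step, while demo_server has to survive on its own outputs. A model that sounds fine under teacher forcing but produces muffled noise when free-running usually hasn't learned a stable attention alignment yet.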