Open dazenhom opened 6 years ago
@dazenhom Hi, thanks for your notes. In this repo, using use_gst=False doesn't mean the Tacotron1 model. Google also has another paper that uses a reference encoder for style and multi-speaker synthesis; you can find it at https://arxiv.org/abs/1803.09047.
@syang1993 Thanks for your reply, I mistakenly assumed your work was Tacotron1. I shall find another Tacotron1 implementation to run my test. Thanks anyway.
I have tried use_gst=False, but it seems to behave the same as Tacotron1? Although the refnet_outputs change, the generated audio hardly changes with different reference audio.
@hyzhan In my experience, it may be because of your data. If you use some expressive speakers as your training data and do the inference, the speech can be different (changing with the reference audio). Otherwise, it can differ very little, as you mentioned.
Thanks for your great work, but I found that if I set the hyperparameter use_gst=False and run, it behaved differently from my understanding of Tacotron1. The tacotron.py code is part of here. The original Tacotron1 shouldn't train with the reference encoder part, right? However, your code passes the non-GST mode data into a reference_encoder, which seems strange. Maybe we can swap the two if-condition branches to make it correct. Thanks!
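To make the suggestion concrete, here is a minimal sketch of the branching I would expect around the style embedding. This is not the repo's actual code: reference_encoder and style_token_layer are stand-in stubs, and only the use_gst flag name comes from the hyperparameters discussed above. The point is simply that when use_gst=False, no reference-encoder pass should happen at all.

```python
def reference_encoder(reference_mel):
    # Stand-in stub: the real module compresses a reference mel
    # spectrogram into a fixed-size prosody embedding.
    return [sum(reference_mel) / len(reference_mel)]

def style_token_layer(ref_embedding):
    # Stand-in stub: the real GST layer attends over learned style
    # tokens; here we just scale the embedding for illustration.
    return [x * 0.5 for x in ref_embedding]

def compute_style_embedding(use_gst, reference_mel):
    """Suggested branching: run the reference encoder only in GST mode."""
    if use_gst:
        ref = reference_encoder(reference_mel)
        return style_token_layer(ref)
    # use_gst=False: plain Tacotron1-style training, no style embedding.
    return None
```

With this arrangement, setting use_gst=False skips the reference encoder entirely, which matches the original Tacotron1 behavior described above.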