syang1993 / gst-tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

No clear speech #32

Closed ErnstTmp closed 5 years ago

ErnstTmp commented 5 years ago

Hi, I trained the Tacotron GST model with the default hyperparameters (batch size 12 for memory reasons) for 6 days (232,000 iterations) on a Titan X using the Blizzard 2013 dataset. The alignments are only partly linear, and the end of the text is repeated several times; see the enclosed file. The ending, as well as the word "test", is not intelligible. Is this caused by the batch size? eval-232000_ref-ca-bb-42-08-align Thanks and kind regards, Ernst

ishandutta2007 commented 5 years ago

Can you share the model you have trained so far?

ErnstTmp commented 5 years ago

Sure, the model is at https://www.dropbox.com/s/fl1vqfz6611s8zw/checkpoint234000.tar?dl=0

MengLiPKU commented 5 years ago

Do you have any ideas on this by now? I'm facing the same problem.

syang1993 commented 5 years ago

Hi, sorry for the late reply. In my early experiments, I found that the generated speech only became bad in some cases where I used an unseen reference audio.

I also think the batch size affects quality and stability. In addition, I found that multi-GPU training for Tacotron works better than a single-GPU model.

You could try reducing the number of attention heads (which gives better token diversity) and modifying the attention mechanism as we discussed before.
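For reference, a minimal sketch of how those two settings could be overridden before training, assuming the field names `num_heads` and `batch_size` exist in this repo's `hparams.py` (the repo follows the keithito-style `tf.contrib.training.HParams` pattern, so the exact names should be checked there):

```python
# Sketch only: num_heads / batch_size are assumed to match the names
# defined in this repo's hparams.py (keithito-style HParams object).
from hparams import hparams

# Fewer attention heads over the style tokens (better token diversity,
# per the suggestion above) and a larger batch for more stable alignment.
hparams.parse('num_heads=4,batch_size=32')

print(hparams.num_heads, hparams.batch_size)
```

The same comma-separated string can typically be passed on the command line via the `--hparams` flag of a keithito-style `train.py`, if this fork kept that option.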

ErnstTmp commented 5 years ago

Thanks a lot for your help. The batch size does seem to be the problem: I had to reduce it due to memory limits, and this appears to have caused the issues.

ZohaibAhmed commented 5 years ago

@ErnstTmp Did you try it with multi-GPU training at a higher batch size? I was keen to see whether the results were better.

ErnstTmp commented 5 years ago

I switched temporarily to Tacotron-2, which works on a single GPU with more memory (16 GB) at batch size 32. Also, I could increase the number of parallel outputs (frames predicted per decoder step) to be able to run batch size 32 on a 12 GB GPU.
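For anyone hitting the same memory limit, a hedged sketch of that trade-off: in keithito-style Tacotron code the number of parallel outputs is the reduction factor, usually named `outputs_per_step`. Raising it makes the decoder emit more mel frames per step, which shortens the unrolled decoder and lowers GPU memory use, at some cost in output quality. The field name is an assumption from that codebase; check the repo's `hparams.py`:

```python
# Sketch only: outputs_per_step is the usual name for the reduction
# factor r in keithito-style hparams.py; verify it in this repo.
from hparams import hparams

# Predict more frames per decoder step (larger r) so that a batch of 32
# fits on a 12 GB GPU; expect slightly coarser prosody in exchange.
hparams.parse('outputs_per_step=5,batch_size=32')
```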