Can you share the model you have so far?
Sure, the model is at https://www.dropbox.com/s/fl1vqfz6611s8zw/checkpoint234000.tar?dl=0
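For anyone else picking this thread up: the `.tar` file looks like a standard PyTorch checkpoint, so something along these lines should open it for inspection. This is only a sketch; the dict key names are assumptions, check what the repo's save code actually writes:

```python
import torch

# Minimal sketch for inspecting the shared checkpoint; the key names
# below are assumptions - check what the repo's save code actually writes.
ckpt = torch.load("checkpoint234000.tar", map_location="cpu")
print(ckpt.keys())

# model = GSTTacotron(hparams)          # hypothetical class / hparams object
# model.load_state_dict(ckpt["model"])  # adjust the key to match the file
```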
Do you have any ideas now? I'm facing the same problem.
Hi, sorry for the late reply. In my early experiments, I found the generated audio would sometimes turn bad, but only when I used unseen reference audio.
I also think the batch size affects quality and stability, and I found that multi-GPU training works better for Tacotron than a single-GPU model.
You could also try reducing the number of attention heads (for better token diversity) and modifying the attention mechanism as we discussed before.
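Not from the repo itself, but as a rough sketch of the two suggestions above (multi-GPU training via `nn.DataParallel`, plus a lower head count). `GSTTacotron` and the hparams field names here are placeholders for whatever the actual implementation defines:

```python
import torch
import torch.nn as nn

# Sketch only: GSTTacotron and the hparams field names are assumptions
# standing in for whatever the repo actually defines.
hparams.num_heads = 4    # e.g. halve a default of 8 for better token diversity
hparams.batch_size = 32  # larger batches seemed to help stability

model = GSTTacotron(hparams)
if torch.cuda.device_count() > 1:
    # Replicates the model and splits each batch across the visible GPUs.
    model = nn.DataParallel(model)
model = model.cuda()
```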
Thanks a lot for your help - batch size seems to be the problem. I had to reduce it due to memory limits, and this seems to have caused the issues.
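If anyone else hits the same memory wall, gradient accumulation is a common way to recover a larger effective batch size on a single GPU. A rough PyTorch sketch, with placeholder names rather than the repo's actual training loop:

```python
# Sketch: emulate an effective batch size of 32 with micro-batches of 8.
# model, optimizer and loader are placeholders, not the repo's actual names.
accum_steps = 4  # 4 micro-batches of 8 ~ one batch of 32

optimizer.zero_grad()
for step, (texts, mels) in enumerate(loader):
    loss = model(texts, mels)        # placeholder forward pass + loss
    (loss / accum_steps).backward()  # scale so the summed grads average out
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```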
@ErnstTmp Did you attempt it with multi-GPU at a higher batch size? I was keen to see if the results were better.
I switched temporarily to Tacotron-2, which works on a single GPU with more memory (16 GB) and batch size 32. Also, I could increase the number of parallel outputs to fit batch size 32 on a 12 GB GPU.
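For reference, the knob meant here is Tacotron's reduction factor, i.e. how many mel frames the decoder predicts per step; the exact hparam name is an assumption and varies between implementations:

```python
# Hedged example - the exact hparam name varies between implementations
# (outputs_per_step, r, reduction_factor, ...).
hparams.outputs_per_step = 3  # more mel frames per decoder step: fewer
                              # decoder iterations, less memory, so a
                              # batch size of 32 fits on a 12 GB GPU
```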
Hi, I trained the Tacotron GST model with the default hyperparameters (batch size 12 for memory reasons) for 6 days (232,000 iterations) on a Titan X with the Blizzard 2013 dataset. The alignments are only partly linear, and the end of the text repeats several times; see the enclosed file. The ending, plus the word "test", is not understandable. Is this caused by the batch size? Thanks and kind regards, Ernst