keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

concatenation of attention context vector with the last LSTM output #120

Open lgit2017 opened 6 years ago

lgit2017 commented 6 years ago

@keithito In the Tacotron 2 paper https://arxiv.org/abs/1712.05884, the authors state that "The concatenation of the LSTM output and the attention context vector is then projected through a linear transform to produce a prediction of the target spectrogram frame." Was there a reason you did not concatenate the attention context vector with the last LSTM output?
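For reference, the projection step quoted from the paper can be sketched roughly as below. All dimensions and variable names here are illustrative assumptions, not taken from this repo's code; numpy stands in for the actual TensorFlow ops.

```python
import numpy as np

# Hedged sketch of the Tacotron 2 decoder output step quoted above:
# the decoder LSTM output and the attention context vector are concatenated,
# then projected through a linear transform to predict one spectrogram frame.
rng = np.random.default_rng(0)

lstm_units = 1024    # decoder LSTM output size (illustrative)
context_dim = 512    # attention context vector size (illustrative)
n_mels = 80          # target mel-spectrogram channels (illustrative)

lstm_output = rng.standard_normal(lstm_units)
attention_context = rng.standard_normal(context_dim)

# Concatenate the LSTM output with the attention context vector.
decoder_state = np.concatenate([lstm_output, attention_context])

# Linear transform to the spectrogram frame (random placeholder weights).
W = rng.standard_normal((n_mels, lstm_units + context_dim))
b = np.zeros(n_mels)
frame_prediction = W @ decoder_state + b

print(frame_prediction.shape)  # (80,)
```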

rafaelvalle commented 6 years ago

@lgit2017 this is not done inside of the BasicDecoder, right?

keithito commented 6 years ago

I tried, but it failed to learn an alignment (after running for about 30k steps). It's on my TODO list to figure out why, but I haven't had much time to work on this repo lately. If you have time to give it a shot, please let me know how it goes!

rafaelvalle commented 6 years ago

@keithito what data are you feeding into the decoder RNN? For Tacotron 2, they use the attention context and the decoder's previous output, not the attention mechanism's previous output. I think you're doing the latter... https://github.com/keithito/tacotron/blob/tacotron2-work-in-progress/models/tacotron.py#L65
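The distinction can be sketched roughly as follows. Names and dimensions are illustrative assumptions for clarity, not the repo's actual code; numpy stands in for the TensorFlow graph.

```python
import numpy as np

# Hedged sketch contrasting the two decoder-input choices under discussion.
rng = np.random.default_rng(1)

prenet_dim = 256   # pre-net output size for the previous frame (illustrative)
context_dim = 512  # attention context vector size (illustrative)

# Decoder's previous output, passed through the pre-net (Tacotron 2 input).
prev_frame_prenet = rng.standard_normal(prenet_dim)
# Attention cell's previous output (the variant being questioned here).
prev_attention_out = rng.standard_normal(context_dim)
# Attention context vector for the current step.
attention_context = rng.standard_normal(context_dim)

# Tacotron 2 (per the paper): previous *decoder* output + attention context.
tacotron2_input = np.concatenate([prev_frame_prenet, attention_context])

# Suspected variant: previous *attention* output + attention context.
variant_input = np.concatenate([prev_attention_out, attention_context])

print(tacotron2_input.shape, variant_input.shape)
```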