Closed: lisabug closed this issue 8 years ago
You need the clones to store the intermediate activations at each timestep. These are needed to compute the gradients correctly.
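A minimal sketch of the idea (Torch7 nn; nn.Linear stands in for the per-timestep LSTM module, and T and the sizes are made up, so this is not the actual practical6 code). Each clone shares the weights and gradient buffers, but keeps its own input/output buffers from its forward call, which is exactly what backward() needs later:

```lua
require 'nn'

local T = 5                      -- number of unrolled timesteps (assumption)
local proto = nn.Linear(10, 10)  -- stand-in for the per-timestep LSTM module

-- Clones share weight/bias and gradWeight/gradBias, but each clone has
-- its own activation buffers.
local clones = {}
for t = 1, T do
  clones[t] = proto:clone('weight', 'bias', 'gradWeight', 'gradBias')
end

-- Forward through time, keeping every intermediate input/output.
local inputs, outputs = {}, {}
local h = torch.zeros(10)
for t = 1, T do
  inputs[t] = h
  h = clones[t]:forward(inputs[t])
  outputs[t] = h
end

-- Backward through time: clone t still holds the activations from its own
-- forward call, so its backward() uses the right state. The shared
-- gradWeight/gradBias accumulate gradients across all timesteps.
local dh = torch.ones(10)
for t = T, 1, -1 do
  dh = clones[t]:backward(inputs[t], dh)
end
```

With a single module instead of clones, each new forward call would overwrite the buffers from the previous timestep, so by the time you ran backward from t = T down to 1, only the activations of the last step would still be available and the earlier gradients would be wrong.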
I got it, thanks for your help :)
I'm curious about why we should unroll the LSTM to T timesteps. Since all the copies share the same parameters, every backprop step changes the LSTM's parameters and gradParameters anyway. Why can't we just use one LSTM, run the forward pass repeatedly from t = 1 to T, and then run the backward pass from t = T down to 1? Because I want to apply the LSTM as a decoder to generate sentences, I have to handle sequences of variable length. I tried this approach, but it failed. Could someone help me? Thanks.