Closed weiHelloWorld closed 3 years ago
Hi, the decoder does indeed take the output target as input in the original paper. But using the Transformer that way requires running an iterative prediction: predicting the output variables at time t, then running the decoder again with this additional prediction to get a prediction for t+1. Because we are dealing with very long time sequences, much longer than typical sentences (672 time steps for a month's worth of data, for instance), we can't afford that kind of computational cost.
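To make the cost concrete, here is a minimal sketch of the iterative scheme described above. `decoder_pass` is a hypothetical stand-in for a full decoder forward pass, not the repository's actual API; the point is only that one prediction per time step means one decoder pass per time step.

```python
# Hypothetical sketch of iterative (autoregressive) prediction.
# `decoder_pass` stands in for a full decoder forward pass; names are
# illustrative only, not taken from the repository.

def decoder_pass(prefix):
    # Placeholder for decoding the current prefix; returns a dummy
    # "next-step prediction" and counts how often it is called.
    decoder_pass.calls += 1
    return len(prefix)

decoder_pass.calls = 0

T = 672  # one month of time steps, as in the thread

# One full decoder pass per predicted time step.
preds = []
for t in range(T):
    preds.append(decoder_pass(preds))

assert decoder_pass.calls == T  # 672 decoder passes for a single sequence
```

With self-attention being quadratic in sequence length, repeating the decoder pass T times quickly becomes prohibitive for sequences this long.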
Instead, I am treating the Transformer as an improved encoder-decoder architecture, inspired by this paper.
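The alternative described above can be sketched as follows: the encoder output is fed to the decoder both as its input sequence and as the attention memory, so the whole output sequence is produced in a single pass. The functions below are toy stand-ins, not the repository's actual layers.

```python
# Minimal sketch of the non-autoregressive use of the Transformer
# described above. `encoder` and `decoder` are toy stand-ins for the
# real attention layers; only the data flow is meaningful here.

def encoder(x):
    return [v + 1 for v in x]  # stand-in for the encoder stack

def decoder(inp, memory):
    # The decoder consumes the encoder output as its input AND attends
    # to it as memory; no output target is fed in at any point.
    return [i + m for i, m in zip(inp, memory)]

x = [1, 2, 3]
enc_out = encoder(x)
y = decoder(enc_out, memory=enc_out)  # one pass over the full sequence
assert len(y) == len(x)
```

This trades the autoregressive dependency for a single forward pass, which is what makes month-long sequences tractable.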
Got it, thanks for the explanation!
Hi,
Thanks for this excellent implementation!
I have been playing with this model, and am wondering about a small detail regarding the decoder input. Here, you use the encoder output as the decoder input (it also serves as the `memory` for the decoder layer here), instead of using the output target as the decoder input (as is given by `trg` in this post for the translation task). Do you have any idea why we do not use the output target as the decoder input? I assume that with proper masking, future information would not be included in the prediction, but I am not sure if I am missing anything. Thank you!
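For reference, the masking mentioned above is the usual "subsequent" (causal) mask: position t may only attend to positions up to t, so feeding the target in would not leak future values. A minimal sketch, assuming the common convention that True marks an allowed attention position:

```python
import numpy as np

# Causal (subsequent) mask: entry [t, s] is True iff the query at time t
# may attend to the key at time s, i.e. s <= t. This is the standard
# mask used with teacher forcing; it is a generic sketch, not code from
# the repository.

def causal_mask(size: int) -> np.ndarray:
    return np.tril(np.ones((size, size), dtype=bool))

m = causal_mask(4)
assert not m[1, 2]  # step 1 cannot attend to the future step 2
assert m[3, 0]      # step 3 can attend to the past step 0
```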