maxjcohen / transformer

Implementation of the Transformer model (originally from "Attention Is All You Need") applied to time series.
https://timeseriestransformer.readthedocs.io/en/latest/
GNU General Public License v3.0

question about decoder input #41

Closed · weiHelloWorld closed this issue 3 years ago

weiHelloWorld commented 3 years ago

Hi,

Thanks for this excellent implementation!

I have been playing with this model, and am wondering about a small detail regarding the decoder input. Here, you use the encoder output as the decoder input (it also serves as the memory for the decoder layer here), instead of using the output target as the decoder input (as is done with trg in this post for the translation task). Do you have an idea why we do not use the output target as the decoder input? I assume that with proper masking, future information would not be included in the prediction, but I am not sure whether I am missing anything. Thank you!
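For concreteness, here is a minimal sketch of the variant I am describing, written against PyTorch's generic nn.Transformer rather than this repository's classes (shapes and sizes are made up): the decoder consumes the target sequence itself, with a causal mask so that position t cannot attend to later positions.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 8, 672, 64          # hypothetical sizes
src = torch.randn(batch, seq_len, d_model)    # encoder input
trg = torch.randn(batch, seq_len, d_model)    # known target sequence (training time)

model = nn.Transformer(d_model=d_model, batch_first=True)

# Causal mask: position t may only attend to positions <= t.
causal_mask = model.generate_square_subsequent_mask(seq_len)

# Decoder input is the target itself, as in "Attention Is All You Need".
out = model(src, trg, tgt_mask=causal_mask)   # (batch, seq_len, d_model)
```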

maxjcohen commented 3 years ago

Hi, the decoder does indeed take the output target as input in the original paper. But training the Transformer that way requires running an iterative prediction, i.e. predicting the output variables at time t, then running the decoder again with this additional prediction to get a prediction for t+1. Because we are dealing with very long time sequences, much longer than typical sentences (672 time steps for a month's worth of data, for instance), we can't afford this kind of computational cost.
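To make that cost concrete, here is a rough sketch of such an iterative prediction, again with a generic nn.Transformer rather than this repository's actual modules: one decoder pass per time step, so 672 passes for a month of data.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 8, 672, 64          # hypothetical sizes
src = torch.randn(batch, seq_len, d_model)

model = nn.Transformer(d_model=d_model, batch_first=True)
memory = model.encoder(src)                   # encode once

# Grow the predicted sequence by one time step per decoder pass.
pred = torch.zeros(batch, 1, d_model)         # placeholder start token
for _ in range(seq_len):                      # 672 decoder passes
    mask = model.generate_square_subsequent_mask(pred.size(1))
    out = model.decoder(pred, memory, tgt_mask=mask)
    pred = torch.cat([pred, out[:, -1:, :]], dim=1)   # append newest step
```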

Instead, I am treating the Transformer as an improved encoder-decoder architecture, inspired by this paper.
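By contrast, a sketch of this single-pass, encoder-decoder style usage (still with the generic nn.Transformer, only to illustrate the idea; the repository's own modules differ in the details): the encoder output is fed to the decoder both as its input and as its memory, so the whole horizon is predicted in one forward pass and no target sequence is needed.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 8, 672, 64          # hypothetical sizes
src = torch.randn(batch, seq_len, d_model)

model = nn.Transformer(d_model=d_model, batch_first=True)

# Encode once, then reuse the encoder output as both decoder input and memory.
memory = model.encoder(src)
out = model.decoder(memory, memory)           # (batch, seq_len, d_model), single pass
```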

weiHelloWorld commented 3 years ago

Got it, thanks for the explanation!