thuml / Autoformer

Code release for "Autoformer: Decomposition Transformers with Auto-Correlation for Long-term Series Forecasting" (NeurIPS 2021), https://arxiv.org/abs/2106.13008
MIT License

Zeros for Decoder Input #101

Closed · deeepwin closed this 2 years ago

deeepwin commented 2 years ago

I noticed that during training and testing you feed the decoder with zeros, via torch.zeros_like(), for the future input values. Why do you do that? Wouldn't it make more sense to use teacher forcing, i.e., provide the decoder with the true future values? During inference, the outputs would then need to be fed back. I'm not sure whether Auto-Correlation permits that, but for the vanilla Transformer version it should be possible.
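For concreteness, this is the pattern I mean. A minimal sketch based on how the experiment scripts appear to build the decoder input; the names batch_y, label_len, and pred_len are my assumptions, not verbatim code:

```python
import torch

batch_y = torch.randn(32, 72, 7)   # [batch, label_len + pred_len, features]
label_len, pred_len = 48, 24

# The future positions are filled with zeros via torch.zeros_like()...
dec_inp = torch.zeros_like(batch_y[:, -pred_len:, :])
# ...and prepended with the known "label" segment of the series.
dec_inp = torch.cat([batch_y[:, :label_len, :], dec_inp], dim=1)
```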

Appreciate your view on that. Thanks.

wuhaixu2016 commented 2 years ago

Hi. Since this is a forecasting task, the "true future values" are inaccessible.

Also, I do not understand parts of your question. (1) What is "teacher forcing"? (2) What does "fed back" mean in "during inference, the outputs would then need to be fed back"?

Would you please detail these concepts?

I think Auto-Correlation is just as usable in this respect as the vanilla Transformer's attention.

deeepwin commented 2 years ago

Hi, thanks for your quick response!

Yes, the future values won't be accessible during inference, but during training you can use the ground truth (data_y in your code).

(1) I meant teacher forcing as it is done in sequence-to-sequence models, see here. I was just wondering whether you had a reason not to use it, and why you used all zeros as the future values for the decoder.

(2) With teacher forcing, prediction would proceed time step by time step. Because you do not have access to the true future values at inference, you would feed the predictions back into the decoder inputs in place of the ground truth (see the sketch below).

Here is an article doing the same thing.
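To make both points concrete, here is a toy step-by-step decoder with a hypothetical GRU head (not the Autoformer architecture) that teacher-forces during training and feeds its own predictions back at inference:

```python
import torch
import torch.nn as nn

feat, hidden_size = 7, 64
decoder = nn.GRU(input_size=feat, hidden_size=hidden_size, batch_first=True)
proj = nn.Linear(hidden_size, feat)

def decode(hidden, start, targets=None, pred_len=24):
    # hidden: encoder state [1, B, hidden]; start: last observed value [B, 1, F]
    # targets: ground truth [B, pred_len, F], only available during training
    inp, outputs = start, []
    for t in range(pred_len):
        out, hidden = decoder(inp, hidden)
        pred = proj(out)                 # one-step prediction
        outputs.append(pred)
        # Teacher forcing: feed the true value as the next input;
        # at inference (targets=None) the prediction is fed back instead.
        inp = targets[:, t:t+1, :] if targets is not None else pred
    return torch.cat(outputs, dim=1)

hidden0 = torch.zeros(1, 32, hidden_size)
start = torch.randn(32, 1, feat)
train_out = decode(hidden0, start, targets=torch.randn(32, 24, feat))  # teacher forcing
infer_out = decode(hidden0, start)                                     # feedback loop
```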

wuhaixu2016 commented 2 years ago

Thanks for your explanation.

(1) A Seq2Seq framework suffers badly from error accumulation and cannot exploit parallel computation. Thus, we use this one-step forecasting strategy (generating the whole horizon in a single forward pass) for the long-term forecasting setting.
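To illustrate the contrast, a toy sketch (the linear layer is a stand-in model, not Autoformer):

```python
import torch
import torch.nn as nn

pred_len, feat = 24, 7
model = nn.Linear(feat, feat)             # stand-in one-step map, for illustration

# Direct ("one-step") forecasting: the whole horizon comes out of a single
# forward pass, computed in parallel, so per-step errors are never fed back.
direct = model(torch.randn(32, pred_len, feat))   # [B, pred_len, F] at once

# Autoregressive Seq2Seq decoding: each step consumes the previous prediction,
# so errors compound and the loop is inherently sequential.
inp, steps = torch.randn(32, 1, feat), []
for _ in range(pred_len):
    inp = model(inp)
    steps.append(inp)
autoregressive = torch.cat(steps, dim=1)          # [B, pred_len, F], step by step
```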

(2) Scheduled sampling can be helpful not only in Seq2Seq, but also in our one-step forecasting setting. It is a good idea.
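For reference, scheduled sampling is usually implemented by feeding the ground truth with a probability that decays over training, e.g. the inverse-sigmoid schedule from Bengio et al. (2015); a minimal sketch:

```python
import math
import random

def use_ground_truth(step, k=1000.0):
    # Inverse-sigmoid decay: the probability starts near 1 (mostly teacher
    # forcing) and decays toward 0 (mostly the model's own predictions).
    p = k / (k + math.exp(step / k))
    return random.random() < p

# Inside a step-by-step decoder loop one would then choose:
#   inp = targets[:, t:t+1, :] if use_ground_truth(global_step) else pred
```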

We have also explored teacher forcing before and proposed reverse scheduled sampling. You may be interested in this paper: https://arxiv.org/pdf/2103.09504.pdf.

deeepwin commented 2 years ago

Great, thanks a lot for the explanation.