thuml / Autoformer

Code release for "Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting" (NeurIPS 2021), https://arxiv.org/abs/2106.13008
MIT License

Regarding the "batch normalization style design" in AutoCorrelation during training #167

Closed linfeng-du closed 1 year ago

linfeng-du commented 1 year ago

Dear authors, I noticed in the code that you aggregate (by taking the mean of) the correlation scores across heads and channels before selecting the top c·log(L_KV) delays, which seems reasonable. However, during training you also aggregate across examples in the batch. Could you elaborate on why you do that and why it is beneficial? Is it mainly to speed up training, since examples in the same batch should have the same periodicity (and hence roughly the same selected delays)? Using the same delays for every example in a batch during training would impose some inductive bias on the model. Also, is that the only difference between time_delay_agg_train and time_delay_agg_inference?
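To make the question concrete, here is a minimal sketch (not the repo's actual code; the helper name and the `factor` parameter are hypothetical) of the "batch normalization style" delay selection being asked about: in training mode, the correlation scores are averaged over the batch, heads, and channels, so every example in the batch shares the same top-k delays.

```python
import math
import torch

def select_delays_train(corr, factor=1):
    """Sketch of batch-style delay selection (training mode).

    corr: [B, H, C, L] autocorrelation scores per example, head,
    channel, and lag. Averages over batch, heads, and channels, then
    picks the same top-k = factor * log(L) delays for the whole batch.
    Hypothetical helper; names differ from time_delay_agg_train itself.
    """
    B, H, C, L = corr.shape
    k = int(factor * math.log(L))
    # one shared score per lag: mean over examples, heads, and channels
    mean_corr = corr.mean(dim=(0, 1, 2))        # shape [L]
    weights, delays = torch.topk(mean_corr, k)  # shared across the batch
    return weights, delays
```

With a sequence length of 96 and `factor=1`, this selects int(log 96) = 4 shared delays regardless of batch size.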

Besides, is there any difference between these two methods and time_delay_agg_full? Implementation-wise they look similar.

Thanks!

wuhaixu2016 commented 1 year ago

Hi, thanks for your interest.

(1) Why aggregate across examples during training? As you stated, one obvious benefit is speeding up training. Besides, aggregating across examples also reduces noise in the period estimation, yielding a better aggregation among sub-series.

(2) Is that the only difference between time_delay_agg_train and time_delay_agg_inference? Yes. In time_delay_agg_train, the period is estimated from multiple samples; in time_delay_agg_inference, it is calculated sample by sample.

(3) Is there any difference between these two methods and time_delay_agg_full? time_delay_agg_full provides channel-wise lags; it is the complete version of the sample-wise-lag time_delay_agg_inference.
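The distinction in (2) and (3) comes down to which axes the correlation scores are averaged over before the top-k lags are picked. A minimal sketch of the two granularities described above (helper names and the `factor` parameter are hypothetical, not the repo's API):

```python
import math
import torch

def select_delays_inference(corr, factor=1):
    """Per-sample delays (inference-mode granularity, sketch):
    average only over heads and channels, so each example in the
    batch keeps its own top-k lags."""
    B, H, C, L = corr.shape
    k = int(factor * math.log(L))
    mean_corr = corr.mean(dim=(1, 2))        # shape [B, L]
    return torch.topk(mean_corr, k, dim=-1)  # one set of lags per example

def select_delays_full(corr, factor=1):
    """Channel-wise delays (full-mode granularity, sketch): no
    averaging at all, so every (example, head, channel) slice gets
    its own top-k lags."""
    B, H, C, L = corr.shape
    k = int(factor * math.log(L))
    return torch.topk(corr, k, dim=-1)       # shape [B, H, C, k]
```

So the three modes form a spectrum: batch-shared lags (train), per-sample lags (inference), and per-channel lags (full), trading estimation noise and compute for granularity.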