salesforce / fsnet

BSD 3-Clause "New" or "Revised" License

FSNet may not beat the naive model #4

Open HappyWalkers opened 1 year ago

HappyWalkers commented 1 year ago

I have done some experiments with FSNet and it works well. However, after plotting the ground truth against FSNet's predictions, I realized that the naive model, which simply shifts the ground truth one step to the left, is also a strong baseline. By shifting one step, I mean using only the latest observed data point as the prediction for the next one. Astonishingly, the naive model beats FSNet on the ETTh2 dataset! I don't have the code and results at hand now, but roughly speaking, the MSE of the naive model on ETTh2 is 0.40 while the MSE of FSNet is 0.466. I believe the results are easy to reproduce.
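For concreteness, here is a minimal sketch of the naive baseline I mean (my own illustration; the function name and variables are hypothetical, not from the FSNet code):

```python
import numpy as np

def naive_one_step_mse(y: np.ndarray) -> float:
    """Predict y[t] with y[t-1] (the one-step naive model) and report the MSE."""
    pred = y[:-1]   # the prediction for step t is simply the observed value at t-1
    true = y[1:]
    return float(np.mean((pred - true) ** 2))
```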

Another interesting observation is that the precision achieved by the backbone, TS2Vec, is far better than that achieved by FSNet. For univariate forecasting on ETTh2 with a forecasting horizon of 24, TS2Vec achieves an MSE of 0.090, while FSNet achieves an MSE of 0.687 on the same dataset and horizon. The following tables are from the FSNet and TS2Vec papers, respectively.

[Screenshots: result tables from the FSNet and TS2Vec papers]

The comparison between TS2Vec and FSNet is not surprising, because FSNet assumes streaming data and gives up batch training. However, FSNet is also beaten by the naive model, which is a little embarrassing. The naive model is strong in the online learning setting because it adapts instantly to abrupt change points in the time series, which keeps its prediction error low. The reason behind this rapid adaptation is that the naive model lags the target series by only one step, while a neural network still needs multiple gradient steps to follow an abrupt change. (I had a pretty good figure to explain this, but the figure is not available now.) The network is sensitive to abrupt changes in the online setting because the loss spikes when a change happens and the network is forced to adapt to reduce the error. However, the fixed gradient step size limits how fast the network can adapt, leading to a dilemma between fast learning and overreaction. (I also had a good figure for this, but ...)

Good evidence for the strength of the naive model, and for the dilemma, is the sequence length actually used for prediction. Although FSNet and its backbone TS2Vec claim to use multiple past time steps to predict future values, the regressor of their networks takes in only the last intermediate representation. The statement in TS2Vec is as follows:

[Screenshot: excerpt from the TS2Vec paper on the forecasting protocol]

The corresponding code in TS2Vec and FSNet is as follows:

https://github.com/yuezhihan/ts2vec/blob/main/tasks/forecasting.py
[Screenshot: the relevant snippet from tasks/forecasting.py]

https://github.com/salesforce/fsnet/blob/main/exp/exp_fsnet.py
[Screenshot: the relevant snippet from exp/exp_fsnet.py]
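In spirit, both snippets follow the pattern below. This is my paraphrase with placeholder names (`encoder`, `LastStepForecaster`), not the repositories' actual code:

```python
import torch
import torch.nn as nn

class LastStepForecaster(nn.Module):
    """Paraphrase of the pattern in both repos: the encoder emits one
    representation per time step, but only the last one feeds the head."""
    def __init__(self, encoder: nn.Module, repr_dim: int, horizon: int):
        super().__init__()
        self.encoder = encoder                 # e.g. a dilated causal TCN
        self.head = nn.Linear(repr_dim, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)                    # (batch, seq_len, repr_dim)
        return self.head(h[:, -1, :])          # only the last time step is used
```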

I also tried using all the intermediate representations to predict the future values, but it turns out the precision is worse than predicting with only the last representation. This result breaks my belief that a longer sequence input will perform better. I guess the reason is that a longer input makes the fast-learning behavior harder to achieve and makes the sequence mapping between the past and the future harder to learn. I suppose that is why TS2Vec uses only the last hidden representation.
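The variant I tried looks roughly like this (again a sketch with hypothetical names): every per-step representation is flattened into one vector before the head, which makes the regressor much wider.

```python
import torch
import torch.nn as nn

class AllStepsForecaster(nn.Module):
    """Sketch of the variant I tried: use every intermediate representation."""
    def __init__(self, encoder: nn.Module, repr_dim: int, seq_len: int, horizon: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(repr_dim * seq_len, horizon)  # much wider head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)                    # (batch, seq_len, repr_dim)
        return self.head(h.flatten(1))         # concatenate all time steps
```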

The FSNet paper precisely describes the difficulties of online learning and the dilemma between fast learning and persistent memory. Training on a single sample at a time is computationally efficient, but the performance could be improved further.

phquang commented 1 year ago

Hi @HappyWalkers,

Many thanks for your interest in our work.

FSNet vs naive model

You are right that the naive model can achieve an MSE of 0.4 on ETTh2 with forecast window H=1, but it performs poorly with longer forecast windows and on other datasets. I believe the naive model you mention can be referred to as Historical Inertia (HI) [1]. After this work got published, I compared against this model in my internal report, and the result was as follows (it has been a while, so this might not be the latest implementation, but it is a pretty good one).

[Table: internal comparison of FSNet against the HI baseline]

[1] Cui, Yue, Jiandong Xie, and Kai Zheng. "Historical Inertia: A Neglected but Powerful Baseline for Long Sequence Time-Series Forecasting." Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM). 2021.
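For reference, the HI baseline can be sketched as follows (a generic illustration of the description in [1], not the implementation behind the table above):

```python
import numpy as np

def historical_inertia(history: np.ndarray, horizon: int) -> np.ndarray:
    """Historical Inertia baseline [1]: predict the next `horizon` values
    as a copy of the last `horizon` observed values. With horizon == 1
    this reduces to the naive one-step model discussed above."""
    assert len(history) >= horizon
    return history[-horizon:].copy()
```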

FSNet vs TCN

The TS2Vec table you provided was for univariate forecasting, while our experiments were multivariate. I believe we already compared with TCN and named this baseline OnlineTCN.

Using the last time step representation for forecasting

Given that the TCN uses large dilation rates, the last time step in the final layer already captures the representation of the whole sequence. However, I am also not sure why the authors decided to use only the last time step rather than max pooling or mean pooling over the whole sequence. In this project, we investigated how to facilitate fast online forecasting given a backbone, for which we adopted the TCN. In our OpenReview discussion (https://openreview.net/forum?id=q-PbpHD3EOk), we also explored an MLP backbone, for which I believe we used max pooling to combine the features (see "Follow-up response to Reviewer r44V (part 2)"). We observed similar performance gains for FSNet over the other strategies in that case as well.
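The three read-out choices mentioned here can be sketched as follows (an illustration with a hypothetical `read_out` helper, not code from the repository):

```python
import torch

def read_out(h: torch.Tensor, mode: str = "last") -> torch.Tensor:
    """Collapse per-step representations h of shape (batch, seq_len, dim)
    into one vector per series."""
    if mode == "last":    # what the TCN head uses
        return h[:, -1, :]
    if mode == "max":     # max pooling over time (as with the MLP backbone)
        return h.max(dim=1).values
    if mode == "mean":    # mean pooling over time
        return h.mean(dim=1)
    raise ValueError(f"unknown mode: {mode}")
```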

Benefits of longer input sequences

Longer input sequences are more informative but also contain more noise, so we cannot conclude that the current backbone would perform better with longer inputs. Even in the batch training setting, different methods use different input sequence lengths. I think this is a challenge not only in online forecasting but also in the conventional setting.

Hope my reply clarifies your concerns.

Quang

HappyWalkers commented 1 year ago

@phquang

Thanks for your reply!

FSNet vs naive model

I agree with you that the HI method, i.e. the naive model, performs poorly in long-sequence prediction, because the HI method does not really make sense for such tasks: it basically assumes that a local repetition exists in the time series.

However, to pursue better prediction precision, a prediction window of length one is preferable. Therefore, a model that achieves a lower error than the HI method in that setting deserves more effort.

FSNet vs TCN

Thanks for pointing that out.

Using last time step representation for forecasting

I agree with you that the receptive field of the last time step in the final layer is large enough. The backbone could then be accelerated by pruning the convolution operations that do not contribute to it. As described in the following figure, all the values that are not captured by the last time step in the final layer could be pruned.

[Screenshot: figure illustrating which convolution outputs could be pruned]
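To make the pruning argument concrete, here is a small helper (my own sketch, assuming a standard causal TCN whose dilation doubles each layer) that computes how far back the last output step can see; everything outside this window, and every convolution output not on a path to the last step, could in principle be skipped at inference time:

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field of the last output step of a causal TCN with
    dilations 1, 2, 4, ... doubling per layer."""
    rf = 1
    for layer in range(num_layers):
        rf += (kernel_size - 1) * (2 ** layer)
    return rf

print(receptive_field(10))  # -> 2047 input steps for 10 layers, kernel size 3
```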