lss-1138 / SparseTSF

[ICML 2024 Oral] Official repository of the SparseTSF paper: "SparseTSF: Modeling Long-term Time Series Forecasting with 1k Parameters". This work is developed by the Lab of Professor Weiwei Lin (linww@scut.edu.cn), South China University of Technology; Pengcheng Laboratory.

Results in ICML talk #9

kashif commented 4 months ago

Was the results table shown at the ICML talk fixed with respect to the evaluation bug?

[Screenshot of the results table, 2024-07-24]

lss-1138 commented 4 months ago

Thank you for your question.

No, the results shown at the talk were consistent with those in Table 2 of the main text of the paper.

After the paper was accepted, we noticed the evaluation bug you mentioned, which primarily affected the test results on the ETTh1 and ETTh2 datasets. To assess the impact of this bug on our original conclusions, we added the results in Table 11 to the camera-ready version, which essentially upheld the same conclusions as the original results.

In other words, SparseTSF can achieve near state-of-the-art performance with less than 1,000 parameters.
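
For reference, the sub-1k figure comes almost entirely from that single linear layer. A back-of-the-envelope check, assuming the 720-step look-back, 720-step horizon, and period-24 configuration (the full model adds only a handful of extra parameters for its sliding aggregation):

L, H, w = 720, 720, 24          # assumed configuration, not stated in this thread
seg_num_x = L // w              # 30 input points per period phase
seg_num_y = H // w              # 30 output points per period phase
print(seg_num_x * seg_num_y)    # 900 -- the single bias-free weight matrix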

kashif commented 4 months ago

I find this borderline disingenuous; you should have presented the actual metrics instead of the buggy ones, regardless of your messaging.

Again, note that your method is not SOTA by any means... all you have managed to show is that your model beats simplistic linear baselines (which, like your model, are inherently univariate) as well as over-parameterized multivariate variants of transformer models trained on tiny datasets, leading to your false conclusion...

I would ask you not to compare huge multivariate models on tiny datasets, where these models tend to learn spurious correlations, especially on simple/small datasets like the ones you consider. Any huge non-linear model in the multivariate setting will suffer from this issue, not just the transformer variants. Once you compare univariate models to other non-linear models in the univariate setting (which can potentially incorporate covariates like date-time features, etc.), you will see that your model is not better.

lss-1138 commented 4 months ago

Thank you for your insightful comments. I believe they are helpful for deepening my understanding of the time series community. However, I still need to push back on some points that I feel are unfair to our work.

> I find this borderline disingenuous; you should have presented the actual metrics instead of the buggy ones, regardless of your messaging.

The long-standing bug is indeed a legacy issue that has affected the community for a while. Considering the timing of the bug's discovery, the completion of our work, and the potential impact of the bug, we believe our current approach has somewhat mitigated the concerns regarding our paper's conclusions. Of course, we will thoroughly address this bug in our future work.
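
For readers unfamiliar with the issue, and assuming it is the widely discussed drop_last flag in the test loader of the evaluation pipeline inherited from earlier LTSF codebases (an assumption on our part, since the thread does not spell the bug out), the fix is a one-line change:

from torch.utils.data import DataLoader

# Sketch of the commonly cited fix: dropping the last incomplete batch
# silently skips test windows, which skews metrics most on small
# datasets such as ETTh1 and ETTh2.
test_loader = DataLoader(
    test_dataset,            # hypothetical dataset object for illustration
    batch_size=batch_size,   # hypothetical batch size
    shuffle=False,
    drop_last=False,         # buggy versions set drop_last=True here
)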

> Again, note that your method is not SOTA by any means... all you have managed to show is that your model beats simplistic linear baselines (which, like your model, are inherently univariate) as well as over-parameterized multivariate variants of transformer models trained on tiny datasets, leading to your false conclusion...

We acknowledge that our method is not the current SOTA; in fact, we have not emphasized that point.

Most importantly, our work aims to demonstrate that even with such a small parameter count, our method can achieve surprisingly accurate predictions. We therefore want the community to focus more on the fundamental driving factor behind long-term forecasting tasks, namely the periodicity of time series.

Additionally, our model is inherently a linear-based model. So, (1) our ability to beat linear baselines at least demonstrates the effectiveness of our strategy, cross-period sparse forecasting (a minimal sketch follows below); and (2) our method's ability to outperform some over-parameterized models also shows that over-parameterization is unnecessary, and that we need to focus more on the essence of long-term forecasting tasks (periodicity).

Therefore, we do not consider our conclusion to be incorrect.
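
For concreteness, here is a minimal sketch of the cross-period sparse forecasting strategy (simplified for illustration: the released model also applies instance normalization and a Conv1d-based sliding aggregation before downsampling, and the class and variable names below are ours, not the repository's):

import torch.nn as nn

class SparseForecastSketch(nn.Module):
    def __init__(self, seq_len, pred_len, period_len):
        super().__init__()
        self.w = period_len
        self.seg_num_x = seq_len // period_len   # input points per period phase
        self.seg_num_y = pred_len // period_len  # output points per period phase
        # one bias-free linear layer shared across all channels and phases
        self.linear = nn.Linear(self.seg_num_x, self.seg_num_y, bias=False)

    def forward(self, x):                  # x: (batch, seq_len, channels)
        b, _, c = x.shape
        x = x.permute(0, 2, 1)             # (b, c, seq_len)
        # downsample: one sparse sub-series per phase position within the period
        x = x.reshape(b * c, self.seg_num_x, self.w).permute(0, 2, 1)
        y = self.linear(x)                 # forecast each sub-series: (b*c, w, seg_num_y)
        # interleave the per-phase forecasts back into a dense horizon
        y = y.permute(0, 2, 1).reshape(b, c, -1)
        return y.permute(0, 2, 1)          # (batch, pred_len, channels)

With seq_len = pred_len = 720 and period_len = 24, the only trainable tensor is the 30 × 30 weight matrix counted above, i.e., 900 parameters.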

> I would ask you not to compare huge multivariate models on tiny datasets, where these models tend to learn spurious correlations, especially on simple/small datasets like the ones you consider. Any huge non-linear model in the multivariate setting will suffer from this issue, not just the transformer variants.

In our paper, we primarily focused on several currently popular benchmark datasets, including ETTh, Traffic, and Electricity. In our view, the latter two are already relatively large datasets, not tiny ones. If you have more suitable, larger datasets to recommend, we would be happy to consider them in our future work.

Moreover, performance on tiny datasets is also very meaningful in practice, as in many real-world scenarios we cannot collect sufficiently large, high-quality datasets to train models. Therefore, the failure of huge models on tiny datasets precisely highlights the need for a smaller, simpler method suitable for edge scenarios, and SparseTSF does exactly that.

> Once you compare univariate models to other non-linear models in the univariate setting (which can potentially incorporate covariates like date-time features, etc.), you will see that your model is not better.

As you may have noticed, the fundamental factor distinguishing a model's performance on high-dimensional multivariate datasets is whether the model has non-linear capacity, rather than any other complex module design. This is because models need non-linear capacity to memorize the different patterns of different channels, and linear-based methods without it fail to do so [1].

[1] Li, Zhe, et al. "Revisiting long-term time series forecasting: An investigation on linear mapping." arXiv preprint arXiv:2305.10721 (2023).

One of the purposes of this paper is to explore the possibilities of extremely lightweight design for long-term forecasting models. Therefore, we used only a linear layer as the prediction backbone, rather than non-linear modules such as an MLP. If we replace SparseTSF's linear layer:

self.linear = nn.Linear(self.seg_num_x, self.seg_num_y, bias=False)  # the single shared weight matrix

with a two-layer MLP:

self.linear = nn.Sequential(
    nn.Linear(self.seg_num_x, 128),  # project to a hidden width of 128
    nn.ReLU(),                       # non-linear activation
    nn.Linear(128, self.seg_num_y)
)

we find that SparseTSF achieves better prediction performance on high-dimensional multivariate datasets like Electricity and Traffic. For example, on the Traffic dataset with 862 variables, the MSE is further reduced from 0.389 to 0.371.
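
Note that this variant naturally gives up the sub-1k parameter budget: assuming seg_num_x = seg_num_y = 30 as in the 720-in/720-out, period-24 setting, the two-layer MLP above has 30 × 128 + 128 + 128 × 30 + 30 ≈ 7.8k parameters, trading the extreme-lightweight property for extra capacity on high-dimensional data.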

In conclusion, the significance of this paper is not in proposing a SOTA model, but in presenting an extremely lightweight model suited to scenarios where huge models struggle to succeed, and in using the impressive results achieved at such a small scale to guide the community to focus more on the essence of long-term forecasting tasks.

We hope you can understand our contribution.