romilbert / samformer

Official implementation of SAMformer, a transformer leveraging Sharpness-Aware Minimization and Channel-Wise Attention for Time Series Forecasting.

About model parameters. #15

Closed: Alsac closed this issue 1 week ago

Alsac commented 2 months ago

Hello! Your work has given me a lot of inspiration, thank you.

Currently, I'm facing a point of confusion regarding the parameter count of SAMformer mentioned in the paper. I found that in the PyTorch version of the code, there are linear networks for Q, K, and V, as well as a linear_forecaster. If seq_len=512, hid_dim=16, and pred_horizon=96 (based on my understanding), then the model's parameter count could become very large. Could you help resolve my confusion? Once again, thank you for your great work!

[two attached screenshots of the PyTorch implementation]
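For reference, here is a minimal PyTorch sketch of the parameter-bearing layers the question refers to. It is illustrative only: the layer names, the bias-free single-head layout, and the output projection (suggested by the 4·d_m factor in the maintainers' formula below) are assumptions, not the repository's exact code.

```python
import torch.nn as nn

# Illustrative sketch only: layer names, the bias-free choice, and the
# output projection (implied by the 4*d_m factor in the maintainers'
# formula below) are assumptions, not the repo's exact code.
seq_len, hid_dim, horizon = 512, 16, 96  # L, d_m, H

layers = nn.ModuleDict({
    "q_proj": nn.Linear(seq_len, hid_dim, bias=False),            # L x d_m
    "k_proj": nn.Linear(seq_len, hid_dim, bias=False),            # L x d_m
    "v_proj": nn.Linear(seq_len, hid_dim, bias=False),            # L x d_m
    "out_proj": nn.Linear(hid_dim, seq_len, bias=False),          # d_m x L
    "linear_forecaster": nn.Linear(seq_len, horizon, bias=False), # L x H
})

n_params = sum(p.numel() for p in layers.parameters())
print(n_params)  # 512 * (4*16 + 96) = 81920
```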

ambroiseodt commented 1 week ago

Hello, thanks for your message.

The parameter count of SAMformer includes the linear weight matrices of the query, key, and value in the attention, as well as the last linear layer that forecasts the horizon H. Hence, it depends on the values of seq_length, hid_dim, and horizon. The parameter count reported in the paper corresponds to the default values of these parameters in the original implementation. This shows that SAMformer is particularly efficient compared to the baselines, and in particular compared to TSMixer, its most efficient competitor. However, you are right that the count grows with the horizon and can become very large for very long horizons. In our experiments, up to horizon=720, SAMformer remains more efficient.

I hope this answers your question. Don't hesitate to open an issue or send a mail if you have additional questions.

Ambroise

romilbert commented 1 week ago

Hello Alsac,

Thank you very much for this very relevant remark. Indeed, there was a slight miscalculation in the parameter counts for SAMformer and TSMixer. The number of parameters of SAMformer was slightly underestimated: it is equal to $L \times (4 \cdot d_m + H) = 512 \times (64 + H)$, where $L = 512$ is the input sequence length and $d_m = 16$ the hidden dimension of the attention. The count for TSMixer, on the other hand, was significantly underestimated.
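As a quick check, the formula can be evaluated directly for the horizons used in the experiments (a minimal sketch; the outputs match the SAMformer columns of the table below):

```python
# Evaluate L * (4*d_m + H) = 512 * (64 + H) for each forecasting horizon.
L, d_m = 512, 16
for H in (96, 192, 336, 720):
    print(f"H={H}: {L * (4 * d_m + H):,}")
# H=96: 81,920 | H=192: 131,072 | H=336: 204,800 | H=720: 401,408
```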

I am sharing here the updated table to reflect these changes. This will also be updated in the paper.

In conclusion, SAMformer is not 3.73 times smaller than TSMixer in terms of parameter count, but 10.67 times smaller on average.

Thank you again for pointing this out!

Romain

| Dataset | SAMformer (H=96) | TSMixer (H=96) | SAMformer (H=192) | TSMixer (H=192) | SAMformer (H=336) | TSMixer (H=336) | SAMformer (H=720) | TSMixer (H=720) |
|---|---|---|---|---|---|---|---|---|
| ETT | 81,920 | 576,604 | 131,072 | 625,948 | 204,800 | 699,628 | 401,408 | 896,620 |
| Exchange | 81,920 | 1,219,084 | 131,072 | 1,398,344 | 204,800 | 1,732,396 | 401,408 | 3,696,904 |
| Weather | 81,920 | 1,105,598 | 131,072 | 1,154,942 | 204,800 | 1,228,622 | 401,408 | 1,425,614 |
| Electricity | 81,920 | 1,266,502 | 131,072 | 1,315,846 | 204,800 | 1,389,526 | 401,408 | 1,586,518 |
| Traffic | 81,920 | 3,042,412 | 131,072 | 3,091,756 | 204,800 | 3,165,436 | 401,408 | 3,362,428 |
| Horizon (H) | SAMformer (avg params) | TSMixer (avg params) | Avg Ratio (TSMixer / SAMformer) |
|---|---|---|---|
| 96 | 81,920 | 1,442,040 | 17.60 |
| 192 | 131,072 | 1,517,367 | 11.58 |
| 336 | 204,800 | 1,643,121 | 8.02 |
| 720 | 401,408 | 2,193,616 | 5.46 |
| AVG | | | 10.67 |
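For completeness, here is a minimal sketch that recomputes the per-horizon averages and the overall 10.67 ratio from the table values above (results match the table up to rounding):

```python
# Plain arithmetic over the table values above; no repository code involved.
samformer = {96: 81_920, 192: 131_072, 336: 204_800, 720: 401_408}
tsmixer = {  # per-dataset counts: ETT, Exchange, Weather, Electricity, Traffic
    96:  [576_604, 1_219_084, 1_105_598, 1_266_502, 3_042_412],
    192: [625_948, 1_398_344, 1_154_942, 1_315_846, 3_091_756],
    336: [699_628, 1_732_396, 1_228_622, 1_389_526, 3_165_436],
    720: [896_620, 3_696_904, 1_425_614, 1_586_518, 3_362_428],
}

ratios = []
for H, counts in tsmixer.items():
    avg = sum(counts) / len(counts)
    ratios.append(avg / samformer[H])
    print(f"H={H}: TSMixer avg = {avg:,.0f}, ratio = {ratios[-1]:.2f}")
print(f"Average ratio: {sum(ratios) / len(ratios):.2f}")  # 10.67
```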