Closed Alsac closed 1 week ago
Hello, thanks for your message.
The parameter count of SAMformer includes the linear weight matrices for the query, key, and value in the attention block, as well as the final linear layer that forecasts the horizon H. Hence, it depends on the seq_length, hid_dim, and horizon values. The parameter count reported in the paper corresponds to the default values of these parameters in the original implementation. This shows that SAMformer is particularly efficient compared to the baselines, and in particular compared to TSMixer, which is its most efficient competitor. However, you are right that this count grows with these values and can become very large, for instance for very long horizons. In our experiments, up to horizon=720, SAMformer remains more efficient.
I hope this answers your question. Don't hesitate to open an issue or send a mail if you have additional questions.
Ambroise
Hello Alsac,
Thank you very much for your very relevant remark. Indeed, there was a miscalculation in the parameter counts for SAMformer and TSMixer: the count for SAMformer was slightly underestimated, while the count for TSMixer was significantly underestimated. The number of parameters of SAMformer is equal to $L \times (4 \cdot d_m + H) = 512 \times (64 + H)$, with sequence length $L = 512$ and hidden dimension $d_m = 16$.
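As a sanity check, the formula can be evaluated directly. Below is a minimal sketch (the function name is illustrative), assuming the defaults seq_len=512 and hid_dim=16 from the original implementation:

```python
def samformer_params(seq_len=512, hid_dim=16, horizon=96):
    # Q, K, V projections (seq_len x hid_dim each) plus the value
    # up-projection back to seq_len contribute 4 * seq_len * hid_dim
    # weights; the final linear forecaster adds seq_len * horizon.
    return seq_len * (4 * hid_dim + horizon)

for h in (96, 192, 336, 720):
    print(h, samformer_params(horizon=h))
```

With these defaults, the formula reproduces the SAMformer column of the table below (81,920 / 131,072 / 204,800 / 401,408 parameters for H = 96 / 192 / 336 / 720).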
I am sharing here the updated table to reflect these changes. This will also be updated in the paper.
In conclusion, SAMformer is not 3.73 times but 10.67 times smaller than TSMixer on average in terms of the number of parameters.
Thank you again for pointing this out!
Romain
Dataset | H=96 (SAMformer) | H=96 (TSMixer) | H=192 (SAMformer) | H=192 (TSMixer) | H=336 (SAMformer) | H=336 (TSMixer) | H=720 (SAMformer) | H=720 (TSMixer) |
---|---|---|---|---|---|---|---|---|
ETT | 81,920 | 576,604 | 131,072 | 625,948 | 204,800 | 699,628 | 401,408 | 896,620 |
Exchange | 81,920 | 1,219,084 | 131,072 | 1,398,344 | 204,800 | 1,732,396 | 401,408 | 3,696,904 |
Weather | 81,920 | 1,105,598 | 131,072 | 1,154,942 | 204,800 | 1,228,622 | 401,408 | 1,425,614 |
Electricity | 81,920 | 1,266,502 | 131,072 | 1,315,846 | 204,800 | 1,389,526 | 401,408 | 1,586,518 |
Traffic | 81,920 | 3,042,412 | 131,072 | 3,091,756 | 204,800 | 3,165,436 | 401,408 | 3,362,428 |
Horizon (H) | SAMformer (avg params) | TSMixer (avg params) | Avg Ratio (TSMixer/SAMformer) |
---|---|---|---|
H=96 | 81,920 | 1,442,040 | 17.60 |
H=192 | 131,072 | 1,517,367 | 11.58 |
H=336 | 204,800 | 1,643,121 | 8.02 |
H=720 | 401,408 | 2,193,616 | 5.46 |
AVG | | | 10.67 |
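For reference, the averages and ratios in the summary table can be reproduced from the per-dataset counts above (dataset order in each list: ETT, Exchange, Weather, Electricity, Traffic):

```python
# Parameter counts copied from the per-dataset table above.
samformer = {96: 81920, 192: 131072, 336: 204800, 720: 401408}
tsmixer = {
    96: [576604, 1219084, 1105598, 1266502, 3042412],
    192: [625948, 1398344, 1154942, 1315846, 3091756],
    336: [699628, 1732396, 1228622, 1389526, 3165436],
    720: [896620, 3696904, 1425614, 1586518, 3362428],
}

ratios = []
for h, counts in tsmixer.items():
    avg_tsmixer = sum(counts) / len(counts)     # average over the 5 datasets
    ratio = avg_tsmixer / samformer[h]          # TSMixer / SAMformer
    ratios.append(ratio)
    print(h, avg_tsmixer, round(ratio, 2))

print("AVG ratio:", round(sum(ratios) / len(ratios), 2))
```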
Hello! Your work has given me a lot of inspiration, thank you.
Currently, I'm facing a point of confusion regarding the parameter count of SAMformer mentioned in the paper. I found that in the PyTorch version of the code, there are linear layers for Q, K, and V, as well as a linear_forecaster. If seq_len=512, hid_dim=16, and pred_horizon=96 (based on my understanding), then the model's parameter count could become very large. Could you help resolve my confusion? Once again, thank you for your great work!