romilbert / samformer

Official implementation of SAMformer, a transformer leveraging Sharpness-Aware Minimization and Channel-Wise Attention for Time Series Forecasting.
MIT License

Can SAMformer work better with deeper layers? #10

Closed · Alsac closed this 3 months ago

Alsac commented 4 months ago

Thank you for your work! I tested the code and found that SAMformer performs very well with a single scaled_dot_product_attention layer, but the MSE degrades when I add more layers. Do you have any suggestions for making the model deeper? Thank you!

romilbert commented 3 months ago

Thank you for your feedback! We tested various architectures for SAMformer and found that a shallow model tends to perform better, mainly because transformers overfit quickly on this type of data. We therefore recommend keeping the number of layers low for optimal performance. If you really want to use multiple layers, I would suggest strengthening the regularization by increasing the value of rho in SAM, to help mitigate the risk of overfitting, as in the sketch below.
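To make the rho suggestion concrete, here is a minimal, framework-agnostic illustration of a single SAM update step written in PyTorch. The tiny linear model, the synthetic data, and the rho value of 0.7 are placeholders chosen for the example and are not taken from the SAMformer code base or the paper's hyperparameters; the point is only to show where rho enters the two-step (perturb, then update) procedure.

```python
# Hypothetical sketch of one SAM update with a larger rho.
# model, data, and rho are stand-ins, not SAMformer's actual configuration.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)          # stand-in for a deeper forecasting model
loss_fn = torch.nn.MSELoss()
base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
rho = 0.7                              # stronger SAM regularization than usual

x, y = torch.randn(32, 8), torch.randn(32, 1)

def sam_step(x, y):
    # 1) gradients at the current weights
    loss = loss_fn(model(x), y)
    loss.backward()

    # 2) move to the worst-case weights within an L2 ball of radius rho
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in params]), p=2)
    perturbations = []
    with torch.no_grad():
        for p in params:
            e = p.grad * (rho / (grad_norm + 1e-12))
            p.add_(e)
            perturbations.append((p, e))
    base_opt.zero_grad()

    # 3) gradients at the perturbed weights
    loss_fn(model(x), y).backward()

    # 4) undo the perturbation, then update with the sharpness-aware gradients
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
    return loss.item()

print(sam_step(x, y))
```

A larger rho makes the perturbation in step 2 bigger, so the update in step 4 favors flatter minima more aggressively, which is the intended counterweight to the extra capacity of a deeper model.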