romilbert / samformer

Official implementation of SAMformer, a transformer leveraging Sharpness-Aware Minimization and Channel-Wise Attention for Time Series Forecasting.
MIT License

Can SAMformer work better with deeper layers? #10

Closed · Alsac closed this 3 months ago

Alsac commented 4 months ago

Thank you for your work! I tested the code and found that SAMformer performs very well with a single scaled_dot_product_attention layer, but the MSE degrades when I add more layers. Do you have any suggestions for making the model deeper? Thank you!

romilbert commented 3 months ago

Thank you for your feedback! We tested various architectures for SAMformer and found that a shallow model tends to perform better, mainly because transformers overfit quickly on this type of data. We therefore recommend keeping the number of layers low for optimal performance. If you really want to use multiple layers, I would suggest strengthening the regularization by increasing the value of rho in SAM, to help mitigate the risk of overfitting, as in the sketch below.
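To make the rho suggestion concrete, here is a minimal, framework-agnostic illustration of a single SAM update step written in PyTorch. The tiny linear model, the synthetic data, and the rho value of 0.7 are placeholders chosen for the example and are not taken from the SAMformer code base or the paper's hyperparameters; the point is only to show where rho enters the two-step (perturb, then update) procedure.

```python
# Hypothetical sketch of one SAM update with a larger rho.
# model, data, and rho are stand-ins, not SAMformer's actual configuration.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)          # stand-in for a deeper forecasting model
loss_fn = torch.nn.MSELoss()
base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
rho = 0.7                              # stronger SAM regularization than usual

x, y = torch.randn(32, 8), torch.randn(32, 1)

def sam_step(x, y):
    # 1) gradients at the current weights
    loss = loss_fn(model(x), y)
    loss.backward()

    # 2) move to the worst-case weights within an L2 ball of radius rho
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in params]), p=2)
    perturbations = []
    with torch.no_grad():
        for p in params:
            e = p.grad * (rho / (grad_norm + 1e-12))
            p.add_(e)
            perturbations.append((p, e))
    base_opt.zero_grad()

    # 3) gradients at the perturbed weights
    loss_fn(model(x), y).backward()

    # 4) undo the perturbation, then update with the sharpness-aware gradients
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
    return loss.item()

print(sam_step(x, y))
```

A larger rho makes the perturbation in step 2 bigger, so the update in step 4 favors flatter minima more aggressively, which is the intended counterweight to the extra capacity of a deeper model.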