huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add the LARS optimizer for training large-scale CNN models with larger batch sizes #33712

Closed · dame-cell closed this issue 1 month ago

dame-cell commented 1 month ago

Feature request

While training large-scale convolutional neural networks (CNNs) such as ResNet-50 with large batch sizes, I found it quite challenging to use traditional optimizers like SGD or AdamW. As the batch size increases, these optimizers often struggle with stability, which can lead to slower convergence or degraded performance.

However, Layer-wise Adaptive Rate Scaling (LARS) is specifically designed to address the issues associated with large-batch training and has been successfully applied in many state-of-the-art CNN models, especially in distributed training scenarios.

[Figure: training-curve comparison plot, taken from pytorch-lars]

Paper

Large Batch Training of Convolutional Networks

Motivation

Training CNNs like ResNet-50 with large batch sizes often leads to instability and slower convergence when using optimizers like SGD or AdamW. LARS (Layer-wise Adaptive Rate Scaling) is designed to address these challenges by applying layer-wise learning rate scaling, ensuring stable, efficient training in large-scale, distributed settings. This enables faster convergence and better performance for large-batch CNN training.

Key benefits of LARS include:

- Layer-wise learning-rate scaling: each layer's update is scaled by the ratio of its weight norm to its gradient norm, so no single layer destabilizes training
- Stable training at large batch sizes, where SGD and AdamW tend to diverge or converge slowly
- Faster convergence and better final performance in large-scale, distributed training (see the sketch below)
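For reference, here is a minimal sketch of the LARS update rule from the paper, written as a standalone PyTorch optimizer wrapping SGD with momentum. The class and hyperparameter names (`trust_coefficient`, `eps`) are illustrative assumptions rather than an existing `transformers` API, and a production version would typically also skip the trust ratio for biases and normalization parameters:

```python
# Minimal LARS sketch (You et al., 2017): scale each layer's SGD-with-momentum
# update by a trust ratio eta * ||w|| / (||g|| + eps). Illustrative only, not
# the transformers API.
import torch
from torch.optim import Optimizer


class LARS(Optimizer):
    def __init__(self, params, lr=1.0, momentum=0.9, weight_decay=0.0,
                 trust_coefficient=0.001, eps=1e-8):
        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay,
                        trust_coefficient=trust_coefficient, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if group["weight_decay"] != 0.0:
                    # Weight decay added to the gradient before the trust
                    # ratio is computed (a common variant, as in LARC)
                    grad = grad.add(p, alpha=group["weight_decay"])

                # Layer-wise trust ratio: eta * ||w|| / (||g|| + eps)
                w_norm = torch.norm(p)
                g_norm = torch.norm(grad)
                if w_norm > 0.0 and g_norm > 0.0:
                    local_lr = float(
                        group["trust_coefficient"] * w_norm / (g_norm + group["eps"])
                    )
                else:
                    local_lr = 1.0  # e.g. freshly initialized zero weights

                # Plain SGD-with-momentum step, scaled by the local learning rate
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(grad, alpha=local_lr * group["lr"])
                p.add_(buf, alpha=-1.0)

        return loss
```

Usage would mirror any other torch optimizer, e.g. `optimizer = LARS(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)`, typically combined with a learning-rate warmup schedule, which the paper found important at large batch sizes.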

LysandreJik commented 1 month ago

I'd be curious about your opinion, @rwightman, on whether this would be a worthwhile integration.

cc @amyeroberts and @qubvel as well

dame-cell commented 1 month ago

> I'd be curious about your opinion, @rwightman, on whether this would be a worthwhile integration.
>
> cc @amyeroberts and @qubvel as well

Yes, but this optimizer is only for CNN models. I'm not sure if you guys are okay with that.

rwightman commented 1 month ago

@dame-cell @LysandreJik I've trained a lot of CNN models, but I've never found a great hyperparameter space for LARS, so I've only used it a few times and never produced a 'keeper' set of model weights with it. I've also seen very, very few convnets or other vision models trained with it, and I've looked at a lot of trained models.

I did bring it into timm, added LARC options, and cleaned it up to work with PT XLA, but in hindsight I wouldn't have bothered. That's my two cents.

dame-cell commented 1 month ago

> @dame-cell @LysandreJik I've trained a lot of CNN models, but I've never found a great hyperparameter space for LARS, so I've only used it a few times and never produced a 'keeper' set of model weights with it. I've also seen very, very few convnets or other vision models trained with it, and I've looked at a lot of trained models.
>
> I did bring it into timm, added LARC options, and cleaned it up to work with PT XLA, but in hindsight I wouldn't have bothered. That's my two cents.

Thanks for sharing your experience! You're right, it seems like LARS isn't adding enough value in most cases. Let's leave it out for now. Appreciate the insights!

LysandreJik commented 1 month ago

Thanks Ross :hugs: