Closed: dame-cell closed this issue 1 month ago
I'd be curious to hear your opinion, @rwightman, on whether this would be a worthwhile integration.
cc @amyeroberts and @qubvel as well
Yes, but this optimizer is only for CNN models. I'm not sure if you're good with that.
@dame-cell @LysandreJik I've trained a lot of CNN models, and I've never found a great hparam space for LARS, so I've only used it a few times and never produced a 'keeper' set of model weights with it. I've also seen very, very few convnets or other vision models trained with it, and I've looked at a lot of trained models.
I did bring it into timm, added LARC options, and cleaned it up to work with PT XLA, but in hindsight I wouldn't have bothered. Those are my 2 cents.
Thanks for sharing your experience! You're right, it seems like LARS isn't adding enough value in most cases. Let's leave it out for now. Appreciate the insights!
Thanks Ross :hugs:
Feature request
I have been training large-scale convolutional neural networks (CNNs), such as ResNet-50, with large batch sizes, and found it quite challenging with traditional optimizers like SGD or AdamW. As the batch size increases, these optimizers often become unstable, converge more slowly, or end up with degraded final performance.
However, Layer-wise Adaptive Rate Scaling (LARS) is specifically designed to address the issues associated with large-batch training and has been successfully applied to many state-of-the-art CNN models, especially in distributed training scenarios.
(The plot above was taken from pytorch-lars.)
Paper
Large Batch Training of Convolutional Networks
Motivation
Training CNNs like ResNet-50 with large batch sizes often leads to instability and slower convergence when using optimizers like SGD or AdamW. LARS (Layer-wise Adaptive Rate Scaling) is designed to address these challenges by applying layer-wise learning rate scaling, ensuring stable, efficient training in large-scale, distributed settings. This enables faster convergence and better performance for large-batch CNN training.
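For reference, here is a minimal sketch of the layer-wise scaling rule from the linked paper. This is a hypothetical standalone implementation, not the timm optimizer or any existing Transformers API; the function name `lars_step` and its default hyperparameters are made up for illustration. Real implementations also carry momentum and typically skip the scaling for biases and normalization layers.

```python
import torch

def lars_step(params, lr=1.0, eta=0.001, weight_decay=1e-4):
    """One plain LARS step (no momentum) over an iterable of parameter tensors.

    Hypothetical sketch of the rule in "Large Batch Training of Convolutional
    Networks"; not the timm or Transformers implementation.
    """
    with torch.no_grad():
        for w in params:
            if w.grad is None:
                continue
            g = w.grad
            w_norm = torch.norm(w)
            g_norm = torch.norm(g)
            # Layer-wise trust ratio:
            #   local_lr = eta * ||w|| / (||g|| + weight_decay * ||w||)
            # Fall back to 1.0 when either norm is zero.
            if w_norm > 0 and g_norm > 0:
                trust_ratio = (eta * w_norm / (g_norm + weight_decay * w_norm)).item()
            else:
                trust_ratio = 1.0
            # Scale the global LR per layer and include weight decay in the update.
            w.add_(g + weight_decay * w, alpha=-lr * trust_ratio)
```

The key point is that each layer's effective learning rate is set by the ratio of its weight norm to its gradient norm, which is what keeps very deep networks stable at large batch sizes where a single global learning rate does not.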
Key benefits of LARS include: