huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add MultiStepLR with Warmup Scheduler #31831

Open penguinwang96825 opened 2 months ago

penguinwang96825 commented 2 months ago

Feature request

I would like to propose a new learning rate scheduler that combines MultiStepLR with a warmup phase. Currently, the Transformers library does not include a scheduler that offers both. This is useful for training runs where the learning rate needs to be decreased at specific milestones, with an initial warmup phase to stabilise training.

Motivation

In many training scenarios, it is beneficial to start with a warmup phase where the learning rate gradually increases, followed by a phase where the learning rate decreases at specific milestones (steps).
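
For context, this shape can already be stitched together manually in plain PyTorch by chaining a warmup scheduler with MultiStepLR via `SequentialLR`. The snippet below is only a rough sketch with arbitrary example values, not part of the proposal itself:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, MultiStepLR, SequentialLR

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps = 500
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # Linear warmup from 1% of the base LR up to the full LR over `warmup_steps` steps.
        LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps),
        # Afterwards, multiply the LR by 0.1 at each milestone (relative to when this scheduler becomes active).
        MultiStepLR(optimizer, milestones=[10_000, 20_000], gamma=0.1),
    ],
    milestones=[warmup_steps],
)
```

Having a single `get_*_schedule_with_warmup`-style helper would avoid this manual composition and fit the Trainer's existing scheduler selection.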

Contribution

I propose adding a new scheduler, `get_multistep_schedule_with_warmup`, which combines MultiStepLR with a linear warmup. The scheduler would increase the learning rate linearly during the warmup phase and then follow the MultiStepLR schedule, multiplying the learning rate by a decay factor at each milestone. I am more than happy to open a pull request implementing this feature; please let me know if it sounds like a valuable addition, and I will proceed with the implementation.
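
As a rough sketch of what I have in mind (the helper name, argument names, and the exact milestone/decay convention are placeholders; the real implementation would be aligned with the existing `get_*_schedule_with_warmup` helpers in `transformers.optimization`, which return a `LambdaLR`):

```python
from functools import partial

import torch
from torch.optim.lr_scheduler import LambdaLR


def _get_multistep_schedule_with_warmup_lr_lambda(
    current_step: int, *, num_warmup_steps: int, milestones: list, gamma: float
) -> float:
    # Linear warmup: scale the LR from 0 up to the base LR over `num_warmup_steps` steps.
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    # After warmup: multiply the base LR by `gamma` once for every milestone already passed.
    num_decays = sum(1 for milestone in milestones if current_step >= milestone)
    return gamma ** num_decays


def get_multistep_schedule_with_warmup(
    optimizer: torch.optim.Optimizer,
    num_warmup_steps: int,
    milestones: list,
    gamma: float = 0.1,
    last_epoch: int = -1,
) -> LambdaLR:
    """
    Create a schedule whose learning rate increases linearly from 0 to the initial LR during the
    warmup phase and is then multiplied by `gamma` at each step listed in `milestones`.
    """
    lr_lambda = partial(
        _get_multistep_schedule_with_warmup_lr_lambda,
        num_warmup_steps=num_warmup_steps,
        milestones=sorted(milestones),
        gamma=gamma,
    )
    return LambdaLR(optimizer, lr_lambda, last_epoch)
```

Usage would mirror the existing helpers, e.g. `scheduler = get_multistep_schedule_with_warmup(optimizer, num_warmup_steps=500, milestones=[10_000, 20_000], gamma=0.1)`, with `scheduler.step()` called once per optimizer step.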

amyeroberts commented 1 month ago

cc @muellerzr @SunMarc

muellerzr commented 1 month ago

Hi! In general, we prefer that you provide tangible evidence of improvement, either from your own work or from a paper you can reference. Can you link any, please? Thanks!

penguinwang96825 commented 1 month ago

@muellerzr Thanks for the prompt reply! Sure, I've laid it out in detail below.

Popularity and Practical Use

The MultiStepLR scheduler is widely used and recognized for its effectiveness in practice. According to Defazio et al. (2023), it is one of the three most popular schedulers among PyTorch users. This piecewise-constant approach, which decreases the learning rate at points where progress typically plateaus, has proven effective, and many studies adopt it as the default choice for learning rate adjustment (Sohn, 2016; Wang et al., 2017; Gong et al., 2021). The table below lists the approximate number of GitHub files (in thousands) that reference each PyTorch scheduler.

| PyTorch Scheduler | GitHub Files (K) |
| --- | ---: |
| ReduceLROnPlateau | 105.0 |
| StepLR | 101.0 |
| MultiStepLR | 37.9 |
| CosineAnnealingLR | 37.1 |
| ExponentialLR | 16.0 |
| OneCycleLR | 14.9 |
| CosineAnnealingWarmRestarts | 10.9 |
| CyclicLR | 9.1 |
| LinearLR | 5.9 |
| ConstantLR | 3.6 |
| MultiplicativeLR | 2.6 |
| PolynomialLR | 1.3 |

References

Defazio, Aaron, et al. "When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement." arXiv preprint arXiv:2310.07831 (2023).

Gong, Yuan, Yu-An Chung, and James Glass. "AST: Audio Spectrogram Transformer." Proc. Interspeech (2021).

Wang, Jian, et al. "Deep Metric Learning with Angular Loss." Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017).

Sohn, Kihyuk. "Improved Deep Metric Learning with Multi-class N-pair Loss Objective." Advances in Neural Information Processing Systems 29 (2016).