microsoft / Swin-Transformer

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
https://arxiv.org/abs/2103.14030

How to decide the learning rate for a certain experiment? #11

Closed feiyuhuahuo closed 3 years ago

feiyuhuahuo commented 3 years ago

Hi, I noticed that for the 4 object detection frameworks in your paper, you use the same lr setting: AdamW with lr=0.0001. But their original base lr settings differ:

- Cascade Mask R-CNN: SGD with lr=0.02
- ATSS: SGD with lr=0.01
- RepPoints v2: SGD with lr=0.01
- Sparse R-CNN: AdamW with lr=0.000025

Leaving the optimizer type aside, how did you decide the lr when using Swin Transformer as the backbone for these 4 frameworks? Your lr seems unrelated to the original ones, which puzzles me. From my point of view, the lr should be adjusted according to the network structure and the loss formulation, but you just use the same setting. How do you explain this? Any advice, thanks.
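(For reference, the single setting in question corresponds to something like the following PyTorch sketch; the placeholder model and the weight-decay value are illustrative assumptions, not taken from the repo's configs.)

```python
import torch

# Placeholder module standing in for a detector with a Swin backbone;
# the same optimizer line would be used for all four frameworks above.
model = torch.nn.Linear(10, 10)

# AdamW with lr=0.0001, regardless of each framework's original
# SGD/AdamW base setting. weight_decay=0.05 is an assumed value here.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
```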

ancientmooner commented 3 years ago

AdamW automatically adapts its effective learning rate according to the statistics of the gradients, so a good setting can work well across many tasks and frameworks.
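For intuition, here is a minimal sketch of the AdamW update for a single scalar parameter (illustrative only, not this repo's code). The normalized step `m_hat / (sqrt(v_hat) + eps)` has magnitude of roughly 1 whatever the raw gradient scale, so the actual step size is bounded by lr. This is why one lr can transfer across frameworks whose gradient scales differ, unlike SGD, where the step grows with the gradient.

```python
import math

def adamw_step(param, grad, state, lr=1e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.05):
    """One AdamW update for a single scalar parameter (illustrative sketch)."""
    state["step"] += 1
    t = state["step"]
    b1, b2 = betas

    # Exponential moving averages of the gradient and its square.
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad

    # Bias-corrected estimates.
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)

    # Decoupled weight decay, applied directly to the parameter.
    param -= lr * weight_decay * param
    # Step normalized by gradient statistics: magnitude is O(lr).
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param

# Two gradients that differ by 100x produce nearly identical step sizes:
for g in (0.01, 1.0):
    state = {"step": 0, "m": 0.0, "v": 0.0}
    p = adamw_step(1.0, g, state)
    print(g, 1.0 - p)  # step ≈ lr * (1 + weight_decay) either way
```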

feiyuhuahuo commented 3 years ago

Thanks.