Hi, I noticed that for the 4 object detection frameworks in your paper, you use the same lr setting: AdamW with lr=0.0001. But their original base lr settings are different:
Cascade Mask R-CNN: SGD with lr=0.02
ATSS: SGD with lr=0.01
RepPoints V2: SGD with lr=0.01
Sparse R-CNN: AdamW with lr=0.000025
Leaving the optimizer type aside, how did you decide the lr when using Swin Transformer as the backbone for these 4 frameworks? It seems that your lr has nothing to do with their original ones, which puzzles me. From my point of view, the lr should be adjusted according to the network structure and the loss formulation, but you just use the same setting. How do you explain this? Any advice? Thanks.
AdamW adapts the effective step size of each parameter automatically according to the statistics of its gradients, so the base lr is far less sensitive to the detection framework than it is for SGD. A good setting can work well across many tasks and frameworks.
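For intuition, here is a minimal sketch of an AdamW-style update (following Loshchilov & Hutter; the beta, eps, and weight-decay values below are illustrative assumptions, not necessarily the paper's exact settings). The raw gradient is normalized by its running second-moment estimate, so the per-element step is roughly bounded by the base lr regardless of how large a given framework's loss gradients are:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.05):
    """One AdamW update for a single parameter tensor (sketch).

    The gradient magnitude is divided out by sqrt(v_hat), so the
    actual step per element is roughly capped at lr, independent of
    the gradient scale produced by a particular detection framework.
    """
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step
    theta = theta - lr * weight_decay * theta     # decoupled weight decay
    return theta, m, v

# Gradients that differ by 100x in scale still yield steps of similar size:
for scale in (0.01, 1.0):
    theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    for t in range(1, 101):
        grad = scale * np.random.randn(3)
        theta, m, v = adamw_step(theta, grad, m, v, t)
    print(scale, np.round(theta, 3))
```

With SGD the step is lr times the raw gradient, so the lr has to be retuned whenever the loss formulation or gradient scale changes, which is why the original SGD baselines use different values.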