impiga / Plain-DETR

[ICCV2023] DETR Doesn’t Need Multi-Scale or Locality Design

Questions about application to a plain ViT #16

Closed: feivelliu closed this issue 9 months ago

feivelliu commented 10 months ago

Very happy to see your code! I am very interested in applying it to a plain ViT; could you share some related tips? Thank you so much!

impiga commented 10 months ago

Hi, you could easily create a new ViT backbone class in backbone.py.

Here are some tips:

  1. For the implementation, you could refer to detectron2's ViT backbone (a minimal wrapper sketch follows this list).
  2. By default, ViT applies global attention in every layer. To enable window-based attention (similar to Swin Transformer), you could adjust the window_size and window_block_indexes options (here).
  3. Load an MAE pre-trained checkpoint (a loading sketch also follows).
  4. Add layer-wise learning rate decay for ViT. In our existing code, we have defined the get_swin_layer_id function for Swin Transformer; you could use it as a reference when adding an implementation for ViT (see the sketch after this list). Layer-wise learning rate decay is a widely adopted trick when fine-tuning models pre-trained with masked image modeling.
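
Putting tips 1 and 2 together, here is a minimal sketch of such a backbone class, assuming detectron2's ViTDet implementation (detectron2.modeling.backbone.vit.ViT) is installed. The class name, the chosen window_block_indexes, and the num_channels attribute are illustrative, and the wiring into the rest of backbone.py is left out:

```python
import torch.nn as nn
from detectron2.modeling.backbone.vit import ViT  # ViTDet backbone


class ViTBackbone(nn.Module):
    """Illustrative plain-ViT backbone (ViT-B/16); names are placeholders."""

    def __init__(self):
        super().__init__()
        self.body = ViT(
            img_size=1024,
            patch_size=16,
            embed_dim=768,
            depth=12,
            num_heads=12,
            use_rel_pos=True,
            window_size=14,  # window-based attention, similar to Swin
            # Windowed blocks; global attention is kept at blocks 2, 5, 8, 11.
            window_block_indexes=[0, 1, 3, 4, 6, 7, 9, 10],
            out_feature="last_feat",
        )
        self.num_channels = 768  # single-scale output channels

    def forward(self, x):
        # ViT returns a dict of features; the single map is (N, 768, H/16, W/16).
        return self.body(x)["last_feat"]
```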
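
For tip 3, a sketch of loading an official MAE checkpoint into such a backbone. It assumes the encoder weights sit under a "model" key (as in the released MAE checkpoints), uses strict=False since key names can differ between ViT implementations, and skips position-embedding interpolation, which you may need when the input resolution differs from pre-training:

```python
import torch


def load_mae_checkpoint(backbone, ckpt_path):
    """Load MAE pre-trained encoder weights (sketch; strict=False on purpose)."""
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)  # released MAE ckpts nest under "model"
    # Drop decoder and mask-token weights that exist only for pre-training.
    state = {k: v for k, v in state.items()
             if not k.startswith(("decoder", "mask_token"))}
    missing, unexpected = backbone.body.load_state_dict(state, strict=False)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)
```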
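
For tip 4, a sketch of a ViT counterpart to get_swin_layer_id. The function names are illustrative, and the parameter names assume a timm/MAE-style ViT (patch_embed, blocks.0, blocks.1, ...):

```python
def get_vit_layer_id(param_name, num_layers):
    """Map a parameter name to a depth index for layer-wise lr decay."""
    if param_name.startswith(("patch_embed", "pos_embed", "cls_token")):
        return 0
    if param_name.startswith("blocks"):
        return int(param_name.split(".")[1]) + 1  # "blocks.<i>. ..." -> i + 1
    return num_layers + 1  # final norm and anything outside the blocks


def get_vit_lr_scale(param_name, num_layers, decay_rate):
    """Deeper layers keep a larger fraction of the base learning rate."""
    layer_id = get_vit_layer_id(param_name, num_layers)
    return decay_rate ** (num_layers + 1 - layer_id)
```

With ViT-B (num_layers=12) and decay_rate=0.75, a common choice when fine-tuning MAE models, the last block trains at 0.75x the base learning rate while the patch embedding trains at roughly 0.75^13 ≈ 0.024x.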