impiga / Plain-DETR

[ICCV2023] DETR Doesn’t Need Multi-Scale or Locality Design

Questions about application to a plain ViT #16

Closed: feivelliu closed this issue 9 months ago

feivelliu commented 10 months ago

Very happy to see your code! I am very interested in applying it to a plain ViT; could you share some related tips? Thank you so much!

impiga commented 10 months ago

Hi, you could easily create a new ViT backbone class in backbone.py.

Here are some tips:

  1. For the implementation, you could refer to detectron2's ViT backbone (a minimal wrapper sketch follows this list).
  2. By default, ViT applies global attention in every layer. To enable window-based attention (similar to Swin Transformer), you could adjust the window_size and window_block_indexes options (here).
  3. Load an MAE pre-trained checkpoint (a loading sketch also follows).
  4. Add layer-wise learning rate decay for ViT. In our existing code, we have defined the get_swin_layer_id function for Swin Transformer; you could use it as a reference when adding an implementation for ViT (see the sketch after this list). Layer-wise learning rate decay is a widely adopted trick when fine-tuning models pre-trained with masked image modeling.
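
Putting tips 1 and 2 together, here is a minimal sketch of such a backbone class, assuming detectron2's ViTDet implementation (detectron2.modeling.backbone.vit.ViT) is installed. The class name, the chosen window_block_indexes, and the num_channels attribute are illustrative, and the wiring into the rest of backbone.py is left out:

```python
import torch.nn as nn
from detectron2.modeling.backbone.vit import ViT  # ViTDet backbone


class ViTBackbone(nn.Module):
    """Illustrative plain-ViT backbone (ViT-B/16); names are placeholders."""

    def __init__(self):
        super().__init__()
        self.body = ViT(
            img_size=1024,
            patch_size=16,
            embed_dim=768,
            depth=12,
            num_heads=12,
            use_rel_pos=True,
            window_size=14,  # window-based attention, similar to Swin
            # Windowed blocks; global attention is kept at blocks 2, 5, 8, 11.
            window_block_indexes=[0, 1, 3, 4, 6, 7, 9, 10],
            out_feature="last_feat",
        )
        self.num_channels = 768  # single-scale output channels

    def forward(self, x):
        # ViT returns a dict of features; the single map is (N, 768, H/16, W/16).
        return self.body(x)["last_feat"]
```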
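
For tip 3, a sketch of loading an official MAE checkpoint into such a backbone. It assumes the encoder weights sit under a "model" key (as in the released MAE checkpoints), uses strict=False since key names can differ between ViT implementations, and skips position-embedding interpolation, which you may need when the input resolution differs from pre-training:

```python
import torch


def load_mae_checkpoint(backbone, ckpt_path):
    """Load MAE pre-trained encoder weights (sketch; strict=False on purpose)."""
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)  # released MAE ckpts nest under "model"
    # Drop decoder and mask-token weights that exist only for pre-training.
    state = {k: v for k, v in state.items()
             if not k.startswith(("decoder", "mask_token"))}
    missing, unexpected = backbone.body.load_state_dict(state, strict=False)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)
```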
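
For tip 4, a sketch of a ViT counterpart to get_swin_layer_id. The function names are illustrative, and the parameter names assume a timm/MAE-style ViT (patch_embed, blocks.0, blocks.1, ...):

```python
def get_vit_layer_id(param_name, num_layers):
    """Map a parameter name to a depth index for layer-wise lr decay."""
    if param_name.startswith(("patch_embed", "pos_embed", "cls_token")):
        return 0
    if param_name.startswith("blocks"):
        return int(param_name.split(".")[1]) + 1  # "blocks.<i>. ..." -> i + 1
    return num_layers + 1  # final norm and anything outside the blocks


def get_vit_lr_scale(param_name, num_layers, decay_rate):
    """Deeper layers keep a larger fraction of the base learning rate."""
    layer_id = get_vit_layer_id(param_name, num_layers)
    return decay_rate ** (num_layers + 1 - layer_id)
```

With ViT-B (num_layers=12) and decay_rate=0.75, a common choice when fine-tuning MAE models, the last block trains at 0.75x the base learning rate while the patch embedding trains at roughly 0.75^13 ≈ 0.024x.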