hustvl / MIMDet

[ICCV 2023] You Only Look at One Partial Sequence
https://arxiv.org/abs/2204.02964
MIT License

Some implementation detail questions #14

Closed · Yingdong-Hu closed this 2 years ago

Yingdong-Hu commented 2 years ago

Hi, I have some questions about the implementation details of Benchmarking-ViT-B.

  1. Absolute position embeddings: You use sincos_pos_embed=True and freeze the embedding. In the original paper, the authors transfer the pre-trained absolute position embeddings (actually sin-cos embeddings) for MAE and randomly initialize the absolute position embeddings for BEiT, and both of them appear to be trainable. Why are frozen absolute position embeddings used here?

  2. Relative position biases: BEiT uses relative position biases during pre-training. The BEiT authors report that if simple linear interpolation is used to adapt the relative position biases to a higher resolution, performance degrades significantly on semantic segmentation, so they use a more sophisticated interpolation algorithm. This code just uses linear interpolation (a minimal sketch of such interpolation follows this list); does this affect performance on the detection task? https://github.com/hustvl/MIMDet/blob/9e1dea10fd5eb26567cb2bac51f2b652d81620b9/models/benchmarking.py#L547

  3. The config for BEiT initialization: If I were to use BEiT, how should I modify the config file? I am fairly sure I need to set init_values=0.1 and beit_qkv_bias=True, but I am not sure whether to set sincos_pos_embed=False. Also, how should the relative position biases be resized to the higher resolution? If only linear interpolation is used, will it degrade performance? Is there anything else in the config file that needs to be modified?
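
For reference, here is a minimal sketch (illustrative only, not MIMDet's actual code) of what simple linear (here, 2D bilinear) interpolation of a relative position bias table could look like; the function name, tensor shapes, and window sizes are assumptions:

```python
import torch
import torch.nn.functional as F

def resize_rel_pos_bias_table(table: torch.Tensor, src_size: int, dst_size: int) -> torch.Tensor:
    """Resize a relative position bias table from a (2*src_size-1)^2 grid to a
    (2*dst_size-1)^2 grid with simple bilinear interpolation (illustrative sketch).

    table: ((2*src_size-1)**2, num_heads)
    """
    src_len, dst_len = 2 * src_size - 1, 2 * dst_size - 1
    num_heads = table.shape[1]
    # (L*L, H) -> (1, H, L, L) so that F.interpolate treats the heads as channels
    grid = table.reshape(src_len, src_len, num_heads).permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(dst_len, dst_len), mode="bilinear", align_corners=False)
    # back to (L'*L', H)
    return grid.squeeze(0).permute(1, 2, 0).reshape(dst_len * dst_len, num_heads)

# e.g. adapting a table pre-trained on a 14x14 patch grid (224px / patch 16) to a 32x32 grid
new_table = resize_rel_pos_bias_table(torch.randn((2 * 14 - 1) ** 2, 12), 14, 32)
print(new_table.shape)  # torch.Size([3969, 12])
```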

vealocia commented 2 years ago

Hi, @Alxead ! Thanks for your interest in our work.

Q1: In our implementation, if a weight is frozen during pre-training (i.e., the position embeddings in MAE & MoCo v3, and the patch embedding in MoCo v3), we also freeze it during our COCO fine-tuning process. The impact on final performance is still unclear because we haven't conducted experiments on this.

Q2 & Q3: We haven't tried BEiT's initialization or BEiT's interpolation method, but in our experiments, simple linear interpolation of the relative position bias works well on the SimMIM pre-trained weights (it obtains 48.7 bbox AP with 25-epoch training). Here are our configurations for the SimMIM pre-trained weights; hope these help!

```python
model.backbone.bottom_up.vit.stop_grad_conv1 = False   # no stop-gradient on the patch embedding (a MoCo v3-specific trick)
model.backbone.bottom_up.vit.sincos_pos_embed = False  # keep the pre-trained absolute position embedding instead of fixed sin-cos
model.backbone.bottom_up.vit.init_values = 0.1         # simmim initialized model with layerscale
model.backbone.bottom_up.vit.beit_qkv_bias = True      # BEiT-style qkv bias parameterization

optimizer.weight_decay = 0.1
optimizer.lr = 8e-5
```

Maybe we will conduct some supplementary experiments soon to find out the effects of a learnable position embedding and BEiT's interpolation method. If you reach any conclusions in your experiments, please let me know; I'd be happy to discuss them with you.
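
As a side note on Q1, here is a minimal, hypothetical sketch (not MIMDet's actual module; names and shapes are illustrative) of keeping the absolute position embedding as a fixed 2D sin-cos table that the optimizer never updates:

```python
import torch
import torch.nn as nn

def build_2d_sincos_pos_embed(h: int, w: int, dim: int, temperature: float = 10000.0) -> torch.Tensor:
    """Fixed 2D sin-cos position embedding of shape (h*w, dim); dim must be divisible by 4."""
    grid_y, grid_x = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    pos_dim = dim // 4
    omega = 1.0 / (temperature ** (torch.arange(pos_dim, dtype=torch.float32) / pos_dim))
    out_x = grid_x.reshape(-1)[:, None] * omega[None, :]
    out_y = grid_y.reshape(-1)[:, None] * omega[None, :]
    return torch.cat([out_x.sin(), out_x.cos(), out_y.sin(), out_y.cos()], dim=1)

class TokensWithFrozenPosEmbed(nn.Module):
    """Adds a sin-cos position embedding registered as a buffer, so it is excluded from gradient updates."""
    def __init__(self, h: int, w: int, dim: int):
        super().__init__()
        self.register_buffer("pos_embed", build_2d_sincos_pos_embed(h, w, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, h*w, dim)
        return tokens + self.pos_embed
```

Registering the table as a buffer (rather than an nn.Parameter with requires_grad=False) is one common way to make "frozen" explicit: it is saved with the state dict but never appears in the optimizer's parameter groups.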

Yuxin-CV commented 2 years ago

I believe the issue at hand has been addressed, so I'm closing this. Feel free to ask if you have further questions.