raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

dim unsigned #10

Closed Ahnsun closed 2 years ago

Ahnsun commented 2 years ago

Hi, I used the pre-trained model ViT-B-16.pt with the config retinanet_clip_r101_fpn_1x_coco.py. However, the embed_dim of CLIPTextContextEncoder in the config is 1024, while the embed_dim of the pre-trained model is 512.
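For reference, a minimal sketch of how one could check the text embedding dimension stored in a CLIP checkpoint, assuming it is OpenAI's TorchScript archive whose `text_projection` parameter has shape `[transformer_width, embed_dim]`:

```python
import torch

# Load the TorchScript CLIP checkpoint on CPU and grab its weights.
model = torch.jit.load("ViT-B-16.pt", map_location="cpu")
state_dict = model.state_dict()

# text_projection has shape [transformer_width, embed_dim],
# so the second dimension is the text embedding size (512 for ViT-B/16).
embed_dim = state_dict["text_projection"].shape[1]
print("embed_dim:", embed_dim)
```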

raoyongming commented 2 years ago

Hi, the config retinanet_clip_r101_fpn_1x_coco.py is used to train the RetinaNet model with the CLIP ResNet-101 backbone. You would need to modify the dimensions and the backbone architecture if you want to use this config to train a ViT-B model. Besides, it is difficult to use the ViT-B model directly on detection tasks due to the quadratic complexity of self-attention and the large image sizes used in detection. Therefore, we didn't test the ViT-B model on COCO in our paper.
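To make the dimension changes concrete, here is a hedged sketch of the kind of text_encoder overrides implied above, assuming the config follows the repo's `text_encoder=dict(type='CLIPTextContextEncoder', ...)` structure. Only `CLIPTextContextEncoder` and `embed_dim` come from this thread; the other keys and values are illustrative (ViT-B/16's text transformer uses width 512, 12 layers, 8 heads), and the backbone swap is not shown since it is not a tested configuration:

```python
# Hypothetical partial config for pairing DenseCLIP's text encoder with
# the ViT-B/16 checkpoint; not a tested or official configuration.
model = dict(
    # The image backbone would also need to be replaced with a ViT-based one,
    # which retinanet_clip_r101_fpn_1x_coco.py does not provide.
    text_encoder=dict(
        type='CLIPTextContextEncoder',
        embed_dim=512,          # ViT-B/16 text embedding dim (1024 in the R101 config)
        transformer_width=512,  # ViT-B/16 text transformer width (assumed key name)
        transformer_heads=8,
        transformer_layers=12,
    ),
)
```

Even with matching dimensions, the memory cost of ViT self-attention at detection input resolutions remains the practical blocker mentioned above.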

Ahnsun commented 2 years ago

Get it, thanks a lot!