raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Question about how to use a ViT backbone in detection #43

Closed. tandangzuoren closed this issue 8 months ago

tandangzuoren commented 1 year ago

Hi, thank you for your good work.

I tried to use the ViT structure with mmdetection. I noticed that the ViT part goes through the DenseCLIP_MaskRCNN class in detection/denseclip/denseclip.py, but when I add a ViT config I cannot run it correctly: there are dimension mismatches or other incompatibility problems. It seems that the ViT version for detection is not complete. Can you give me some suggestions? If I want to use ViT as the backbone with DenseCLIP on COCO, can I use denseclip.py from segmentation? Or can you provide a ViT detection configuration? Or am I using it the wrong way?

Thanks!

raoyongming commented 11 months ago

Hi, sorry for the late response. Since the image resolution used in detection is much higher than in segmentation, we found it is not trivial to directly adapt CLIP ViT to detection tasks. We didn't use the CLIP ViT for detection in our final models. It may need some architectural changes if you want to use it in detection. You may find some useful information in ViTDet.
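
To make the resolution point concrete, here is a rough back-of-the-envelope sketch, not code from this repository. It counts the patch tokens a ViT with 16x16 patches would see at different input sizes and how many token pairs global self-attention would compare. The sizes are assumptions for illustration (224x224 as a typical CLIP pre-training size, 512x512 as a common segmentation crop, 800x1333 as a common COCO detection scale), and the helper names `patch_grid` and `num_tokens` are hypothetical.

```python
import math


def patch_grid(height, width, patch=16):
    """Return (rows, cols) of the patch grid, assuming the image is
    padded up to the next multiple of `patch` on each side."""
    return math.ceil(height / patch), math.ceil(width / patch)


def num_tokens(height, width, patch=16):
    """Number of patch tokens (class token not counted)."""
    rows, cols = patch_grid(height, width, patch)
    return rows * cols


# Assumed input sizes; none of these are stated in this thread.
sizes = {
    "pretrain 224x224": (224, 224),
    "segmentation 512x512": (512, 512),
    "detection 800x1333": (800, 1333),
}

for name, (h, w) in sizes.items():
    n = num_tokens(h, w)
    # Global self-attention compares every token with every token,
    # so the pair count grows quadratically with the token count.
    print(f"{name}: grid={patch_grid(h, w)}, tokens={n}, attention pairs={n * n}")

seg = num_tokens(512, 512)
det = num_tokens(800, 1333)
print(f"detection / segmentation attention-pair ratio: {(det / seg) ** 2:.1f}x")
```

Under these assumptions the detection input has roughly four times as many tokens as the segmentation crop, so global attention is far more expensive, and the positional embeddings learned on the small pre-training grid have to be resized to a much larger, non-square grid. That is likely one source of shape mismatches when a ViT config is dropped into the detection code unchanged.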