tgxs002 / CORA

A DETR-style framework for open-vocabulary detection (OVD). CVPR 2023
Apache License 2.0

Question about the backbone #10

Status: Open (opened by chaos1992 1 year ago)

chaos1992 commented 1 year ago

How can I use CLIP ViT as the backbone? Which layer of CLIP ViT corresponds to 'feature_layer'?

tgxs002 commented 1 year ago

Thank you for your interest in our work! The model is designed for CLIP variants that use a ResNet image encoder, and making it run with Vision Transformers would require substantial changes. If you want to use CLIP ViT as the backbone, I think you would need to use the output features of the last layer.
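As a rough illustration of the last suggestion, here is a minimal sketch (not part of CORA) of turning a ViT's last-layer output into a ResNet-style spatial feature map that region features could be pooled from. It assumes the tokens are laid out as one class token followed by the patch tokens in row-major order, as in the standard ViT design. The function name `tokens_to_feature_map` and the example shapes are hypothetical. Whether to also apply CLIP's final layer norm and projection to the patch tokens is a separate design choice not covered here.

```python
import numpy as np


def tokens_to_feature_map(tokens: np.ndarray, grid_h: int, grid_w: int) -> np.ndarray:
    """Convert last-layer ViT tokens into a (N, C, H, W) feature map.

    Assumes `tokens` has shape (N, 1 + grid_h * grid_w, C): one class token
    followed by the patch tokens in row-major (row by row) order.
    """
    n, num_tokens, c = tokens.shape
    if num_tokens != 1 + grid_h * grid_w:
        raise ValueError(
            f"expected {1 + grid_h * grid_w} tokens, got {num_tokens}"
        )
    patch_tokens = tokens[:, 1:, :]                      # drop the class token
    fmap = patch_tokens.reshape(n, grid_h, grid_w, c)    # restore the 2D patch grid
    return fmap.transpose(0, 3, 1, 2)                    # channels-first, like a CNN map


# Example: a 224x224 input with 16x16 patches gives a 14x14 grid of patches.
tokens = np.random.rand(2, 1 + 14 * 14, 768).astype(np.float32)
fmap = tokens_to_feature_map(tokens, 14, 14)
print(fmap.shape)  # (2, 768, 14, 14)
```

Note that the resulting map has a stride equal to the patch size (16 in this example), which differs from the strides of the ResNet feature maps the model was designed around, so this alone would not be a drop-in replacement.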