chaos1992 opened 1 year ago
How can I use the clip-vit as the backbone? Which layer of the clip-vit is the 'feature_layer'?

Thank you for your interest in our work! The model is designed for CLIP versions that use ResNet as the backbone. A lot of changes need to be made to make it run for vision transformers. If you want to use the CLIP ViT as the backbone, I guess you need to use the output feature of the last layer.