Hi @Zhangwenyao1
Thank you for your message.
In the VPT work, prompting is mainly applied to vision-only models, which typically consist of a vision backbone followed by a classifier head. So they tune the head along with the prompts.
However, in our case we are prompting CLIP, which is a vision-language model and does not have any head in its architecture. Classification in CLIP is performed by matching the image embedding against the text embeddings using cosine similarity.
So, since there is no classifier head as in vision-only models, we only learn the multimodal prompts and use embedding matching for classification.
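To make the contrast with VPT concrete, here is a minimal toy sketch (not the actual MaPLe/CLIP code; the array shapes and the fixed temperature value are illustrative assumptions) of how CLIP-style classification produces logits from cosine similarity alone, with no classifier head to train:

```python
import numpy as np

# Toy sketch: in CLIP-style classification there is no classifier head.
# Logits come from the cosine similarity between the image embedding and
# one text embedding per class (produced from the class-name prompts).
rng = np.random.default_rng(0)
num_classes, dim = 3, 8  # illustrative sizes, not CLIP's real dimensions

image_feat = rng.normal(size=dim)                  # stand-in image embedding
text_feats = rng.normal(size=(num_classes, dim))   # stand-in per-class text embeddings

# L2-normalize both sides so the dot product equals cosine similarity
image_feat = image_feat / np.linalg.norm(image_feat)
text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

logit_scale = 100.0  # CLIP learns this temperature; fixed here for illustration
logits = logit_scale * text_feats @ image_feat
pred = int(np.argmax(logits))  # predicted class index
```

Note that the only learnable pieces in this setup are the prompts that produce the embeddings; the matching step itself has no parameters, which is why there is no head to re-train as in VPT.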
Kindly let us know if this resolves your query.
Thank you and kind regards.
Thanks for your reply. By the way, why don't you show results for few-shot experiments?
Our work is mainly focused on improving the generalization of vision-language models, and our main comparison is with CoCoOp, which also only provides results on generalization benchmarks.
But feel free to try MaPLe in few-shot experiments as well; I am hopeful it would also perform impressively, since MaPLe adapts both the vision and language branches jointly, in contrast to all previous methods.
Kindly let me know in case you require any additional information.
Thank you.
I am closing this issue as I believe all your queries are resolved.
Feel free to reopen or post a new issue in case you need any further help.
Thanks!
Dear Khattak: In the paper "Visual Prompt Tuning", the authors re-train both the head and the learnable prompt parameters of VPT, but I find that your code only trains the learnable prompt parameters. What should I do if I want to train the head along with the learnable parameters?