muzairkhattak / multimodal-prompt-learning

[CVPR 2023] Official repository of paper titled "MaPLe: Multi-modal Prompt Learning".
https://muzairkhattak.github.io/multimodal-prompt-learning/
MIT License

A minor question about VPT/IndependentVL. #51

Closed: TobyZack closed this issue 4 months ago

TobyZack commented 4 months ago

Dear Authors,

Thank you for your contribution.

May I ask where the visual prompts are used in VPT/IVLP? Upon reviewing the prompt learners for the two classes, I observed only FixedEmbeddings in VPT and only textual prompts in IVLP.

For example, in the VPT forward function:

        text_features = self.embeddings.return_fixed_embeddings().cuda()
        image_features = self.image_encoder(image.type(self.dtype))

And in the IVLP forward function:

        text_features = self.text_encoder(prompts, tokenized_prompts)
        image_features = self.image_encoder(image.type(self.dtype))

If my understanding is correct, only textual prompts are employed in these forward functions, not any visual prompts.

In contrast, visual prompts (shared_ctx, deep_compound_prompts_vision) do appear in the MaPLe forward function:

        text_features = self.text_encoder(prompts, tokenized_prompts, deep_compound_prompts_text)
        image_features = self.image_encoder(image.type(self.dtype), shared_ctx, deep_compound_prompts_vision)
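(For context, my rough understanding of where those visual prompts come from in MaPLe, as a minimal sketch with hypothetical class and variable names, not the repository's exact code: the learnable textual context is projected into the visual space through linear coupling layers, so the image encoder receives prompts conditioned on the text ones.)

    import torch
    import torch.nn as nn

    class MaPLeStylePromptLearner(nn.Module):
        # Minimal sketch (hypothetical names): the learnable textual context
        # is mapped to visual prompts through linear "coupling" layers, so
        # the vision branch receives prompts tied to the textual ones.
        def __init__(self, n_ctx=2, txt_dim=512, vis_dim=768, prompt_depth=9):
            super().__init__()
            # first-layer textual context tokens
            self.ctx = nn.Parameter(torch.randn(n_ctx, txt_dim) * 0.02)
            # coupling function: text context -> visual prompts (shared_ctx)
            self.proj = nn.Linear(txt_dim, vis_dim)
            # deeper textual prompts and one coupler per prompted layer
            self.deep_ctx = nn.ParameterList(
                [nn.Parameter(torch.randn(n_ctx, txt_dim) * 0.02)
                 for _ in range(prompt_depth - 1)]
            )
            self.deep_proj = nn.ModuleList(
                [nn.Linear(txt_dim, vis_dim) for _ in range(prompt_depth - 1)]
            )

        def forward(self):
            shared_ctx = self.proj(self.ctx)  # visual prompts for the first layer
            deep_prompts_text = list(self.deep_ctx)
            deep_prompts_vision = [proj(ctx)
                                   for proj, ctx in zip(self.deep_proj, self.deep_ctx)]
            return self.ctx, shared_ctx, deep_prompts_text, deep_prompts_vision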

As I am new to the VLM field, this question might seem naive, but any assistance would be greatly appreciated. Best regards.

TobyZack commented 4 months ago

I think I have found the definition. Unlike the prompt learner in MaPLe, the visual prompts for VPT/IVLP are defined inside the ResBlocks, correct? If so, my concerns have been addressed.

muzairkhattak commented 4 months ago

Dear @TobyZack,

Thank you for showing interest in MaPLe!

Sorry for the delayed response. Regarding your question: yes, you are right. For IVLP and VPT, the visual prompts are employed inside the ResBlocks in CLIP's model.py.
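Conceptually, it works roughly like the simplified sketch below (hypothetical names, not the exact code in this repository): each prompted ResBlock owns a few learnable visual tokens and concatenates them to the patch sequence before self-attention, so the wrapper model's forward function does not need to pass any visual prompts explicitly.

    import torch
    import torch.nn as nn

    class PromptedResidualAttentionBlock(nn.Module):
        # Simplified sketch of VPT-style deep prompting inside a ViT block:
        # each prompted block strips the prompts appended by the previous
        # block and appends its own learnable tokens before self-attention.
        # Only these tokens are trained; the CLIP backbone stays frozen.
        def __init__(self, d_model=768, n_head=12, n_prompts=4, first_layer=False):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_head)
            self.ln_1 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.GELU(),
                nn.Linear(d_model * 4, d_model),
            )
            self.ln_2 = nn.LayerNorm(d_model)
            self.n_prompts = n_prompts
            self.first_layer = first_layer
            # learnable visual prompt tokens owned by this block
            self.visual_prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

        def forward(self, x):
            # x: (seq_len, batch, d_model), CLS token first, patch tokens after
            if not self.first_layer:
                x = x[: x.shape[0] - self.n_prompts]  # drop previous layer's prompts
            prompts = self.visual_prompts.unsqueeze(1).expand(-1, x.shape[1], -1)
            x = torch.cat([x, prompts.to(x.dtype)], dim=0)  # append this layer's prompts
            y = self.ln_1(x)
            x = x + self.attn(y, y, y, need_weights=False)[0]
            x = x + self.mlp(self.ln_2(x))
            return x

That is why the VPT/IVLP forward functions you quoted only show the image being passed to the image encoder: the visual prompts are injected inside the encoder's blocks rather than handed in as arguments.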

Kind regards!