muzairkhattak / multimodal-prompt-learning

[CVPR 2023] Official repository of paper titled "MaPLe: Multi-modal Prompt Learning".
https://muzairkhattak.github.io/multimodal-prompt-learning/
MIT License

A minor question about VPT/IndependentVL. #51

Closed: TobyZack closed this issue 4 months ago

TobyZack commented 4 months ago

Dear Authors,

Thank you for your contribution.

May I ask where the visual prompts are used in VPT/IVLP? Upon reviewing the prompt learners for the two classes, I observed only FixedEmbeddings in VPT and only textual prompts in IVLP.

For example, in the VPT forward function:

        text_features = self.embeddings.return_fixed_embeddings().cuda()
        image_features = self.image_encoder(image.type(self.dtype))

And in the IVLP forward function:

        text_features = self.text_encoder(prompts, tokenized_prompts)
        image_features = self.image_encoder(image.type(self.dtype))

If my understanding is correct, only textual prompts are employed in these forward functions, not any visual prompts.

In contrast, visual prompts (shared_ctx, deep_compound_prompts_vision) do appear in the MaPLe forward function:

        text_features = self.text_encoder(prompts, tokenized_prompts, deep_compound_prompts_text)
        image_features = self.image_encoder(image.type(self.dtype), shared_ctx, deep_compound_prompts_vision)
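(For context, my rough understanding of where those visual prompts come from in MaPLe, as a minimal sketch with hypothetical class and variable names, not the repository's exact code: the learnable textual context is projected into the visual space through linear coupling layers, so the image encoder receives prompts conditioned on the text ones.)

    import torch
    import torch.nn as nn

    class MaPLeStylePromptLearner(nn.Module):
        # Minimal sketch (hypothetical names): the learnable textual context
        # is mapped to visual prompts through linear "coupling" layers, so
        # the vision branch receives prompts tied to the textual ones.
        def __init__(self, n_ctx=2, txt_dim=512, vis_dim=768, prompt_depth=9):
            super().__init__()
            # first-layer textual context tokens
            self.ctx = nn.Parameter(torch.randn(n_ctx, txt_dim) * 0.02)
            # coupling function: text context -> visual prompts (shared_ctx)
            self.proj = nn.Linear(txt_dim, vis_dim)
            # deeper textual prompts and one coupler per prompted layer
            self.deep_ctx = nn.ParameterList(
                [nn.Parameter(torch.randn(n_ctx, txt_dim) * 0.02)
                 for _ in range(prompt_depth - 1)]
            )
            self.deep_proj = nn.ModuleList(
                [nn.Linear(txt_dim, vis_dim) for _ in range(prompt_depth - 1)]
            )

        def forward(self):
            shared_ctx = self.proj(self.ctx)  # visual prompts for the first layer
            deep_prompts_text = list(self.deep_ctx)
            deep_prompts_vision = [proj(ctx)
                                   for proj, ctx in zip(self.deep_proj, self.deep_ctx)]
            return self.ctx, shared_ctx, deep_prompts_text, deep_prompts_vision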

As I am new to the VLM field, this question might seem naive, but any assistance would be greatly appreciated. Best regards.

TobyZack commented 4 months ago

I think I have found the definition. Unlike the prompt learner in MaPLe, the visual prompts for VPT/IVLP are defined inside the ResBlocks, correct? If so, my concerns have been addressed.

muzairkhattak commented 4 months ago

Dear @TobyZack,

Thank you for showing interest in MaPLe!

Sorry for the delayed response. Regarding your question: yes, you are right. For IVLP and VPT, the visual prompts are employed inside the ResBlocks in CLIP's model.py.
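Conceptually, it works roughly like the simplified sketch below (hypothetical names, not the exact code in this repository): each prompted ResBlock owns a few learnable visual tokens and concatenates them to the patch sequence before self-attention, so the wrapper model's forward function does not need to pass any visual prompts explicitly.

    import torch
    import torch.nn as nn

    class PromptedResidualAttentionBlock(nn.Module):
        # Simplified sketch of VPT-style deep prompting inside a ViT block:
        # each prompted block strips the prompts appended by the previous
        # block and appends its own learnable tokens before self-attention.
        # Only these tokens are trained; the CLIP backbone stays frozen.
        def __init__(self, d_model=768, n_head=12, n_prompts=4, first_layer=False):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_head)
            self.ln_1 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.GELU(),
                nn.Linear(d_model * 4, d_model),
            )
            self.ln_2 = nn.LayerNorm(d_model)
            self.n_prompts = n_prompts
            self.first_layer = first_layer
            # learnable visual prompt tokens owned by this block
            self.visual_prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

        def forward(self, x):
            # x: (seq_len, batch, d_model), CLS token first, patch tokens after
            if not self.first_layer:
                x = x[: x.shape[0] - self.n_prompts]  # drop previous layer's prompts
            prompts = self.visual_prompts.unsqueeze(1).expand(-1, x.shape[1], -1)
            x = torch.cat([x, prompts.to(x.dtype)], dim=0)  # append this layer's prompts
            y = self.ln_1(x)
            x = x + self.attn(y, y, y, need_weights=False)[0]
            x = x + self.mlp(self.ln_2(x))
            return x

That is why the VPT/IVLP forward functions you quoted only show the image being passed to the image encoder: the visual prompts are injected inside the encoder's blocks rather than handed in as arguments.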

Kind regards!