muzairkhattak / multimodal-prompt-learning

[CVPR 2023] Official repository of paper titled "MaPLe: Multi-modal Prompt Learning".
https://muzairkhattak.github.io/multimodal-prompt-learning/
MIT License

text prompt #74

Open xylovezxy opened 2 weeks ago

xylovezxy commented 2 weeks ago

Hello! What should I do if I want to use non-template text for fine-tuning? Each sentence is completely different, not just "a photo of a cat".

muzairkhattak commented 5 days ago

Hi @xylovezxy,

Thank you for showing interest in MaPLe!

Regarding your query, you can take the average of the embeddings of your text sentences and initialize the learnable prompts with those averaged embeddings, instead of using the embeddings of "a photo of". This better initialization should lead to better results. A minimal sketch of this idea is shown below.
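Here is one plausible reading of that suggestion as code, using the OpenAI `clip` package that this repo builds on. The sentences, the `ViT-B/16` backbone, `n_ctx`, and the choice to average the first `n_ctx` word positions across sentences are all illustrative assumptions, not the official MaPLe initialization:

```python
import torch
import clip

# Hypothetical example descriptions; replace with your own non-template sentences.
sentences = [
    "a small striped cat sitting on a windowsill",
    "an orange tabby curled up on a sofa",
]

n_ctx = 2  # number of learnable context tokens (match your MaPLe config)

model, _ = clip.load("ViT-B/16", device="cpu")

with torch.no_grad():
    tokens = clip.tokenize(sentences)                     # (N, 77) token ids
    embeddings = model.token_embedding(tokens)            # (N, 77, 512) word embeddings
    # Skip the SOS token at position 0; average the first n_ctx word
    # positions across all sentences to get one initialization vector
    # per learnable context token.
    ctx_init = embeddings[:, 1:1 + n_ctx, :].mean(dim=0)  # (n_ctx, 512)

# ctx_init can then replace the "a photo of" initialization of the
# learnable prompt vectors, e.g. self.ctx = nn.Parameter(ctx_init).
```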

Another suggestion: you can also utilize the extra information (the non-static text descriptions) as an auxiliary loss alongside MaPLe. Kindly check the text-to-text mapping loss in the ProText work. You can then train the model with both the supervised loss and this text-to-text loss, which will additionally inject the non-static text-description information into the model for better performance; see the sketch below.
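A rough sketch of how such a combined objective could look. The function name, the cosine-distance form of the auxiliary term, and the `lam` weight are all assumptions for illustration; refer to the ProText paper/code for the actual text-to-text mapping loss:

```python
import torch.nn.functional as F

def total_loss(logits, labels, prompted_text_feats, description_feats, lam=1.0):
    """Supervised loss plus an auxiliary text-to-text term (sketch).

    prompted_text_feats: text-encoder features of the learnable-prompt
        templates (one per class), L2-normalized.
    description_feats: frozen CLIP text features of the detailed,
        non-static descriptions for the same classes, L2-normalized.
    lam: hypothetical weighting hyperparameter for the auxiliary term.
    """
    ce = F.cross_entropy(logits, labels)
    # Pull each prompted class embedding toward its rich description,
    # in the spirit of ProText's text-to-text mapping loss.
    t2t = 1.0 - F.cosine_similarity(
        prompted_text_feats, description_feats, dim=-1
    ).mean()
    return ce + lam * t2t
```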