muzairkhattak / multimodal-prompt-learning

[CVPR 2023] Official repository of paper titled "MaPLe: Multi-modal Prompt Learning".
https://muzairkhattak.github.io/multimodal-prompt-learning/
MIT License

text prompt #74

Open xylovezxy opened 2 weeks ago

xylovezxy commented 2 weeks ago

Hello! What should I do if I want to use non-template text for fine-tuning? Each sentence is completely different, not just "a photo of a cat".

muzairkhattak commented 5 days ago

Hi @xylovezxy,

Thank you for showing interest in MaPLe!

Regarding your query, you can take the average of the embeddings of your text sentences and initialize the learnable prompts with those averaged embeddings, instead of using the embeddings of "a photo of". This better initialization should lead to better results. A minimal sketch of this idea is shown below.
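Here is one plausible reading of that suggestion as code, using the OpenAI `clip` package that this repo builds on. The sentences, the `ViT-B/16` backbone, `n_ctx`, and the choice to average the first `n_ctx` word positions across sentences are all illustrative assumptions, not the official MaPLe initialization:

```python
import torch
import clip

# Hypothetical example descriptions; replace with your own non-template sentences.
sentences = [
    "a small striped cat sitting on a windowsill",
    "an orange tabby curled up on a sofa",
]

n_ctx = 2  # number of learnable context tokens (match your MaPLe config)

model, _ = clip.load("ViT-B/16", device="cpu")

with torch.no_grad():
    tokens = clip.tokenize(sentences)                     # (N, 77) token ids
    embeddings = model.token_embedding(tokens)            # (N, 77, 512) word embeddings
    # Skip the SOS token at position 0; average the first n_ctx word
    # positions across all sentences to get one initialization vector
    # per learnable context token.
    ctx_init = embeddings[:, 1:1 + n_ctx, :].mean(dim=0)  # (n_ctx, 512)

# ctx_init can then replace the "a photo of" initialization of the
# learnable prompt vectors, e.g. self.ctx = nn.Parameter(ctx_init).
```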

Another suggestion: you can also utilize the extra information (the non-static text descriptions) as an auxiliary loss alongside MaPLe. Kindly check the text-to-text mapping loss in the ProText work. You can then train the model with both the supervised loss and this text-to-text loss, which will additionally inject the non-static text-description information into the model for better performance; see the sketch below.
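A rough sketch of how such a combined objective could look. The function name, the cosine-distance form of the auxiliary term, and the `lam` weight are all assumptions for illustration; refer to the ProText paper/code for the actual text-to-text mapping loss:

```python
import torch.nn.functional as F

def total_loss(logits, labels, prompted_text_feats, description_feats, lam=1.0):
    """Supervised loss plus an auxiliary text-to-text term (sketch).

    prompted_text_feats: text-encoder features of the learnable-prompt
        templates (one per class), L2-normalized.
    description_feats: frozen CLIP text features of the detailed,
        non-static descriptions for the same classes, L2-normalized.
    lam: hypothetical weighting hyperparameter for the auxiliary term.
    """
    ce = F.cross_entropy(logits, labels)
    # Pull each prompted class embedding toward its rich description,
    # in the spirit of ProText's text-to-text mapping loss.
    t2t = 1.0 - F.cosine_similarity(
        prompted_text_feats, description_feats, dim=-1
    ).mean()
    return ce + lam * t2t
```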