Open xylovezxy opened 2 weeks ago
Hi @xylovezxy,
Thank you for showing interest in MaPLe!
Regarding your query: you can take the average of the embeddings of your text sentences and initialize the learnable prompts with those averaged embeddings, instead of using the embeddings of "a photo of". This better initialization should lead to better results.
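A minimal sketch of that initialization, using numpy stand-ins (the toy `VOCAB` and random `token_embedding` table are assumptions; in practice you would use CLIP's tokenizer and its frozen token-embedding layer):

```python
import numpy as np

# Hypothetical stand-ins for CLIP's tokenizer and token-embedding layer.
rng = np.random.default_rng(0)
VOCAB = {w: i for i, w in enumerate(
    "a photo of cat sitting on the mat sleeping in sun".split())}
EMBED_DIM = 8
token_embedding = rng.standard_normal((len(VOCAB), EMBED_DIM))

def embed_sentence(sentence: str) -> np.ndarray:
    """Look up token embeddings for a sentence; shape (n_tokens, dim)."""
    ids = [VOCAB[w] for w in sentence.lower().split() if w in VOCAB]
    return token_embedding[ids]

def init_prompts_from_sentences(sentences, n_ctx=4):
    """Initialize n_ctx learnable context vectors from the mean token
    embedding of the given non-template descriptions, instead of the
    embeddings of "a photo of"."""
    means = [embed_sentence(s).mean(axis=0) for s in sentences]
    avg = np.mean(means, axis=0)       # (dim,)
    # Tile the average across the context slots; these tiled vectors
    # become the trainable prompt parameters.
    return np.tile(avg, (n_ctx, 1))    # (n_ctx, dim)

descriptions = ["a cat sitting on the mat", "a cat sleeping in the sun"]
ctx = init_prompts_from_sentences(descriptions, n_ctx=4)
print(ctx.shape)  # (4, 8)
```

In a real MaPLe run, `ctx` would be wrapped in `torch.nn.Parameter` and trained as usual; only the initialization changes.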
Another suggestion: you can also use the extra information (the non-static text descriptions) as an auxiliary loss alongside MaPLe. Kindly check the text-to-text mapping loss in the ProText work. You can then train the model with both the supervised loss and this text-to-text loss, which will additionally inject the non-static text-description information into the model for better performance.
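A rough sketch of combining the two objectives. The cosine-distance form of the mapping loss and the weight `lam` are assumptions for illustration; ProText's exact formulation may differ, so treat this as a template rather than the paper's loss:

```python
import numpy as np

def cross_entropy(logits, label):
    """Standard supervised CE for one sample (numerically stable)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def text_to_text_loss(prompted_feat, target_feat):
    """Cosine-distance mapping loss: pulls the text feature produced by
    the learned prompts toward the frozen feature of the rich, non-static
    description (assumed form, not ProText's exact loss)."""
    cos = prompted_feat @ target_feat / (
        np.linalg.norm(prompted_feat) * np.linalg.norm(target_feat))
    return 1.0 - cos

# Dummy tensors standing in for model outputs.
rng = np.random.default_rng(1)
logits = rng.standard_normal(5)    # image-text logits over 5 classes
label = 2
prompted = rng.standard_normal(8)  # text feature from learned prompts
target = rng.standard_normal(8)    # frozen feature of the description

lam = 0.5  # auxiliary-loss weight (hyperparameter, assumed)
total = cross_entropy(logits, label) + lam * text_to_text_loss(prompted, target)
print(float(total))
```

Both terms are differentiable, so in PyTorch the same sum can simply be backpropagated through the prompt parameters.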
Hello! What should I do if I want to fine-tune with non-template text? Each of my sentences is completely different, not just "a photo of a cat".