muzairkhattak / multimodal-prompt-learning

[CVPR 2023] Official repository of paper titled "MaPLe: Multi-modal Prompt Learning".
https://muzairkhattak.github.io/multimodal-prompt-learning/
MIT License

about backbone #73

Closed rock5913 closed 5 days ago

rock5913 commented 2 weeks ago

Hi! Thank you very much for your excellent article. I noticed that the backbone of the image encoder you chose is ViT-B/16. Have you ever tested others, like ViT-L/14 or ViT-B/32? If you did, did ViT-L work better than ViT-B/16? If I want to test them, what should I do? Looking forward to your generous reply from a beginner. Thank you!

muzairkhattak commented 2 weeks ago

Hi @rock5913,

Thank you for showing interest in MaPLe!

Yes, we have experimented with the ViT-L/14 model, and its performance is better than that of the ViT-B/16 variant.

In order to use the ViT-L/14 model with MaPLe, you would need to do the following:

1) Add the ViT-L/14 model weight URL from the OpenAI CLIP repository here (see the first sketch after this list).

2) Use the ViT-L/14 name in the config file by replacing NAME: "ViT-B/16" with NAME: "ViT-L/14".

3) The text embedding dimensions are different for the ViT-L/14 variant (the text encoder width is 768 instead of 512). You also need to change the dimensions of the coupling function in the MaPLe trainer at these lines (see the second sketch after this list).
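
For step 1, here is a minimal sketch of what the model registry change could look like, assuming the `_MODELS` dictionary layout used in the OpenAI CLIP `clip.py` that is bundled with this repository. The `<sha256>` placeholders are not real values; copy the exact ViT-L/14 URL from the official openai/CLIP repository.

```python
# clip/clip.py (sketch): register the ViT-L/14 checkpoint next to the existing
# entries. The "<sha256>" segments are placeholders, not real hashes; take the
# exact URLs from the official openai/CLIP clip.py.
_MODELS = {
    "ViT-B/32": "https://openaipublic.azureedge.net/clip/models/<sha256>/ViT-B-32.pt",
    "ViT-B/16": "https://openaipublic.azureedge.net/clip/models/<sha256>/ViT-B-16.pt",
    "ViT-L/14": "https://openaipublic.azureedge.net/clip/models/<sha256>/ViT-L-14.pt",  # new entry
}
```

For step 3, here is a minimal sketch of the dimension change for the coupling layers in the prompt learner, assuming they are plain `nn.Linear` layers that map from the text encoder width to the vision encoder width. The variable names and the `prompt_depth = 9` default are illustrative assumptions, not the exact identifiers in the trainer; if the text width is read from the loaded CLIP model it may adjust automatically, but any hard-coded vision-side width still has to be updated.

```python
import torch.nn as nn

# CLIP transformer widths (text -> vision):
#   ViT-B/16: 512 -> 768
#   ViT-L/14: 768 -> 1024
ctx_dim = 768      # text encoder width for ViT-L/14 (512 for ViT-B/16)
vision_dim = 1024  # vision encoder width for ViT-L/14 (768 for ViT-B/16)

# Coupling function projecting the first layer of language prompts to vision prompts.
proj = nn.Linear(ctx_dim, vision_dim)

# One coupling layer per deeper prompt layer; prompt_depth = 9 is an assumption
# matching the default MaPLe prompt depth.
prompt_depth = 9
compound_prompt_projections = nn.ModuleList(
    [nn.Linear(ctx_dim, vision_dim) for _ in range(prompt_depth - 1)]
)
```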
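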

I hope this is helpful. Thank you and kind regards!

muzairkhattak commented 5 days ago

Hi @rock5913,

I am closing the issue now. Feel free to open it if there are still any issues.