Hi! Thank you very much for your excellent article. I noticed that the image-encoder backbone you chose is ViT-B/16. Have you tested others, such as ViT-L/14 or ViT-B/32? If you did, did ViT-L/14 work better than ViT-B/16? If I want to test them, what should I do? Looking forward to your generous reply, from a beginner. Thank you!

Hi @rock5913,
Thank you for showing interest in MaPLe!
Yes, we have experimented with the ViT-L/14 model, and its performance is better than that of the ViT-B/16 variant.
In order to use the ViT-L/14 model with MaPLe, you would need to do the following (see the sketches after this list):
1) Add the ViT-L/14 model weight URL from the OpenAI repository here.
2) Use the ViT-L/14 name in the config file by replacing NAME: "ViT-B/16" with NAME: "ViT-L/14".
3) The text embedding dimension is different for the ViT-L/14 variant (its text encoder embeddings have 768 dimensions, versus 512 for ViT-B/16). You also need to change the dimensions of the coupling function in the MaPLe trainer at these lines.
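For step 1, a minimal sketch of what the change looks like, assuming the weight URLs live in the _MODELS dictionary of MaPLe's vendored clip/clip.py, as in the upstream openai/CLIP repository (the URL below is copied from openai/CLIP; please verify it against the upstream file):

```python
# clip/clip.py -- registry of downloadable CLIP checkpoints
_MODELS = {
    "ViT-B/16": "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt",
    # New entry; URL taken from the upstream openai/CLIP repository:
    "ViT-L/14": "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt",
}
```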
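For step 2, the corresponding config change would look like the snippet below (the file name is hypothetical; apply it to whichever MaPLe trainer config you are running):

```yaml
# e.g. configs/trainers/MaPLe/vit_l14_c2_ep5_batch4_2ctx.yaml (hypothetical name)
MODEL:
  BACKBONE:
    NAME: "ViT-L/14"   # was: "ViT-B/16"
```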
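For step 3, a runnable sketch of the dimension change. It assumes the coupling function is the nn.Linear projection(s) in the MaPLe prompt learner, that their output width must match the vision transformer width (768 for ViT-B/16, 1024 for ViT-L/14), and that names such as _get_clones and compound_prompts_depth mirror the MaPLe code; confirm the exact widths against the loaded CLIP checkpoint:

```python
import copy

import torch
import torch.nn as nn


def _get_clones(module, n):
    # Same kind of helper MaPLe uses to replicate a projection per prompt depth.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])


# Assumed CLIP transformer widths per backbone (text / vision).
TEXT_DIM = {"ViT-B/16": 512, "ViT-L/14": 768}
VISION_DIM = {"ViT-B/16": 768, "ViT-L/14": 1024}

backbone = "ViT-L/14"
ctx_dim = TEXT_DIM[backbone]       # in maple.py: clip_model.ln_final.weight.shape[0]
vision_dim = VISION_DIM[backbone]  # hard-coded to 768 (the ViT-B/16 value) in maple.py

# Coupling function: projects text-side prompt tokens into the vision branch.
proj = nn.Linear(ctx_dim, vision_dim)

# The deeper compound-prompt projections need the same output width.
compound_prompts_depth = 9
compound_prompt_projections = _get_clones(
    nn.Linear(ctx_dim, vision_dim), compound_prompts_depth - 1
)

# Sanity check: two 768-d text prompt tokens map to 1024-d vision tokens.
ctx = torch.randn(2, ctx_dim)
print(proj(ctx).shape)  # torch.Size([2, 1024])
```

In other words, the hard-coded 768 output dimension of these projections would become 1024 for ViT-L/14, while ctx_dim is read from the loaded model and adapts automatically.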
I hope this is helpful. Thank you and kind regards!
Hi @rock5913,
I am closing the issue now. Feel free to reopen it if there are still any issues.