kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0

Choice of retrieval embedding dimension q = 256 #10

Closed: EIFY closed this issue 1 year ago

EIFY commented 1 year ago

First of all thanks for the inspiring work — I have presented and mentioned FROMAGe🧀 a few times at work and in discussions!

One thing came to mind when I was thinking about this: the embedding dimension of OPT-6.7B is 1024 and that of CLIP ViT-L/14 is 768, so the choice of retrieval embedding dimension q = 256 seems to be a significant (3x-4x) bottleneck. Is there a reason for it? Similarly, have you tried changing q?
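
For concreteness, this is roughly how I picture the retrieval projections (a minimal sketch, not the actual fromage code; the class and argument names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalHeads(nn.Module):
    """Hypothetical sketch: project LLM hidden states and CLIP visual
    embeddings into a shared retrieval space of dimension q."""

    def __init__(self, llm_dim: int, clip_dim: int = 768, q: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(llm_dim, q)    # LLM hidden state -> q
        self.image_proj = nn.Linear(clip_dim, q)  # CLIP ViT-L/14 embedding -> q

    def forward(self, ret_hidden: torch.Tensor, clip_emb: torch.Tensor):
        # L2-normalize so retrieval scores are cosine similarities in the q-dim space.
        t = F.normalize(self.text_proj(ret_hidden), dim=-1)
        v = F.normalize(self.image_proj(clip_emb), dim=-1)
        return t, v
```

Everything the retrieval side sees has to squeeze through that q = 256 space, hence the question.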

kohjingyu commented 1 year ago

Thanks for your kind comments, and for sharing the work! 🎉

I tried higher values of q (512 and 1024), but in general validation R@1 gets worse above 256. I believe this is because the batch size we use for contrastive learning is quite small (180, compared to 32K for CLIP), so a larger embedding dimension is unnecessary and overfits quickly. I think that if you are able to train it with more GPUs (or the 80GB A100s), increasing both the batch size and the embedding dimension will likely improve performance.
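
To make the batch-size point concrete, here's a rough sketch of a standard symmetric in-batch contrastive loss (not our exact training code; the names and the temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric in-batch InfoNCE-style loss over L2-normalized (B, q) embeddings.
    With B = 180, each example only sees 179 in-batch negatives, regardless of q.
    (The temperature here is illustrative, not the paper's setting.)"""
    logits = text_emb @ image_emb.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2i = F.cross_entropy(logits, labels)        # text -> image direction
    loss_i2t = F.cross_entropy(logits.t(), labels)    # image -> text direction
    return 0.5 * (loss_t2i + loss_i2t)
```

With a batch of 180 there are only 179 in-batch negatives per example, so the extra capacity from q = 512 or 1024 tends to overfit the training pairs rather than improve validation R@1.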

But for the hyperparameters we used in the paper, 256 seems to be optimal (as measured by retrieval on the CC3M validation set).
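
(For reference, the validation metric here is just top-1 retrieval accuracy over the text-to-image similarity matrix; a rough sketch, assuming matching pairs sit on the diagonal:)

```python
import torch

def recall_at_1(sim: torch.Tensor) -> float:
    """Top-1 retrieval accuracy for an (N, N) text-to-image similarity matrix,
    assuming the matching image for caption i is at index i."""
    top1 = sim.argmax(dim=1)
    return (top1 == torch.arange(sim.size(0), device=sim.device)).float().mean().item()
```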

EIFY commented 1 year ago

@kohjingyu BTW I came back here because of the GILL paper, and it turns out that the embedding dimension of OPT-6.7B is 4096, not 1024. So the number given here

> embedding dimension d = 1024 (inherited from OPT-6.7B).

is wrong. The GILL paper states the correct embedding dimension 😅:

> We use the OPT-6.7B [61] model as the LLM backbone (which produce hidden states hθ with embedding dim e = 4096).

kohjingyu commented 1 year ago

Thanks so much for pointing this out! I'll fix it in the next version.