Closed Lycus99 closed 1 year ago
In the paper, you do not describe the similarity matrix size for the alternative images.
Hi!
Only one hard negative image for each example is sampled at runtime.
Basically, starting from an "initial" batch size of 256, you get a contrastive matrix of 512×1024: for the image part we have 256 images + 256 hard images, and for the text part we have 256 captions + 256 captions from the hard images + 256 hard captions + 256 hard captions from the hard images.
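To make the shapes concrete, here is a minimal NumPy sketch of how that 512×1024 similarity matrix arises. The variable names and the embedding dimension are hypothetical; this only illustrates the counting above, not the actual implementation.

```python
import numpy as np

N, d = 256, 512  # hypothetical batch size and embedding dimension

# Image side: N original images + N sampled hard-negative images -> 2N rows
img_emb = np.random.randn(2 * N, d)

# Text side: N captions + N captions of the hard images
#          + N hard captions + N hard captions of the hard images -> 4N columns
txt_emb = np.random.randn(4 * N, d)

# Contrastive similarity matrix: one row per image, one column per caption
logits = img_emb @ txt_emb.T
print(logits.shape)  # (512, 1024)
```

So the matrix is 2N×4N, which with N=256 gives exactly the 512×1024 stated above.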
Thanks! I think I understand your reply. But in the paper you said K=3 nearest neighbors are added to the batch. What does this mean, please?
Oh, we sample 1 of the 3 alternatives at each epoch. This means that for each epoch there is 1 hard image negative for each image; we don't put all 3 in the batch at the same time.
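A small sketch of that per-epoch sampling, assuming the K=3 nearest-neighbor alternatives have been mined offline. The data structure and function names here are made up for illustration only.

```python
import random

K = 3  # nearest-neighbor alternatives mined offline, per the paper

# Hypothetical: each image id maps to its K candidate hard-negative images
hard_candidates = {
    img_id: [f"img{img_id}_alt{k}" for k in range(K)]
    for img_id in range(4)
}

def sample_epoch_negatives(candidates):
    # Pick exactly one of the K alternatives per image; re-sampled every epoch,
    # so only 1 hard negative per image ever enters a given batch.
    return {img: random.choice(alts) for img, alts in candidates.items()}

negatives = sample_epoch_negatives(hard_candidates)
print(len(negatives))  # one hard negative per image
```

Over many epochs each of the 3 alternatives gets seen, but any single batch only ever contains one per image, which is why the matrix stays 2N×4N rather than growing with K.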
Thanks for your timely responses. Best wishes.
Nice work! I'd like to know the matrix size for the NegCLIP model training.
As shown in Figure 3, given a batch of N images and N captions, strong alternative images and corresponding captions are also used in this batch. Therefore, I think the current matrix size is (N+3N) × (N+3N) = 4N × 4N, because K=3 nearest neighbors are utilized. Additionally, when adding the negative captions, the matrix size is expanded to 4N × (4N+4N) = 4N × 8N.
Is my understanding correct? If not, what is the right training process?
Thanks