Closed Lycus99 closed 1 year ago
In the paper, you do not describe the similarity matrix size for the alternative images.
Hi!
Only one hard negative image for each example is sampled at runtime.
Basically, starting from an "initial" batch size of 256, you get a contrastive matrix of 512×1024: for the image part we have 256 images + 256 hard images, and for the text part we have 256 captions + 256 captions from the hard images + 256 hard captions + 256 hard captions from the hard images.
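To make the shapes concrete, here is a minimal NumPy sketch of how that 512×1024 similarity matrix arises. The variable names and the embedding dimension are hypothetical; this only illustrates the counting above, not the actual implementation.

```python
import numpy as np

N, d = 256, 512  # hypothetical batch size and embedding dimension

# Image side: N original images + N sampled hard-negative images -> 2N rows
img_emb = np.random.randn(2 * N, d)

# Text side: N captions + N captions of the hard images
#          + N hard captions + N hard captions of the hard images -> 4N columns
txt_emb = np.random.randn(4 * N, d)

# Contrastive similarity matrix: one row per image, one column per caption
logits = img_emb @ txt_emb.T
print(logits.shape)  # (512, 1024)
```

So the matrix is 2N×4N, which with N=256 gives exactly the 512×1024 stated above.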
Thanks! I think I understand your reply. But in the paper you said K=3 nearest neighbors are added to the batch. What does this mean, please?
Oh, we sample 1 of the 3 alternatives at each epoch. This means that for each epoch there is 1 hard image negative for each image; we don't put all 3 in the batch at the same time.
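A small sketch of that per-epoch sampling, assuming the K=3 nearest-neighbor alternatives have been mined offline. The data structure and function names here are made up for illustration only.

```python
import random

K = 3  # nearest-neighbor alternatives mined offline, per the paper

# Hypothetical: each image id maps to its K candidate hard-negative images
hard_candidates = {
    img_id: [f"img{img_id}_alt{k}" for k in range(K)]
    for img_id in range(4)
}

def sample_epoch_negatives(candidates):
    # Pick exactly one of the K alternatives per image; re-sampled every epoch,
    # so only 1 hard negative per image ever enters a given batch.
    return {img: random.choice(alts) for img, alts in candidates.items()}

negatives = sample_epoch_negatives(hard_candidates)
print(len(negatives))  # one hard negative per image
```

Over many epochs each of the 3 alternatives gets seen, but any single batch only ever contains one per image, which is why the matrix stays 2N×4N rather than growing with K.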
Thanks for your timely responses. Best wishes.
Nice work! I'd like to know the matrix size for the NegCLIP model training.
As shown in Figure 3, given a batch of N images and N captions, strong alternative images and corresponding captions are also used in this batch. Therefore, I think the current matrix size is (N+3N) × (N+3N) = 4N × 4N, because K=3 nearest neighbors are utilized. Additionally, when adding the negative captions, the matrix size is expanded to 4N × (4N+4N) = 4N × 8N.
Is my understanding correct? If not, what is the right training process?
Thanks