salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

ITM loss #126

Open ghost opened 1 year ago

ghost commented 1 year ago

Hi, thanks again for this great work!

During the pre-training phase, taking the VG dataset as an example, multiple captions correspond to the same image. It's not clear to me what happens in the ITM loss when the same image appears multiple times in a batch with different captions: one of those other captions can be sampled as a hard negative for the image and given label 0 (not a match), even though it is actually a valid description of that image. Could you please explain the reasoning here? Should we somehow prevent the same image from appearing multiple times in a batch to avoid this issue?
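One way I could imagine sidestepping this, sketched below under my own assumptions (the `sample_hard_negatives` helper and the `image_ids` argument are hypothetical, not part of the ALBEF code): when sampling hard negatives from the in-batch similarity matrix, mask out every candidate that shares a dataset-level image id with the anchor, not just the diagonal entry.

```python
import torch
import torch.nn.functional as F

def sample_hard_negatives(sim_t2i, image_ids):
    """For each text, sample a hard-negative image index within the batch,
    excluding any image that shares an id with the text's own image.

    sim_t2i:   (B, B) in-batch text-to-image similarity logits.
    image_ids: (B,) dataset-level image ids, so duplicated images
               (e.g. one VG image paired with several captions) share an id.
    """
    with torch.no_grad():
        weights = F.softmax(sim_t2i, dim=1)
        # Zero every column whose image id equals the anchor's id. This
        # covers both the positive pair (the diagonal) and any duplicate
        # of the same image elsewhere in the batch, so a valid caption of
        # the image can never be drawn as a "negative".
        same_image = image_ids.unsqueeze(0) == image_ids.unsqueeze(1)
        weights = weights.masked_fill(same_image, 0.0)
        # Sample one hard negative per text from the remaining candidates
        # (assumes each row still has at least one distinct image).
        neg_idx = torch.multinomial(weights, 1).squeeze(1)
    return neg_idx
```

With a batch of ids like `[0, 0, 1, 2]`, the two captions of image 0 can then never be drawn as hard negatives for each other.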

HWH-2000 commented 11 months ago

My recent work has the same problem. Because of overlapping text or images in a batch, the model cannot learn to separate the negative samples, so the ITC and ITM losses do not converge. Have you solved this problem?

Thanks!
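For the ITC side of the non-convergence mentioned above, one possible mitigation (my own sketch, not something from the ALBEF repo; the `itc_targets` helper and `image_ids` argument are hypothetical) is to replace the one-hot identity targets with a target matrix that treats every caption of the same image as a positive, splitting the probability mass among them:

```python
import torch

def itc_targets(image_ids):
    """Build soft ITC targets for an in-batch (B, B) similarity matrix.

    image_ids: (B,) dataset-level image ids; rows/columns sharing an id
               are duplicates of the same image with different captions.
    Returns a (B, B) matrix where each row sums to 1 and all entries
    belonging to the same image are marked positive, instead of the
    usual identity matrix that falsely labels duplicates as negatives.
    """
    pos = (image_ids.unsqueeze(0) == image_ids.unsqueeze(1)).float()
    return pos / pos.sum(dim=1, keepdim=True)
```

These soft targets can then be used with a cross-entropy over the softmaxed similarities, e.g. `-(targets * log_softmax(sim, dim=1)).sum(1).mean()`, so duplicated images stop pulling the contrastive loss in contradictory directions.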