salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

ITM loss #126

Open ghost opened 1 year ago

ghost commented 1 year ago

Hi, thanks again for this great work!

During the pre-training phase, taking the VG dataset as an example, multiple captions correspond to the same image. It's not clear to me what happens in the ITM loss when the same image appears multiple times in a batch with different captions: one of those other captions can be sampled as a hard negative for the image and given label 0 (not a match), even though it is actually a valid description of that image. Could you please explain the reasoning here? Should we somehow prevent the same image from appearing multiple times in a batch to avoid this issue?
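One way I could imagine sidestepping this, sketched below under my own assumptions (the `sample_hard_negatives` helper and the `image_ids` argument are hypothetical, not part of the ALBEF code): when sampling hard negatives from the in-batch similarity matrix, mask out every candidate that shares a dataset-level image id with the anchor, not just the diagonal entry.

```python
import torch
import torch.nn.functional as F

def sample_hard_negatives(sim_t2i, image_ids):
    """For each text, sample a hard-negative image index within the batch,
    excluding any image that shares an id with the text's own image.

    sim_t2i:   (B, B) in-batch text-to-image similarity logits.
    image_ids: (B,) dataset-level image ids, so duplicated images
               (e.g. one VG image paired with several captions) share an id.
    """
    with torch.no_grad():
        weights = F.softmax(sim_t2i, dim=1)
        # Zero every column whose image id equals the anchor's id. This
        # covers both the positive pair (the diagonal) and any duplicate
        # of the same image elsewhere in the batch, so a valid caption of
        # the image can never be drawn as a "negative".
        same_image = image_ids.unsqueeze(0) == image_ids.unsqueeze(1)
        weights = weights.masked_fill(same_image, 0.0)
        # Sample one hard negative per text from the remaining candidates
        # (assumes each row still has at least one distinct image).
        neg_idx = torch.multinomial(weights, 1).squeeze(1)
    return neg_idx
```

With a batch of ids like `[0, 0, 1, 2]`, the two captions of image 0 can then never be drawn as hard negatives for each other.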

HWH-2000 commented 11 months ago

My recent work has the same problem. Because of overlapping text or images in a batch, the model cannot learn to separate the negative samples, so the ITC and ITM losses do not converge. Have you solved this problem?

Thanks!
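For the ITC side of the non-convergence mentioned above, one possible mitigation (my own sketch, not something from the ALBEF repo; the `itc_targets` helper and `image_ids` argument are hypothetical) is to replace the one-hot identity targets with a target matrix that treats every caption of the same image as a positive, splitting the probability mass among them:

```python
import torch

def itc_targets(image_ids):
    """Build soft ITC targets for an in-batch (B, B) similarity matrix.

    image_ids: (B,) dataset-level image ids; rows/columns sharing an id
               are duplicates of the same image with different captions.
    Returns a (B, B) matrix where each row sums to 1 and all entries
    belonging to the same image are marked positive, instead of the
    usual identity matrix that falsely labels duplicates as negatives.
    """
    pos = (image_ids.unsqueeze(0) == image_ids.unsqueeze(1)).float()
    return pos / pos.sum(dim=1, keepdim=True)
```

These soft targets can then be used with a cross-entropy over the softmaxed similarities, e.g. `-(targets * log_softmax(sim, dim=1)).sum(1).mean()`, so duplicated images stop pulling the contrastive loss in contradictory directions.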