Grounding script vs retrieval

The grounding script has a train function identical to the one in the retrieval script, and it uses the same retrieval model. The only difference I can see is that it is trained on RefCOCO rather than MSCOCO.

The loss functions are also the same. I understand that this is weakly supervised grounding, but what are the differences, apart from the dataset and the evaluation code? The only one I can find is in the loss functions: the idx passed in is an image index for retrieval and an object index for grounding. So the same code is applicable to both tasks. Is that correct?
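To make the question concrete, here is a minimal, framework-free sketch of what I mean. It assumes the contrastive loss marks two samples as positives whenever their idx values are equal, which is my reading of the code and not something confirmed here. The names `soft_targets` and `contrastive_loss`, the similarity values, and the ids are all made up for illustration; this is not the repository's implementation.

```python
import math

def soft_targets(idx):
    """Row i puts equal mass on every column j where idx[j] == idx[i]."""
    targets = []
    for a in idx:
        row = [1.0 if a == b else 0.0 for b in idx]
        total = sum(row)  # at least 1, since every id matches itself
        targets.append([v / total for v in row])
    return targets

def contrastive_loss(sim, targets):
    """Mean cross-entropy between softmax(sim) rows and the soft targets."""
    losses = []
    for sim_row, tgt_row in zip(sim, targets):
        m = max(sim_row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in sim_row))
        losses.append(-sum(t * (s - log_z) for s, t in zip(sim_row, tgt_row)))
    return sum(losses) / len(losses)

# Hypothetical retrieval batch: two captions of image 7, one caption of image 9.
retrieval_idx = [7, 7, 9]
# Hypothetical grounding batch: two expressions for object 3, one for object 5.
grounding_idx = [3, 3, 5]

# Made-up similarity matrix (rows: visual inputs, columns: texts).
sim = [[2.0, 1.5, 0.1],
       [1.4, 2.2, 0.3],
       [0.2, 0.1, 1.9]]

retrieval_targets = soft_targets(retrieval_idx)
grounding_targets = soft_targets(grounding_idx)

retrieval_loss = contrastive_loss(sim, retrieval_targets)
grounding_loss = contrastive_loss(sim, grounding_targets)
```

Under that assumption, the loss only sees the equality pattern of the ids, so it cannot tell whether they refer to images or to objects. That is why the same training code looks applicable to both tasks to me.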

salesforce / ALBEF

Grounding script vs retrieval #103