Question Related to VIVO Pre-training

Hi,

In the VinVL paper, you mention the following:

By adding VIVO [9] pre-training, our VinVL improves the original VIVO result by 6 CIDEr points and creates a new SoTA.

As far as I know, VIVO pre-trains a transformer model on an object detection dataset with a masked object prediction task. The model is then fine-tuned on an image captioning dataset.
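To make my understanding concrete, this is roughly what I mean by the masking step of masked object prediction. It is only a sketch of my own, not code from VIVO or from this repository: `mask_tags`, `MASK_TOKEN`, and the 15% masking rate are hypothetical names and values chosen for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder token, an assumption


def mask_tags(tags, mask_prob=0.15, rng=None):
    """Randomly hide object tags so a model can be trained to predict them.

    Returns (masked_tags, labels). labels[i] holds the original tag at
    masked positions and None at positions that were left unchanged.
    The 0.15 default is an assumed value, not taken from the paper.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tag in tags:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the tag from the model
            labels.append(tag)         # keep it as the prediction target
        else:
            masked.append(tag)
            labels.append(None)        # no loss at this position
    return masked, labels


if __name__ == "__main__":
    tags = ["dog", "frisbee", "grass", "person"]
    print(mask_tags(tags, mask_prob=0.5, rng=random.Random(0)))
```

In the setup I have in mind, the masked tags would be fed to the transformer together with the image region features, and the loss would be computed only at the masked positions.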

As far as I can see, you applied the same approach as VIVO to obtain the results in Table 9 of the paper, but I could not find the code for this experiment. Could you point me to it or share the relevant parts?

Thanks.

microsoft / Oscar

Question Related to VIVO Pre-training #176