Hi,

In the VinVL paper, you mention the following:

> By adding VIVO [9] pre-training, our VinVL improves the original VIVO result by 6 CIDEr points and creates a new SoTA.

As far as I know, VIVO pre-trains a transformer model on an object detection dataset with a masked object prediction task, and this model is then further trained on an image captioning dataset.

In your experiments, you seem to apply the same approach as VIVO for the results in Table 9 of the paper, but I could not find the code related to this experiment. Could you point out or share the code for those experiments?
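For reference, here is a minimal sketch of what I understand the masked object prediction objective to look like on the data side: BERT-style masking applied to detected object tags instead of caption tokens. This is purely my own assumption about the setup (the tag list, `mask_tags` helper, and masking rate are hypothetical), not code from this repository:

```python
import random

MASK = "[MASK]"  # placeholder token standing in for a masked object tag

def mask_tags(tags, mask_prob=0.15, rng=None):
    """Randomly replace object tags with [MASK]; return (inputs, labels).

    labels[i] holds the original tag where it was masked (the prediction
    target), and None otherwise (ignored in the loss), mirroring masked
    language modeling but over detected object tags rather than words.
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tag in tags:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tag)    # model must recover the original tag
        else:
            inputs.append(tag)
            labels.append(None)   # unmasked position, not scored
    return inputs, labels

# Hypothetical detected tags for one image; half are masked on average here.
inputs, labels = mask_tags(["dog", "frisbee", "grass", "person"], mask_prob=0.5)
```

Is this roughly how the Table 9 pre-training is set up, or does the actual code differ?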
Thanks.