How to use Oscar / VinVL for image-text retrieval inference?

Hello.

I need to run simple inference for image-text retrieval: I want a matching score for an image and a caption, as presented here for ViLT.

I've installed the package and am running run_retrieval.py.

I'm trying to follow the instructions in the model zoo. By the way, the checkpoint for VinVL doesn't work, but I can use the Oscar model checkpoint as well.

What should --eval_model_dir be? You write "# could be base/large models."; I pointed it to the model downloaded from the model zoo.

However, when I run it, I get: FileNotFoundError: [Errno 2] No such file or directory: 'datasets/coco_ir/test_captions.pt'

I found the coco_ir download on the Download page, but it's 20GB and my network can't download a file that large. Is it mandatory for running simple inference?

Is there a simple way to receive a matching score given an image and a possible caption?

Thank you
