Implementation of MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding.
Some of our code is based on ban-vqa. Thanks!
TODO: provide a Faster R-CNN feature extraction script.
We use the Flickr30k dataset to train and validate our model.
The raw dataset can be found at Flickr30k Entities Annotations.
Run
sh tools/prepare_data.sh
to download and process the Flickr30k Annotations, Images, and GloVe word embeddings.
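For reference, here is a minimal sketch of loading the downloaded GloVe file into a word-to-vector dictionary; the file path and embedding dimension below are assumptions, so use whatever prepare_data.sh actually fetches.

```python
# Sketch: load GloVe embeddings into a dict (path and dimension are assumptions).
import numpy as np

def load_glove(path="data/glove/glove.6B.300d.txt"):
    """Map each word in the GloVe file to its embedding vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Example usage: glove = load_glove(); glove["dog"].shape -> (300,)
```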
We use an off-the-shelf Faster R-CNN pretrained on Visual Genome to generate object proposals and labels, and Bottom-Up Attention for visual features.
As Issue #1 pointed out, there is some inconsistency between the features generated by our script (Faster R-CNN) and those from Bottom-Up Attention. We therefore provide our generated features for download.
Download train_features_compress.hdf5 (6 GB), val_features_compress.hdf5, and test_features_compress.hdf5 to data/flickr30k/.
Alternative link for train_feature.hdf5 (20 GB, same features): Google Drive; Baidu Drive, code: n1yd.
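A quick way to sanity-check a downloaded feature file is to open it with h5py and list its top-level entries; this sketch does not assume any particular dataset names or shapes, it only enumerates what is stored.

```python
# Sketch: inspect the downloaded HDF5 feature file without assuming its layout.
import h5py

with h5py.File("data/flickr30k/train_features_compress.hdf5", "r") as f:
    for name, item in f.items():
        if isinstance(item, h5py.Dataset):
            print(name, item.shape, item.dtype)
        else:
            print(name, "(group)")
```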
Download train_detection_dict.json, val_detection_dict.json, and test_detection_dict.json to data/.
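The detection dictionaries are plain JSON, so a quick peek at one entry shows how they are organized; the sketch below assumes the top level is a dict keyed by image ID, which you should confirm on a sample.

```python
# Sketch: peek at a detection dict; the value structure is an assumption,
# so we only print one entry to see what it contains.
import json

with open("data/train_detection_dict.json") as f:
    detections = json.load(f)

print(len(detections), "entries")
first_key = next(iter(detections))
print(first_key, "->", detections[first_key])
```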
Run
sh tools/prepare_detection.sh
to clone the Faster R-CNN code and download the pre-trained models.
Run
sh tools/run_faster_rcnn.sh
to run Faster R-CNN detection on the Flickr30k dataset and generate the features.
You may have to customize your environment in order to run Faster R-CNN successfully; see the prerequisites.
To train the model, run
python main.py [args]
In our experiments, we obtain ~61% accuracy with the default settings.
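Accuracy here refers to the standard Flickr30k Entities grounding metric: a phrase counts as correctly grounded when the selected proposal overlaps the ground-truth box with IoU >= 0.5. A minimal sketch of that metric (the [x1, y1, x2, y2] box format is an assumption):

```python
# Sketch of the standard grounding metric: a prediction is correct when the
# selected box has IoU >= 0.5 with the ground-truth box.
# Boxes are assumed to be [x1, y1, x2, y2].
def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def grounding_accuracy(predicted_boxes, gt_boxes, threshold=0.5):
    """Fraction of phrases whose predicted box matches ground truth at the IoU threshold."""
    correct = sum(iou(p, g) >= threshold for p, g in zip(predicted_boxes, gt_boxes))
    return correct / max(len(gt_boxes), 1)
```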
Our trained model can be downloaded from Google Drive.
To evaluate a saved model, run
python test.py --file <saved model>