Implementation of MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding.
Some of our code is based on ban-vqa. Thanks!
TODO: provide a Faster R-CNN feature extraction script.
We use the Flickr30k dataset to train and validate our model.
The raw dataset can be found at Flickr30k Entities Annotations.
Run
sh tools/prepare_data.sh
to download and process the Flickr30k Annotations, Images, and GloVe word embeddings.
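For reference, here is a minimal sketch of loading the downloaded GloVe file into a word-to-vector dictionary; the file path and embedding dimension below are assumptions, so use whatever prepare_data.sh actually fetches.

```python
# Sketch: load GloVe embeddings into a dict (path and dimension are assumptions).
import numpy as np

def load_glove(path="data/glove/glove.6B.300d.txt"):
    """Map each word in the GloVe file to its embedding vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Example usage: glove = load_glove(); glove["dog"].shape -> (300,)
```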
We use an off-the-shelf Faster R-CNN pretrained on Visual Genome to generate object proposals and labels, and Bottom-Up Attention for visual features.
As Issue #1 pointed out, there is some inconsistency between the features generated by our script (Faster R-CNN) and those from Bottom-Up Attention. We therefore provide our generated features for download.
Download train_features_compress.hdf5 (6 GB), val_features_compress.hdf5, and test_features_compress.hdf5 to data/flickr30k/.
Alternative link for train_feature.hdf5 (20 GB, same features): Google Drive; Baidu Drive, code: n1yd.
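A quick way to sanity-check a downloaded feature file is to open it with h5py and list its top-level entries; this sketch does not assume any particular dataset names or shapes, it only enumerates what is stored.

```python
# Sketch: inspect the downloaded HDF5 feature file without assuming its layout.
import h5py

with h5py.File("data/flickr30k/train_features_compress.hdf5", "r") as f:
    for name, item in f.items():
        if isinstance(item, h5py.Dataset):
            print(name, item.shape, item.dtype)
        else:
            print(name, "(group)")
```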
Download train_detection_dict.json, val_detection_dict.json, and test_detection_dict.json to data/.
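The detection dictionaries are plain JSON, so a quick peek at one entry shows how they are organized; the sketch below assumes the top level is a dict keyed by image ID, which you should confirm on a sample.

```python
# Sketch: peek at a detection dict; the value structure is an assumption,
# so we only print one entry to see what it contains.
import json

with open("data/train_detection_dict.json") as f:
    detections = json.load(f)

print(len(detections), "entries")
first_key = next(iter(detections))
print(first_key, "->", detections[first_key])
```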
Run
sh tools/prepare_detection.sh
to clone the Faster R-CNN code and download the pre-trained models.
Run
sh tools/run_faster_rcnn.sh
to run Faster R-CNN detection on the Flickr30k dataset and generate the features.
You may have to customize your environment in order to run Faster R-CNN successfully; see the prerequisites.
To train the model, run
python main.py [args]
In our experiments, we obtain ~61% accuracy with the default settings.
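Accuracy here refers to the standard Flickr30k Entities grounding metric: a phrase counts as correctly grounded when the selected proposal overlaps the ground-truth box with IoU >= 0.5. A minimal sketch of that metric (the [x1, y1, x2, y2] box format is an assumption):

```python
# Sketch of the standard grounding metric: a prediction is correct when the
# selected box has IoU >= 0.5 with the ground-truth box.
# Boxes are assumed to be [x1, y1, x2, y2].
def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def grounding_accuracy(predicted_boxes, gt_boxes, threshold=0.5):
    """Fraction of phrases whose predicted box matches ground truth at the IoU threshold."""
    correct = sum(iou(p, g) >= threshold for p, g in zip(predicted_boxes, gt_boxes))
    return correct / max(len(gt_boxes), 1)
```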
Our trained model can be downloaded from Google Drive.
To evaluate a saved model, run
python test.py --file <saved model>