This repository implements ADAPT, a novel approach for training image-text alignment models.
ADAPT adjusts an intermediate representation of instances from a modality *a* using an embedding vector of an instance from a modality *b*. This adaptation filters and enhances important information across internal features, producing guided vector representations; the effect resembles that of attention modules, while being far more computationally efficient. For further information, please read our AAAI 2020 paper.
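To give an intuition of the mechanism, below is a minimal PyTorch sketch of such a feature-wise adaptation, assuming a scale-and-shift formulation; the names `AdaptiveGate`, `gamma_fc`, and `beta_fc` are hypothetical and do not correspond to identifiers in this repository. See the paper for the exact formulation.

```python
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    """Illustrative feature-wise adaptation: scale and shift the intermediate
    features of modality a, conditioned on an embedding from modality b.
    Hypothetical sketch only, not the exact layer used by ADAPT."""

    def __init__(self, feat_dim, cond_dim):
        super().__init__()
        # Two linear projections produce a multiplicative (gamma) and an
        # additive (beta) vector from the conditioning embedding.
        self.gamma_fc = nn.Linear(cond_dim, feat_dim)
        self.beta_fc = nn.Linear(cond_dim, feat_dim)

    def forward(self, feats, cond):
        # feats: (batch, n_items, feat_dim) intermediate features (modality a)
        # cond:  (batch, cond_dim) embedding of one instance (modality b)
        gamma = self.gamma_fc(cond).unsqueeze(1)  # (batch, 1, feat_dim)
        beta = self.beta_fc(cond).unsqueeze(1)    # (batch, 1, feat_dim)
        # Element-wise filtering/enhancement of every feature vector.
        return gamma * feats + beta
```

Under this reading, the conditioning amounts to two linear projections plus element-wise operations, which is what keeps it cheaper than computing explicit attention weights over all feature pairs.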
Python 2 is not supported. We recommend installing Python 3 with Anaconda and then creating a dedicated environment:
```bash
conda create --name adapt python=3
conda activate adapt
git clone https://github.com/jwehrmann/retrieval.pytorch
cd retrieval.pytorch
pip install -r requirements.txt
```
```bash
wget https://scanproject.blob.core.windows.net/scan-data/data.zip
conda activate adapt
export DATA_PATH=/path/to/dataset
```
You can also create a shell alias (a shortcut that expands to a command). For example, add this line to your shell profile:
```bash
alias adapt='conda activate adapt && export DATA_PATH=/path/to/dataset'
```
Then simply run the alias name to have everything configured:

```bash
$ adapt
```
You can reproduce our main results using the following scripts.
Train and evaluate on Flickr30k:

```bash
python run.py options/adapt/f30k/t2i.yaml
python test.py options/adapt/f30k/t2i.yaml -data_split test
python run.py options/adapt/f30k/i2t.yaml
python test.py options/adapt/f30k/i2t.yaml -data_split test
```
Train and evaluate on MS COCO:

```bash
python run.py options/adapt/coco/t2i.yaml
python test.py options/adapt/coco/t2i.yaml -data_split test
python run.py options/adapt/coco/i2t.yaml
python test.py options/adapt/coco/i2t.yaml -data_split test
```
To ensemble multiple models (ADAPT-Ens), use:
MS COCO models:
```bash
python test_ens.py options/adapt/coco/t2i.yaml options/adapt/coco/i2t.yaml -data_split test
```
F30k models:
```bash
python test_ens.py options/adapt/f30k/t2i.yaml options/adapt/f30k/i2t.yaml -data_split test
```
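For intuition, ensembling in this kind of retrieval setup is commonly done by averaging the image-text similarity matrices produced by the individual models before ranking. The sketch below assumes that convention; the actual logic in `test_ens.py` may differ, and `ensemble_similarities` is a hypothetical helper name.

```python
import numpy as np

def ensemble_similarities(sim_matrices):
    """Average per-model similarity matrices (illustrative sketch).

    sim_matrices: list of (n_images, n_captions) arrays, one per model.
    The averaged matrix can then be ranked row-wise for image annotation
    and column-wise for image retrieval.
    """
    return np.mean(np.stack(sim_matrices, axis=0), axis=0)

# e.g., combining the scores of the t2i and i2t models:
# sims = ensemble_similarities([sims_t2i, sims_i2t])
```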
We make available all the main models generated in this research. Each archive contains the best model of its run (selected by validation performance), the last checkpoint generated, all TensorBoard logs (loss and recall curves), result files, and the configuration options used for training.
| Dataset | Model | Image Annotation R@1 | Image Retrieval R@1 |
|---|---|---|---|
| F30k | ADAPT-t2i | 76.4% | 57.8% |
| F30k | ADAPT-i2t | 66.3% | 53.8% |
| F30k | ADAPT-ens | 76.2% | 60.5% |
| COCO | ADAPT-t2i | 75.4% | 64.0% |
| COCO | ADAPT-i2t | 67.2% | 57.8% |
| COCO | ADAPT-ens | 75.3% | 64.4% |
If you find this research or code useful, please consider citing our paper:
```
@inproceedings{wehrmann2020adaptive,
  title={Adaptive Cross-modal Embeddings for Image-Text Alignment},
  author={Wehrmann, J{\^o}natas and Kolling, Camila and Barros, Rodrigo C},
  booktitle={The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020)},
  year={2020}
}
```