This is a PyTorch implementation of Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
The code is written entirely in Python, and you will need a GPU to train the model.
You also need to install the following packages to run the code successfully.
You are free to use MSCOCO, Flickr8k, or Flickr30k as your dataset.
You can use the following commands to download the MSCOCO dataset:
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
We will use Andrej Karpathy's training, validation, and test splits. To download the zip file, you can use the following command:
wget http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip
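The extracted zip contains JSON files such as dataset_coco.json, which assign each image to a split and store its tokenized captions. As a quick sanity check, you could inspect the file along these lines (a minimal sketch; field names follow Karpathy's released format):

# Minimal sketch (not part of this repo): inspect Karpathy's split file.
import json
from collections import Counter

with open('dataset_coco.json') as f:   # assumed file name from the extracted zip
    data = json.load(f)

# Each entry describes one image, the split it belongs to, and its captions.
print(Counter(img['split'] for img in data['images']))
print(data['images'][0]['sentences'][0]['tokens'])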
To preprocess the MSCOCO data, you can use the following commands:
mkdir coco_folder
python create_input_files.py -d coco -i [YOUR-IMAGE-FOLDER]
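This step turns the Karpathy JSON and the raw images into model-ready files. As a rough illustration of the kind of work involved, and not the exact logic of create_input_files.py (the frequency threshold, special tokens, and output path below are assumptions), a word map could be built like this:

# Illustrative sketch only: build a word map from the Karpathy captions.
import json
from collections import Counter

with open('dataset_coco.json') as f:
    data = json.load(f)

# Count word frequencies over all captions.
word_freq = Counter()
for img in data['images']:
    for sent in img['sentences']:
        word_freq.update(sent['tokens'])

# Keep words seen at least 5 times (assumed threshold); everything else maps to <unk>.
words = [w for w, c in word_freq.items() if c >= 5]
word_map = {w: i + 1 for i, w in enumerate(words)}
word_map['<unk>'] = len(word_map) + 1
word_map['<start>'] = len(word_map) + 1
word_map['<end>'] = len(word_map) + 1
word_map['<pad>'] = 0

with open('coco_folder/WORDMAP_coco.json', 'w') as f:  # assumed output location
    json.dump(word_map, f)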
Use the following command to train the model on the MSCOCO dataset:
python train.py -d coco
For comparison, you may also want to train the model with soft attention (paper):
python train.py -d coco -a
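What distinguishes the adaptive model from soft attention is the visual sentinel: an extra attention slot that lets the decoder fall back on its language state instead of the image when predicting non-visual words. Below is a minimal sketch of the adaptive attention step, following the paper's formulation; module and variable names are illustrative, not the ones used in this repository, and the feature and hidden sizes are assumed equal so the sentinel and the spatial context can be mixed directly:

import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    """Adaptive attention over k spatial features plus a visual sentinel (sketch)."""

    def __init__(self, feat_dim, att_dim):
        super().__init__()
        self.feat_att = nn.Linear(feat_dim, att_dim)      # W_v
        self.hidden_att = nn.Linear(feat_dim, att_dim)    # W_g (assumes hidden size == feat_dim)
        self.sentinel_att = nn.Linear(feat_dim, att_dim)  # W_s
        self.full_att = nn.Linear(att_dim, 1)             # w_h

    def forward(self, feats, sentinel, h):
        # feats: (batch, k, feat_dim); sentinel, h: (batch, feat_dim)
        att_h = self.hidden_att(h).unsqueeze(1)                                     # (batch, 1, att_dim)
        z = self.full_att(torch.tanh(self.feat_att(feats) + att_h)).squeeze(2)      # (batch, k)
        z_s = self.full_att(torch.tanh(self.sentinel_att(sentinel).unsqueeze(1) + att_h)).squeeze(2)  # (batch, 1)

        # Softmax over the k regions plus one extra slot for the sentinel.
        alpha_hat = torch.softmax(torch.cat([z, z_s], dim=1), dim=1)                # (batch, k + 1)
        beta = alpha_hat[:, -1:]                                                    # "when to look" gate
        c = (alpha_hat[:, :-1].unsqueeze(2) * feats).sum(dim=1)                     # spatial context
        c_hat = beta * sentinel + (1 - beta) * c                                    # adaptive context
        return c_hat, alpha_hat[:, :-1], beta

In the paper, the sentinel itself is produced during the decoder's LSTM step by gating the memory cell, and the mixing weight beta is simply the last element of the extended attention distribution.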
You can choose different beam sizes during evaluation. Use the following command to compute all BLEU scores (BLEU-1 to BLEU-4):
python eval.py -d coco -cf [PATH-TO-CHECKPOINT] -b 5
Note that the best checkpoint during training is selected based on the BLEU-4 score.
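BLEU-1 through BLEU-4 can be computed with a standard implementation such as NLTK's corpus_bleu; the evaluation script may use a different BLEU implementation, but a rough equivalent looks like this:

# Illustrative only: corpus-level BLEU-1..4 over captions produced by the model.
from nltk.translate.bleu_score import corpus_bleu

# `references` holds all tokenized ground-truth captions per image;
# `hypotheses` holds one tokenized generated caption per image.
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass'], ['a', 'dog', 'running', 'outside']]]
hypotheses = [['a', 'dog', 'runs', 'outside']]

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    print(f'BLEU-{n}: {corpus_bleu(references, hypotheses, weights=weights):.4f}')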
To caption your own image, you can use the following command:
python caption.py
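If you adapt the script to your own pipeline, a typical way to load and normalize an image for an ImageNet-pretrained encoder is shown below; the resize size and normalization constants are common defaults, not necessarily the exact values used in caption.py:

# Illustrative preprocessing for a single image; caption.py may differ in details.
from PIL import Image
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.Resize((256, 256)),                       # assumed input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

img = transform(Image.open('my_image.jpg').convert('RGB')).unsqueeze(0)  # (1, 3, 256, 256)
# `img` is then fed to the encoder/decoder loaded from the trained checkpoint.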
If you use this code as part of any published research, please acknowledge the following paper:
@inproceedings{Lu2017Adaptive,
  author    = {Lu, Jiasen and Xiong, Caiming and Parikh, Devi and Socher, Richard},
  title     = {Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning},
  booktitle = {CVPR},
  year      = {2017}
}
This code builds on a-PyTorch-Tutorial-to-Image-Captioning.