This is the PyTorch implementation of "Are scene graphs good enough to improve Image Captioning?". Training and evaluation are done on the MSCOCO image captioning challenge dataset. Bottom-up features for the MSCOCO dataset are extracted using a Faster R-CNN object detection model trained on the Visual Genome dataset. Pretrained bottom-up features are downloaded from here.
This repository keeps each model design in its own branch, and the branch name indicates the model design. It is best to avoid the main branch for now, since it is outdated.
Create a folder called 'data'
Create a folder called 'final_dataset'
Download the MSCOCO Training (13GB) and Validation (6GB) images.
Also download Andrej Karpathy's training, validation, and test splits. This zip file contains the captions.
Unzip all files and place the folders in the 'data' folder.
Next, download the bottom up image features. We used the fixed 36 regions version.
Unzip the archive and place the unzipped folder in the 'bottom-up_features' folder.
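The folder setup above can be sketched in Python. This is a minimal sketch, not part of the repository: the helper name `prepare_layout` is hypothetical, and it assumes the three folders sit directly under the repository root. It creates 'data' and 'final_dataset' if they are missing and reports whether each expected folder is present.

```python
from pathlib import Path

# Folder names come from the steps above; their location under the
# repository root is an assumption.
EXPECTED_DIRS = ("data", "final_dataset", "bottom-up_features")


def prepare_layout(root="."):
    """Create 'data' and 'final_dataset' under root and report folder status.

    Returns a dict mapping each expected folder name to True if it exists.
    'bottom-up_features' is only checked, never created, because it is
    expected to hold tsv.py already.
    """
    root = Path(root)
    for name in ("data", "final_dataset"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return {name: (root / name).is_dir() for name in EXPECTED_DIRS}
```

Calling `prepare_layout()` from the repository root and printing the returned dict gives a quick check that everything is in place before running the feature conversion.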
Next, run this command in a Python environment:
python bottom-up_features/tsv.py
This command will create the following files -
Optionally, for the scene graphs, also run the following:
python create_input_files.py
This will create similar HDF5 and PKL files.
Move these files to the folder 'final_dataset'.
Next, run this command. If you don't want to prepare the scene-graph features, remove the -s flag:
python create_input_files.py -s
This command will create the following files -
Although we use the official COCO captioning evaluation scripts, we have kept the nlg_eval_master folder for legacy reasons.
Next, go to the nlg_eval_master folder and run the following two commands:
pip install -e .
nlg-eval --setup
This will install all the files needed for evaluation.
To train the bottom-up top-down model, run:
python train.py
To evaluate the model on the Karpathy test split, edit the eval.py file to include the model checkpoint location and then run:
python eval.py
Beam search is used to generate captions during evaluation. At each time step t, beam search takes the k best partial sentences found so far, extends each of them by one word to form candidates of length t + 1, and keeps only the k best of the resulting candidates. A beam size of five is used for inference.
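The procedure can be illustrated with a small, framework-free sketch. This is not the code in eval.py: `beam_search`, `step_fn`, and `toy_step` are hypothetical names, and `step_fn` stands in for the caption decoder by returning a log-probability for each possible next token. The toy probabilities are made up to show a case where greedy decoding and beam search disagree.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_size=5, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(sequence) must return a dict mapping each candidate next token
    to its log-probability given the sequence so far.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Extend every live sequence by one token.
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((seq + [token], score + logp))
        if not candidates:
            break
        # Keep only the beam_size best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == end_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Prefer completed sequences; fall back to live ones if none finished.
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]


def toy_step(seq):
    """Hypothetical stand-in for a decoder: next-token log-probabilities."""
    table = {
        "<start>": {"a": math.log(0.6), "the": math.log(0.4)},
        "a": {"dog": math.log(0.55), "cat": math.log(0.45)},
        "the": {"dog": math.log(0.9), "cat": math.log(0.1)},
        "dog": {"<end>": 0.0},
        "cat": {"<end>": 0.0},
    }
    return table[seq[-1]]


greedy = beam_search(toy_step, "<start>", "<end>", beam_size=1)
beam5 = beam_search(toy_step, "<start>", "<end>", beam_size=5)
print(greedy)  # ['<start>', 'a', 'dog', '<end>']
print(beam5)   # ['<start>', 'the', 'dog', '<end>']
```

With a beam size of one the search commits to "a" at the first step, giving a sequence with probability 0.33. With a beam size of five it keeps "the" alive and finds the better sequence, which has probability 0.36.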
The metrics reported are ones used most often in relation to image captioning and include BLEU-4, CIDEr, METEOR and ROUGE-L. Official MSCOCO evaluation scripts are used for measuring these scores.
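For intuition about what BLEU-4 measures, the sketch below computes clipped n-gram precision, the quantity that BLEU combines over n = 1 to 4 together with a brevity penalty. It is only an illustration under simplifying assumptions (whitespace tokenization, no brevity penalty, no corpus-level aggregation). The function names are hypothetical, and the reported scores come from the official MSCOCO evaluation scripts, not from this code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that appear in some reference.

    Each n-gram is credited at most as many times as it occurs in the
    single reference where it is most frequent.
    """
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(clipped_precision("the the the the the the the".split(), refs, 1))  # 2/7
print(clipped_precision("a cat on the mat".split(), refs, 2))             # 1.0
```

The first call shows why clipping matters: a caption that repeats "the" seven times is credited for only two of them, because no reference contains the word more than twice.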
Code adapted with thanks from https://github.com/poojahira/image-captioning-bottom-up-top-down