ruotianluo / DiscCaptioning

Code for Discriminability objective for training descriptive captions (CVPR 2018)

Discriminability objective for training descriptive captions

This is the implementation of the paper Discriminability objective for training descriptive captions.

Requirements

Python 2.7 (because there is no coco-caption version for python 3)

PyTorch 1.0 (along with torchvision)

Java 1.8 (for coco-caption)

Downloads

Clone the repository

git clone --recursive https://github.com/ruotianluo/DiscCaptioning.git

Data split

In this paper we use the data split from Context-aware Captions from Context-agnostic Supervision. It differs from the standard Karpathy split, so a different set of files has to be downloaded.

Download link: Google drive link

To train on your own, you only need to download dataset_coco.json, but downloading cocotalk.json and cocotalk_label.h5 as well is recommended. To run the pretrained model, you need all three files.
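As a quick sanity check before training or evaluation, the sketch below reports which of these files are still missing. It assumes the files are placed under data/, as in the preprocessing commands later in this README; the helper name missing_files is made up for illustration and is not part of the repository.

```python
import os

# File names are taken from this README; the data/ location is an assumption
# based on the paths used by the preprocessing commands.
REQUIRED_FOR_TRAINING = ["dataset_coco.json"]
REQUIRED_FOR_PRETRAINED = ["dataset_coco.json", "cocotalk.json", "cocotalk_label.h5"]


def missing_files(data_dir, names):
    """Return the entries of `names` that do not exist inside `data_dir`."""
    return [n for n in names if not os.path.isfile(os.path.join(data_dir, n))]


if __name__ == "__main__":
    missing = missing_files("data", REQUIRED_FOR_PRETRAINED)
    if missing:
        print("Missing files in data/: " + ", ".join(missing))
    else:
        print("All files needed for the pretrained model are present.")
```

Pass REQUIRED_FOR_TRAINING instead if you only plan to train from scratch and will generate the other two files during preprocessing.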

coco-caption

cd coco-caption
bash ./get_stanford_models.sh
cd annotations
# Download captions_val2014.json from the google drive link above to this folder
cd ../../

We replace captions_val2014.json because the original file can only evaluate images from the val2014 set, whereas we use rama's split.
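One way to confirm that the replacement annotation file covers the evaluation images is sketched below. It assumes dataset_coco.json follows the usual Karpathy-style layout (an "images" list whose entries carry "cocoid" and "split") and that captions_val2014.json follows the COCO caption layout (an "images" list whose entries carry "id"). These field names and the helper uncovered_ids are assumptions for illustration; adjust them if your files differ.

```python
import json


def uncovered_ids(dataset, annotations, split="test"):
    """Return the cocoids in `split` of `dataset` that `annotations` does not list."""
    annotated = set(img["id"] for img in annotations["images"])
    return sorted(
        img["cocoid"]
        for img in dataset["images"]
        if img["split"] == split and img["cocoid"] not in annotated
    )


def load_json(path):
    """Read a JSON file from disk."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    # Paths assume the repository root as the working directory.
    dataset = load_json("data/dataset_coco.json")
    annotations = load_json("coco-caption/annotations/captions_val2014.json")
    print("test images without annotations: %d" % len(uncovered_ids(dataset, annotations)))
```

If the count is not zero, the annotation file in coco-caption/annotations is probably still the original one.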

Pre-computed feature

In this paper, the retrieval model uses the outputs of the last layer of ResNet-101. The captioning model uses the bottom-up features from https://arxiv.org/abs/1707.07998.

The features can be downloaded from the same link; decompress them into data/cocotalk_fc and data/cocobu_att respectively.

Pretrained models

Download the pretrained models from link. Decompress them into the root folder of the repository.

To evaluate a pretrained model, run:

bash eval.sh att_d1 test

The pretrained models can match the results shown in the paper.

Train on your own

Preprocessing

Preprocess the captions (skip if you already have 'cocotalk.json' and 'cocotalk_label.h5'):

python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

Preprocess for self-critical training:

python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train

Start training

First, train a retrieval model:

bash run_fc_con.sh

Second, pretrain the captioning model:

bash run_att.sh

Third, finetune the captioning model with CIDEr + discriminability optimization:

bash run_att_d.sh 1

The argument 1 is the discriminability weight; it can be changed to other values.
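To illustrate what the weight argument controls, here is a minimal sketch of mixing a CIDEr score with a discriminability score into one reward per caption. It is not the repository's implementation: the function name, the plain-list inputs, and the assumption that a higher discriminability score is better are all illustrative, and the actual training code may scale or sign the terms differently.

```python
def combined_reward(cider_scores, disc_scores, disc_weight=1.0):
    """Mix two per-caption scores into one reward: cider + disc_weight * disc."""
    if len(cider_scores) != len(disc_scores):
        raise ValueError("both score lists must have one entry per caption")
    return [c + disc_weight * d for c, d in zip(cider_scores, disc_scores)]


if __name__ == "__main__":
    # Made-up scores for two captions, only to show the effect of the weight.
    cider = [1.0, 0.5]
    disc = [0.25, 0.75]
    for weight in (0.0, 1.0, 2.0):
        print("weight=%s reward=%s" % (weight, combined_reward(cider, disc, weight)))
```

With a weight of 0 the reward in this sketch reduces to CIDEr alone; larger weights let the discriminability term dominate the ranking of captions.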

Evaluate

bash eval.sh att_d1 test

Citation

If you find this useful, please consider citing:

@InProceedings{Luo_2018_CVPR,
author = {Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},
title = {Discriminability Objective for Training Descriptive Captions},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2018}
}

Acknowledgements

The code is based on ImageCaptioning.pytorch