The program requires the following dependencies:
We use the COCO Caption Evaluation library, which relies on the Stanford CoreNLP 3.6.0 toolset. Download the models and add the library to PYTHONPATH:
cd external/coco-caption
./get_stanford_models.sh
# return to the repository root so the relative PYTHONPATH below resolves
cd ../..
export PYTHONPATH=./external/coco-caption
Pre-process the UC Merced images and captions:
./preprocess_captions.sh uc-merced
./preprocess_images.sh uc-merced
Add or replace files in fairseq 0.9.0 with those provided under fairseq in this repository.
The hyperparameters below are only an example and need to be tuned.
python -m fairseq_cli.train \
--save-dir .checkpoints \
--user-dir task \
--task captioning \
--arch default-captioning-arch \
--encoder-layers 3 \
--decoder-layers 6 \
--features obj \
--feature-spatial-encoding \
--optimizer adam \
--adam-betas "(0.9,0.999)" \
--lr 0.0003 \
--lr-scheduler inverse_sqrt \
--min-lr 1e-09 \
--warmup-init-lr 1e-8 \
--warmup-updates 8000 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--weight-decay 0.0001 \
--dropout 0.3 \
--max-epoch 25 \
--max-tokens 4096 \
--max-source-positions 100 \
--encoder-embed-dim 512 \
--num-workers 2
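To get a feel for what `--lr`, `--warmup-init-lr` and `--warmup-updates` do together, the sketch below reproduces the shape of an inverse square-root schedule with linear warmup, as we understand fairseq's `inverse_sqrt` scheduler to behave. The function name `inverse_sqrt_lr` is ours and is not part of fairseq; the default values are copied from the command above.

```python
def inverse_sqrt_lr(step, peak_lr=3e-4, warmup_init_lr=1e-8, warmup_updates=8000):
    """Approximate learning rate after `step` optimizer updates (step >= 1).

    Illustrative helper, not a fairseq API.
    """
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr towards peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # Inverse square-root decay; equals peak_lr exactly at step == warmup_updates.
    return peak_lr * (warmup_updates / step) ** 0.5


if __name__ == "__main__":
    for step in (1, 4000, 8000, 32000):
        print(step, inverse_sqrt_lr(step))
```

With these settings the learning rate peaks at 0.0003 after 8000 updates and halves every time the number of updates quadruples, so a short run of 25 epochs on a small dataset may spend most of its time in the warmup phase. This is one reason the warmup length is worth tuning.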
To generate captions for images in the test split:
python generate.py \
--user-dir task \
--features grid \
--tokenizer moses \
--bpe subword_nmt \
--bpe-codes output/codes.txt \
--beam 5 \
--split test \
--path .checkpoints-scst/checkpoint24.pt \
--input output/test-ids.txt \
--output output/test-predictions.json \
--output_l output/test-labels-preds.csv
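Before scoring, it can help to sanity-check the generated caption file. The sketch below assumes that the predictions file is a JSON list of objects with `image_id` and `caption` keys (the COCO results layout); this layout is an assumption and is not stated in this README, so adjust the keys if your file differs. The helper name `check_predictions` is hypothetical.

```python
import json
from collections import Counter


def check_predictions(path):
    """Return a list of human-readable problems found in a predictions file.

    Assumes (not confirmed by this README) a JSON list of
    {"image_id": ..., "caption": ...} objects.
    """
    with open(path, encoding="utf-8") as f:
        predictions = json.load(f)

    problems = []
    for i, entry in enumerate(predictions):
        if not isinstance(entry, dict) or "image_id" not in entry or "caption" not in entry:
            problems.append(f"entry {i}: missing 'image_id' or 'caption'")
        elif not str(entry["caption"]).strip():
            problems.append(f"entry {i}: empty caption")

    # More than one caption per image usually indicates a generation problem.
    ids = Counter(
        e["image_id"] for e in predictions if isinstance(e, dict) and "image_id" in e
    )
    for image_id, count in ids.items():
        if count > 1:
            problems.append(f"image_id {image_id!r} appears {count} times")
    return problems


if __name__ == "__main__":
    for problem in check_predictions("output/test-predictions.json"):
        print(problem)
```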
The following example calculates metrics for captions contained in `output/test-predictions.json`.
./score.sh \
--reference-captions external/coco-caption/annotations/captions_val2014.json \
--system-captions output/test-predictions.json
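The COCO caption evaluation toolkit reports n-gram overlap metrics such as BLEU alongside METEOR, ROUGE-L and CIDEr. As a rough intuition for what these scores reward, the sketch below computes clipped unigram precision, the building block of BLEU-1, without the brevity penalty. It is only an illustration and not a replacement for `score.sh`; the tokenizer is a deliberately naive assumption, and the function names are ours.

```python
from collections import Counter


def tokenize(caption):
    # Naive tokenizer (assumption): lowercase, drop periods, split on whitespace.
    return caption.lower().replace(".", " ").split()


def clipped_unigram_precision(candidate, references):
    """Fraction of candidate tokens that also occur in a reference,
    with each token's count clipped to its maximum count in any reference."""
    cand_counts = Counter(tokenize(candidate))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for token, count in Counter(tokenize(ref)).items():
            max_ref[token] = max(max_ref[token], count)
    clipped = sum(min(count, max_ref[token]) for token, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


if __name__ == "__main__":
    refs = ["Two airplanes are stopped at the airport ."]
    print(clipped_unigram_precision("an airplane is stopped at the airport.", refs))
    print(clipped_unigram_precision("two airplanes are stopped at the airport.", refs))
```

On the airport example from the tables in this README, the caption that gets the object count wrong matches 4 of its 7 tokens, while the caption that gets it right matches all 7.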
The following example calculates metrics for labels contained in `output/test-labels-preds.csv`.
python score_label.py \
--reference-captions output/label_preds.csv \
--system-captions output/test-labels-preds.csv
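This README does not state which metrics `score_label.py` reports or how the CSV files are laid out. As a generic illustration of multi-label scoring, the sketch below computes micro-averaged precision, recall and F1 from label sets that are already loaded into dictionaries. The function name `micro_prf`, the image ids and the label names are all hypothetical.

```python
def micro_prf(reference, predicted):
    """Micro-averaged precision, recall and F1 over per-image label sets.

    `reference` and `predicted` map an image id to a set of labels.
    Images missing from `predicted` count as having no predicted labels;
    predictions for images absent from `reference` are ignored.
    """
    tp = fp = fn = 0
    for image_id, ref_labels in reference.items():
        pred_labels = predicted.get(image_id, set())
        tp += len(ref_labels & pred_labels)   # correctly predicted labels
        fp += len(pred_labels - ref_labels)   # predicted but not in reference
        fn += len(ref_labels - pred_labels)   # in reference but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    reference = {"img1": {"airplane", "pavement"}, "img2": {"court", "trees", "grass"}}
    predicted = {"img1": {"airplane"}, "img2": {"court", "trees", "cars"}}
    print(micro_prf(reference, predicted))
```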
The trained multi-task model for image captioning with multi-label classification can be downloaded from here
Ground truth caption | Caption w/o multi-label | Caption with multi-label |
---|---|---|
This is a part of a golf course with green turfs and some bunkers and trees. | green turfs and some bunkers and withered trees in the golf course. | this is a part of a golf course with green turfs and some bunkers and trees. |
There are two tennis courts arranged neatly and surrounded by some plants. | four tennis courts arranged neatly with some plants surrounded. | there are two tennis courts arranged neatly and surrounded by some plants. |
Two straight freeways parallel forward with some cars on them. | some cars are on the freeways. | two straight freeways closed to each other with some cars on them. |
Two airplanes are stopped at the airport. | an airplane is stopped at the airport. | two airplanes are stopped at the airport. |
Many mobile homes are closed to each other with some cars parked at the roadside in the mobile home park. | lots of mobile homes with plants surrounded in the mobile home park. | many houses arranged neatly with plants surrounded in the medium residential area. |
An intersection with a road cross over the other roads. | an overpass go across the roads diagonally with lawn surounded. | an overpass with a road go across another roads diagonally with some cars on the roads. |
Ground truth caption | Caption with angle prediction | Caption with reconstruction |
---|---|---|
This is a part of a golf course with green turfs and some bunkers and trees. | a part of a golf course with green turfs and some bunkers and a trail cross the turfs. | this is a part of a golf course with green turfs and some trees. |
There are two tennis courts arranged neatly and surrounded by some plants. | there are six tennis courts arranged neatly and surrounded by some buildings. | this is a sparse residential area with a villa surrounded by trees. |
Two straight freeways parallel forward with some cars on them. | two straight freeways with some cars on them. | an overpass with a road go across another roads diagonally with some cars on the roads. |
Two airplanes are stopped at the airport. | it is a purple airplane stopped at the airport. | an airplane is stopped at the airport and the ground is dark. |
Many mobile homes are closed to each other with some cars parked at the roadside in the mobile home park. | many houses arranged in lines in the dense residential area. | lots of mobile homes with plants surrounded in the mobile home park. |
An intersection with a road cross over the other roads. | an overpass go across the roads with some cars on the roads. | an overpass with a road go across another roads diagonally with some cars on it. |
The codebase is inspired by https://github.com/krasserm/fairseq-image-captioning.
If you find this code useful for your research, please cite our paper:
@article{kandala2022exploring,
title={Exploring Transformer and multi-label classification for remote sensing image captioning},
author={Kandala, Hitesh and Saha, Sudipan and Banerjee, Biplab and Zhu, Xiao Xiang},
journal={IEEE Geoscience and Remote Sensing Letters},
year={2022},
publisher={IEEE}
}