This is a toy project I made to start learning TensorFlow.
I started learning Torch by working through neuraltalk2, so I started my TensorFlow journey with the same project.
I think this project is a good fit for those who are already familiar with neuraltalk2 in Torch, because the main pipeline is almost the same. I don't know whether it is a good tutorial for learning TensorFlow, because the comments are still limited so far.
Without finetuning VGG, my code gives a CIDEr score of ~0.65 on the validation set (in 50,000 iterations).
Currently, if you want to use my code, you need to train the model from scratch (except for VGG-16).
Python 2.7
TensorFlow 1.0; please follow the TensorFlow website to install it.
(Copied from neuraltalk2)
Great, first we need to do some preprocessing. Head over to the coco/ folder and run the IPython notebook to download the dataset and do some very simple preprocessing. The notebook will combine the train/val data together and create a very simple and small json file that contains a large list of image paths and raw captions for each image, of the form:
[{ "file_path": "path/img.jpg", "captions": ["a caption", "a second caption of i"tgit ...] }, ...]
Once we have this, we're ready to invoke the prepro.py script, which will read all of this in and create a dataset (an hdf5 file and a json file) ready for consumption by the training code. For example, for MS COCO we can run the prepro script as follows:
$ python prepro.py --input_json coco/coco_raw.json --num_val 5000 --num_test 5000 --images_root coco/images --word_count_threshold 5 --output_json coco/cocotalk.json --output_h5 coco/cocotalk.h5
This is telling the script to read in all the data (the images and the captions), allocate 5000 images each for the val and test splits, and map all words that occur <= 5 times to a special UNK token. The resulting json and h5 files are about 30GB and contain everything we want to know about the dataset.
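To get a quick feel for what prepro.py produced, you can list the contents of the two files; this sketch makes no assumptions about the exact dataset names and just prints whatever is there:

```python
import json
import h5py

# Print every dataset stored in the h5 file along with its shape and dtype.
with h5py.File("coco/cocotalk.h5", "r") as f:
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)

# Print the top-level keys of the json file (vocabulary, image info, ...).
with open("coco/cocotalk.json") as f:
    info = json.load(f)
print(list(info.keys()))
```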
Warning: the prepro script will fail with the default MSCOCO data because one of their images is corrupted. See this issue for the fix; it involves manually replacing one image in the dataset.
(End of copy.)
Note that the split used here cannot be used for research. You can email me to ask for the preprocessing code for the COCO "standard" split, or you can modify the code yourself if you are familiar with it.
~~Download or generate a TensorFlow version of the pretrained VGG-16: tensorflow-vgg16.~~
I borrow machrisaa/tensorflow-vgg with some modifications: a training flag is added to control the evaluation and training mode of the model (in principle it controls the dropout probability). You need to download the npy weight file of VGG, either vgg16 or vgg19. Put the file somewhere (e.g. a models directory), and we're ready to train!
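As an aside, here is a minimal sketch of what such a training flag amounts to in TF 1.x terms; the names and the exact mechanism are illustrative, not necessarily what the modified tensorflow-vgg code does:

```python
import tensorflow as tf

# Illustrative only: a boolean placeholder switches dropout on for training
# and off for evaluation, which is essentially what the flag controls.
training = tf.placeholder(tf.bool, name="training")
fc7 = tf.ones([1, 4096])  # stand-in for a VGG fully-connected activation

fc7_drop = tf.cond(training,
                   lambda: tf.nn.dropout(fc7, keep_prob=0.5),
                   lambda: fc7)

with tf.Session() as sess:
    print(sess.run(fc7_drop, feed_dict={training: True}))   # dropout applied
    print(sess.run(fc7_drop, feed_dict={training: False}))  # dropout skipped
```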
$ python train.py --input_json coco/cocotalk.json --input_h5 coco/cocotalk.h5 --checkpoint_path ./log --save_checkpoint_every 2000 --val_images_use 3200
The train script will take over, and start dumping checkpoints into the folder specified by checkpoint_path (default = current folder). For more options, see opts.py.
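When you later need to point the eval script at one of these checkpoints, tf.train.latest_checkpoint is a convenient way to find the newest one; this is generic TF 1.x usage rather than something the scripts require:

```python
import tensorflow as tf

# Find the most recently saved checkpoint in the log directory.
ckpt = tf.train.latest_checkpoint("./log")
print(ckpt)  # e.g. ./log/model.ckpt-50000; pass this path to eval.py --model
```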
If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross entropy loss, use the --language_eval 1 option, but don't forget to download the coco-caption code into the coco-caption directory.
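If you don't already have it, the coco-caption code is commonly obtained from the tylin/coco-caption repository, e.g.:
$ git clone https://github.com/tylin/coco-caption.git coco-caption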
A few notes on training. To give you an idea, with the default settings one epoch of MS COCO images is about 7500 iterations. One epoch of training (with no finetuning; note this is the default) takes about 45 minutes and results in a validation loss of ~2.7 and a CIDEr score of ~0.5. By iteration 50,000 CIDEr climbs up to about 0.65 (validation loss at about 2.4).
Now place all your images of interest into a folder, e.g. blah, and run the eval script:
$ python eval.py --model model.ckpt-**** --infos_path infos_<id>.pkl --image_folder <image_folder> --num_images 10
This tells the eval script to run up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing batch_size (default = 1). Use --num_images -1 to process all images. The eval script will create a vis.json file inside the vis folder, which can then be visualized with the provided HTML interface:
$ cd vis
$ python -m SimpleHTTPServer
Now visit localhost:8000 in your browser and you should see your predicted captions.
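If you'd rather skip the browser, the predictions can also be read directly from the generated json; treat the field names below as assumptions to check against your own vis.json, since the exact layout may differ:

```python
import json

# Assumed layout: a list of {"image_id": ..., "caption": ...} entries.
# Adjust the key names if your vis/vis.json stores them differently.
with open("vis/vis.json") as f:
    preds = json.load(f)

for p in preds[:10]:
    print(p.get("image_id"), p.get("caption"))
```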
Beam Search. Beam search is enabled by default because it improves the search for the argmax decoding sequence. However, it is a little more expensive, so if you'd like to evaluate images faster at a cost of performance, use --beam_size 1. For example, in one of my experiments beam size 2 gives CIDEr 0.922, and beam size 1 gives CIDEr 0.886.
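Concretely, disabling beam search for the folder evaluation above just means adding the flag:
$ python eval.py --model model.ckpt-**** --infos_path infos_<id>.pkl --image_folder <image_folder> --num_images 10 --beam_size 1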
Running on MSCOCO images. If you trained on MSCOCO (see above), you will have generated preprocessed MSCOCO images, which you can use directly in the eval script. In this case, simply leave out the image_folder option and instead pass the input_h5 and input_json options pointing to your preprocessed files.
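Putting it together, an evaluation run over the preprocessed MSCOCO data (with language metrics enabled) might look like this:
$ python eval.py --model model.ckpt-**** --infos_path infos_<id>.pkl --input_h5 coco/cocotalk.h5 --input_json coco/cocotalk.json --num_images -1 --language_eval 1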
I learned a lot from the following repositories.