Can you provide more details about your setup?
I ran the experiment on the COCO dataset. I prepared the data according to the instructions, and the split is the same. For the images, I upsampled the short side of each image to 256 and took the center 224 x 224 patch.
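For reference, here is a minimal sketch of that preprocessing using PIL; the function name and defaults are mine for illustration, not code from the repo:

```python
from PIL import Image

def resize_and_center_crop(path, short_side=256, crop=224):
    """Sketch: resize the short side to `short_side`, then take a central `crop` x `crop` patch."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    # scale so the shorter side becomes `short_side`, preserving aspect ratio
    if w < h:
        new_w, new_h = short_side, int(round(h * float(short_side) / w))
    else:
        new_w, new_h = int(round(w * float(short_side) / h)), short_side
    img = img.resize((new_w, new_h), Image.BILINEAR)
    # take the central crop
    left = (new_w - crop) // 2
    top = (new_h - crop) // 2
    return img.crop((left, top, left + crop, top + crop))
```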
I train using the default configuration in evaluate_coco.py. Training takes one day, rather than the three days mentioned in the paper, because it early-stops at epoch 17; the best validation score is 27.65. Captions are generated with the default configuration in generate_caps.py as well. Here are the metrics for the COCO val split:
The number of references is 5
{'reflen': 46642, 'guess': [46456, 41456, 36456, 31456], 'testlen': 46456, 'correct': [29418, 13584, 5770, 2454]}
ratio: 0.996012177865
{'CIDEr': 0.76125605567615984, 'Bleu_4': 0.22408281090267523, 'Bleu_3': 0.31895835186019617, 'Bleu_2': 0.4536981210368327, 'Bleu_1': 0.630714052536524, 'ROUGE_L': 0.44350833993571775, 'METEOR': 0.21335981552766428}
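The output above looks like what the coco-caption evaluation code (pycocoevalcap) prints. A minimal sketch of scoring captions with it, where the image id and captions are made up and real inputs are tokenized, lower-cased strings keyed by image id:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # requires Java on the PATH

gts = {1: ['a man riding a horse on the beach',       # references: COCO has 5 per image,
           'a person rides a horse near the ocean']}  # two shown here for brevity
res = {1: ['a man on a horse at the beach']}          # exactly one generated caption

for scorer, names in [(Bleu(4), ['Bleu_1', 'Bleu_2', 'Bleu_3', 'Bleu_4']),
                      (Meteor(), 'METEOR'),
                      (Rouge(), 'ROUGE_L'),
                      (Cider(), 'CIDEr')]:
    score, _ = scorer.compute_score(gts, res)
    print(names, score)
```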
That time was meant as a conservative upper bound (I think we said less than). There have been a lot of speed ups merged into Theano since we did those experiments 5 months ago. I suspect if you run it again in a few months it will be even faster.
One thing the code doesn't do by default, which we talk about in our paper, is early stopping on BLEU; the current setup uses early stopping on NLL. I suspect that if you use the model-selection criterion we mention, it will help close that gap.
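A minimal sketch of what early stopping on validation BLEU could look like; `train_one_epoch`, `bleu4_on_val`, `get_params`, and `set_params` are hypothetical hooks standing in for the training loop and the generate_caps.py + scoring steps, not code from this repo:

```python
def train_with_bleu_early_stopping(model, train_one_epoch, bleu4_on_val,
                                   max_epochs=50, patience=5):
    """Keep the parameters of the epoch with the best validation BLEU-4."""
    best_bleu, best_params, bad_epochs = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        bleu4 = bleu4_on_val(model)        # generate val captions, score BLEU-4
        if bleu4 > best_bleu:
            best_bleu, best_params, bad_epochs = bleu4, model.get_params(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:     # no improvement for `patience` epochs
                break
    if best_params is not None:
        model.set_params(best_params)      # roll back to the best-BLEU checkpoint
    return best_bleu
```

The point is simply to keep the checkpoint with the best validation BLEU-4 rather than the lowest NLL.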
Let me know if that answers your questions.
Thanks for the comments. I understand the difference between NLL and BLEU score. I tried a model from an earlier epoch that has a larger NLL, and its BLEU score is higher than that of the best model (by NLL), but the difference is not significant.
Great, I'll just close this. If you have any further questions, don't hesitate to post them back here.
@zcyang would you mind sharing which GPU you are using?
@zcyang On the COCO validation split, with soft attention (deterministic), I don't get results as high as yours. For example, I get CIDEr: 0.2731, BLEU-1: 0.5446, METEOR: 0.1634. My final cost is 31.37. I'm trying to understand what is wrong: the features, the dictionary, the way the metrics are computed...
I ran the default configuration; it seems there is a moderate gap with respect to the results reported in the paper?