kelvinxu / arctic-captions


cannot replicate the result in the paper? #5

Closed zcyang closed 9 years ago

zcyang commented 9 years ago

I ran the default configuration, and there seems to be a moderate gap between my results and those reported in the paper. Is that expected?

kelvinxu commented 9 years ago

Can you provide more details about your setup?

zcyang commented 9 years ago

I ran the experiment on the COCO dataset. I prepared the data according to the instructions, and the split is the same. For each image, I upsampled the short side to 256 and took the center 224 x 224 patch.
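
For reference, here is a minimal sketch of that preprocessing (assuming PIL; the 256/224 values are just the ones described above, not anything taken from the repository code):

```python
from PIL import Image

def preprocess(path, resize_short=256, crop=224):
    """Resize the shorter side to `resize_short`, then take the center crop."""
    im = Image.open(path).convert('RGB')
    w, h = im.size
    # Scale so the shorter side becomes `resize_short`, keeping the aspect ratio.
    if w < h:
        new_w, new_h = resize_short, int(round(h * resize_short / float(w)))
    else:
        new_w, new_h = int(round(w * resize_short / float(h))), resize_short
    im = im.resize((new_w, new_h), Image.BILINEAR)
    # Take the center `crop` x `crop` patch.
    left = (new_w - crop) // 2
    top = (new_h - crop) // 2
    return im.crop((left, top, left + crop, top + crop))
```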

I trained using the default configuration in evaluate_coco.py. It took one day to train rather than the three days in the paper, because training early-stops at epoch 17; the best validation score is 27.65. The captions were generated using the default configuration in generate_caps.py as well. Here are the metrics for the COCO val split:

The number of references is 5.
Length stats: reflen 46642, testlen 46456, ratio 0.996012177865; guess [46456, 41456, 36456, 31456], correct [29418, 13584, 5770, 2454]
Bleu_1: 0.630714052536524
Bleu_2: 0.4536981210368327
Bleu_3: 0.31895835186019617
Bleu_4: 0.22408281090267523
METEOR: 0.21335981552766428
ROUGE_L: 0.44350833993571775
CIDEr: 0.76125605567615984
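
In case it helps with comparisons, here is a rough sketch of how such numbers can be computed with the coco-caption (pycocoevalcap) scorers. It assumes pycocoevalcap is installed and that `gts` / `res` are dicts mapping image ids to lists of tokenized caption strings; it is not the exact evaluation code from this repo:

```python
# Sketch only: assumes the coco-caption package (pycocoevalcap) is installed
# and that gts / res are dicts of the form {image_id: [caption string, ...]}.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def score_captions(gts, res):
    scores = {}
    bleu, _ = Bleu(4).compute_score(gts, res)   # returns [Bleu_1, ..., Bleu_4]
    for i, b in enumerate(bleu):
        scores['Bleu_%d' % (i + 1)] = b
    scores['METEOR'], _ = Meteor().compute_score(gts, res)
    scores['ROUGE_L'], _ = Rouge().compute_score(gts, res)
    scores['CIDEr'], _ = Cider().compute_score(gts, res)
    return scores
```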

kelvinxu commented 9 years ago

That time was meant as a conservative upper bound (I think we said less than). There have been a lot of speed ups merged into Theano since we did those experiments 5 months ago. I suspect if you run it again in a few months it will be even faster.

One thing the code doesn't do by default, which we talk about in the paper, is early stopping on BLEU; the current setup early-stops on NLL. I suspect that using the model selection criterion we mention will help close that gap.
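
For anyone following along, here is a rough sketch of what BLEU-based model selection could look like. None of these function names come from this repository; they are placeholders for whatever training, sampling, and scoring routines you already have:

```python
# Sketch of model selection / early stopping on validation BLEU instead of NLL.
# All arguments are placeholders, not functions from this repo.
def train_with_bleu_early_stopping(train_one_epoch, generate_captions, bleu4,
                                   save_params, valid_images, valid_refs,
                                   max_epochs=50, patience=10):
    best_bleu, bad_counter = -1.0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                        # usual NLL-based update pass
        hyps = generate_captions(valid_images)   # e.g. beam search on the val set
        bleu = bleu4(hyps, valid_refs)           # e.g. Bleu(4) from pycocoevalcap
        if bleu > best_bleu:                     # keep the BLEU-best parameters
            best_bleu = bleu
            save_params('best_by_bleu.npz')
            bad_counter = 0
        else:
            bad_counter += 1
            if bad_counter > patience:           # stop when BLEU stops improving
                break
    return best_bleu
```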

Let me know if that answers your questions.

zcyang commented 9 years ago

Thanks for the comments. I see the difference between NLL and BLEU now. I tried a model from an earlier epoch with a larger NLL, and its BLEU score is higher than the best model's (by NLL), but the difference is not significant.

kelvinxu commented 9 years ago

Great, I'll just close this. If you have any further questions, don't hesitate to post them back here.

xlhdh commented 9 years ago

@zcyang would you mind sharing which GPU you are using?

frajem commented 8 years ago

@zcyang On the COCO validation split with soft (deterministic) attention, I don't get results as high as yours. For example, I get CIDEr: 0.2731, BLEU-1: 0.5446, METEOR: 0.1634. My final cost is 31.37. I'm trying to understand what is wrong: the features, the dictionary, the way the metrics are computed...
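
One quick sanity check I'm considering for the metric computation itself (a sketch, assuming the pycocoevalcap scorers and five references per image; `refs` is a hypothetical dict, not something from this repo): score the first human reference against the other four. If the tokenization and scorers are set up correctly, these "human" numbers should come out well above a weak model's.

```python
# Sanity check for the scoring pipeline: evaluate the first human reference
# against the remaining four. `refs` is assumed to be
# {image_id: [ref1, ref2, ref3, ref4, ref5]} with tokenized caption strings.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

def human_baseline(refs):
    gts = {k: v[1:] for k, v in refs.items()}    # four references as ground truth
    res = {k: [v[0]] for k, v in refs.items()}   # first reference as the "hypothesis"
    bleu, _ = Bleu(4).compute_score(gts, res)
    cider, _ = Cider().compute_score(gts, res)
    return bleu, cider
```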