Hi, yes, maybe; I'm not sure. But the difference seems quite minor.
Hi Pooja, when reporting the CIDEr score, do you report the value you obtain from NLGEval directly, or do you multiply it by a factor? I am using a different dataset and got a very different value, so I was not sure how to report it.
I just multiply it by a hundred. That's how I've seen it reported in other papers.
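In case it helps, here is a minimal sketch of that convention, assuming nlg-eval's class-based API (the captions and option flags below are placeholders; check your installed version's README for the exact reference layout):

```python
from nlgeval import NLGEval

# Assumed nlg-eval usage; skip the embedding-based metrics for speed.
nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)

hypotheses = ['a man is riding a horse on the beach']
# One inner list per reference stream, each the same length as `hypotheses`
# (mirroring nlg-eval's ref1.txt / ref2.txt file layout).
references = [['a man riding a horse on a beach'],
              ['a person rides a horse near the ocean']]

metrics = nlgeval.compute_metrics(references, hypotheses)

# nlg-eval returns raw scores (CIDEr, BLEU, ...); papers usually report them x100.
print({name: 100 * value for name, value in metrics.items()})
```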
Hi Pooja, I'm implementing the same project but I'm getting poor results (i.e., ~23.5 BLEU-4). I compared my implementation with yours and the code is almost the same. However, I can't download the bottom-up features (huge file, slow connection), so I'm extracting them myself as suggested in the original paper (but without training on the other datasets). Do you think this huge difference in the scores could come from the image features? Did you face any issues while implementing this project before getting the amazing results you describe in the README?
Thanks!
Hi, that is really surprising. I would expect small differences in the scores, but that is huge. It might be, as you said, because of the different bottom-up features, since they used the Visual Genome dataset for training. Based on your results, it seems the specific bottom-up features really do make a huge difference.
I do not remember facing any issues like that... I only remember that the BLEU-4 score degraded after passing ~30 epochs.
Thank you so much for your answer! I'll try to figure it out.
Hi. I am adapting your code to report scores for the Bottom-Up and Top-Down paper. However, when I calculate the scores using the COCO caption toolkit, there is some difference (especially in the CIDEr score) between the ones you reported and the ones I got. Attached below is the json file with the captions generated on the Karpathy test images, using the trained model you provided. I suppose the problem is with the evaluation toolkit you used?
bottom_up_test.zip
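For reference, this is roughly how one can score such a captions json with the COCO caption toolkit (a sketch assuming the pycocoevalcap packaging; the file paths are placeholders, and the results file must be in the standard COCO results format `[{"image_id": ..., "caption": ...}, ...]`):

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder paths; point these at your ground-truth annotations and generated captions.
annotation_file = 'annotations/captions_val2014.json'
results_file = 'bottom_up_test.json'

coco = COCO(annotation_file)
coco_res = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_res)
# Evaluate only on the images present in the results file (e.g., the Karpathy test split).
coco_eval.params['image_id'] = coco_res.getImgIds()
coco_eval.evaluate()

# Scores come out as fractions; multiply by 100 when reporting.
for metric, score in coco_eval.eval.items():
    print(f'{metric}: {100 * score:.2f}')
```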