Hi, yes, maybe; I'm not sure. But the difference seems quite minor.
Hi Pooja, when reporting the CIDEr score, do you report the value you obtain from NLGEval directly, or do you multiply it by a factor? I am using a different dataset and got a very different value, so I was not sure how to report it.
I just multiply it by a hundred. That's how I've seen it reported in other papers.
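In case it helps, here is a minimal sketch of that convention, assuming nlg-eval's class-based API (the captions and option flags below are placeholders; check your installed version's README for the exact reference layout):

```python
from nlgeval import NLGEval

# Assumed nlg-eval usage; skip the embedding-based metrics for speed.
nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)

hypotheses = ['a man is riding a horse on the beach']
# One inner list per reference stream, each the same length as `hypotheses`
# (mirroring nlg-eval's ref1.txt / ref2.txt file layout).
references = [['a man riding a horse on a beach'],
              ['a person rides a horse near the ocean']]

metrics = nlgeval.compute_metrics(references, hypotheses)

# nlg-eval returns raw scores (CIDEr, BLEU, ...); papers usually report them x100.
print({name: 100 * value for name, value in metrics.items()})
```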
Hi Pooja, I'm implementing the same project but I'm getting poor results (i.e., ~23.5 BLEU-4). I compared my implementation with yours and the code is almost the same. However, I can't download the bottom-up features (huge file, slow connection), so I'm extracting them myself as suggested in the original paper (but without training on the other datasets). Do you think this huge difference in the scores could come from the image features? Did you face any issues while implementing this project before getting the amazing results you describe in the README?
Thanks!
Hi, that is really surprising. I would expect small differences in the scores, but that is huge. It might be, as you said, because of the different bottom-up features, since they used the Visual Genome dataset for training. Based on your results, it seems the specific bottom-up features really do make a huge difference.
I do not remember facing any issues like that... I only remember that the BLEU-4 score degraded after passing ~30 epochs.
Thank you so much for your answer! I'll try to figure it out.
Hi. I am adapting your code to report scores for the Bottom-Up and Top-Down paper. However, when I calculate the scores using the COCO caption toolkit, there is some difference (especially in the CIDEr score) between the ones you reported and the ones I got. Attached below is the json file with the captions generated on the Karpathy test images, using the trained model you provided. I suppose the problem is with the evaluation toolkit you used?
bottom_up_test.zip
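For reference, this is roughly how one can score such a captions json with the COCO caption toolkit (a sketch assuming the pycocoevalcap packaging; the file paths are placeholders, and the results file must be in the standard COCO results format `[{"image_id": ..., "caption": ...}, ...]`):

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder paths; point these at your ground-truth annotations and generated captions.
annotation_file = 'annotations/captions_val2014.json'
results_file = 'bottom_up_test.json'

coco = COCO(annotation_file)
coco_res = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_res)
# Evaluate only on the images present in the results file (e.g., the Karpathy test split).
coco_eval.params['image_id'] = coco_res.getImgIds()
coco_eval.evaluate()

# Scores come out as fractions; multiply by 100 when reporting.
for metric, score in coco_eval.eval.items():
    print(f'{metric}: {100 * score:.2f}')
```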