lukemelas / image-paragraph-captioning

[EMNLP 2018] Training for Diversity in Image Paragraph Captioning
91 stars 23 forks

Cannot Replicate Results Described in Paper #8

Open arjung128 opened 5 years ago

arjung128 commented 5 years ago

I cloned the repo last month (before the most recent bug in the evaluation code was fixed), but I made the (one-line?) fix locally. I then trained a model from scratch and obtained the following results:

Here are my results alongside the numbers the paper claims to achieve:

| Setting | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|---|---|---|
| 25 xe / 25 sc epochs (as described in the paper) | 0.419 | 0.262 | 0.165 | 0.101 | 0.166 | 0.313 | 0.257 |
| 30 xe / 170 sc epochs (default in the repo) | 0.430 | 0.271 | 0.171 | 0.105 | 0.170 | 0.312 | 0.270 |
| Paper (25 xe / 25 sc epochs) | 43.54 | 27.44 | 17.33 | 10.58 | 17.86 | — | 30.63 |
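For what it's worth, the paper's numbers appear to be on a 0–100 scale while the evaluation script prints 0–1, so a direct comparison needs a factor of 100. A minimal sketch (metric names taken from the output above; the scale difference is my reading of the numbers, not something stated in the repo):

```python
# Reproduced scores (0-1 scale, as printed by the eval script) vs.
# paper-reported scores (0-100 scale). Multiplying by 100 puts them
# on the same scale for comparison.
repro = {"Bleu_1": 0.419, "Bleu_4": 0.101, "METEOR": 0.166, "CIDEr": 0.257}
paper = {"Bleu_1": 43.54, "Bleu_4": 10.58, "METEOR": 17.86, "CIDEr": 30.63}

for metric, score in repro.items():
    gap = score * 100 - paper[metric]
    print(f"{metric}: reproduced {score * 100:.2f} vs paper {paper[metric]:.2f} (gap {gap:+.2f})")
```

Read this way, the BLEU and METEOR gaps are small; CIDEr is the outlier.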

Any ideas about what might cause this discrepancy?

lukemelas commented 5 years ago

Hi @arjung128, thanks for the issue. The reinforcement learning stage of training is very sensitive to hyperparameters, and the default parameters in the repo (learning rate, etc.) are not optimal. In particular, the CIDEr score seems to have large run-to-run variance. That said, it's good to see that all your other metrics (BLEU-1, BLEU-4, METEOR) are very close to those reported in the paper.
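To make the "run-to-run variance" point concrete, one way to check it is to repeat the SC (self-critical) stage a few times with the same config and look at the spread of final CIDEr scores. A sketch with purely illustrative numbers (not actual measurements from this repo):

```python
import statistics

# Hypothetical final CIDEr scores (0-100 scale) from repeated SC runs
# with identical hyperparameters -- illustrative values only.
cider_runs = [25.7, 27.0, 30.6, 28.4]

mean = statistics.mean(cider_runs)
spread = max(cider_runs) - min(cider_runs)
print(f"mean CIDEr {mean:.3f}, run-to-run spread {spread:.2f}")
```

If the spread across identical runs is on the order of the gap to the paper's number, the discrepancy may be within normal training noise rather than a bug.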

Due to the sensitivity of the RL training, I've been developing another version of this repo based on a different approach to paragraph captioning. It should be much more stable while still giving good results.

I'll release it soon (have to go through some conference submission stuff first). Sorry to keep you waiting in the meantime!