lukemelas / image-paragraph-captioning

[EMNLP 2018] Training for Diversity in Image Paragraph Captioning
91 stars 23 forks

Cannot Replicate Results Described in Paper #8

Open arjung128 opened 5 years ago

arjung128 commented 5 years ago

I cloned the repo last month (before the most recent bug in the evaluation code was fixed), but I made the (one-line?) fix locally. I then trained a model from scratch and obtained the following results:

Here are my results alongside the numbers the paper claims to achieve:

| Setting | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|---|---|---|
| 25 xe / 25 sc epochs (as described in the paper) | 0.419 | 0.262 | 0.165 | 0.101 | 0.166 | 0.313 | 0.257 |
| 30 xe / 170 sc epochs (default in the repo) | 0.430 | 0.271 | 0.171 | 0.105 | 0.170 | 0.312 | 0.270 |
| Paper (25 xe / 25 sc epochs) | 43.54 | 27.44 | 17.33 | 10.58 | 17.86 | — | 30.63 |
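For what it's worth, the paper's numbers appear to be on a 0–100 scale while the evaluation script prints 0–1, so a direct comparison needs a factor of 100. A minimal sketch (metric names taken from the output above; the scale difference is my reading of the numbers, not something stated in the repo):

```python
# Reproduced scores (0-1 scale, as printed by the eval script) vs.
# paper-reported scores (0-100 scale). Multiplying by 100 puts them
# on the same scale for comparison.
repro = {"Bleu_1": 0.419, "Bleu_4": 0.101, "METEOR": 0.166, "CIDEr": 0.257}
paper = {"Bleu_1": 43.54, "Bleu_4": 10.58, "METEOR": 17.86, "CIDEr": 30.63}

for metric, score in repro.items():
    gap = score * 100 - paper[metric]
    print(f"{metric}: reproduced {score * 100:.2f} vs paper {paper[metric]:.2f} (gap {gap:+.2f})")
```

Read this way, the BLEU and METEOR gaps are small; CIDEr is the outlier.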

Any ideas about what might cause this discrepancy?

lukemelas commented 5 years ago

Hi @arjung128, thanks for the issue. The reinforcement learning stage of training is very sensitive to hyperparameters, and the default parameters in the repo (learning rate, etc.) are not optimal. In particular, the CIDEr score seems to have large run-to-run variance. That said, it's good to see that all your other metrics (BLEU-1, BLEU-4, METEOR) are very close to those reported in the paper.
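To make the "run-to-run variance" point concrete, one way to check it is to repeat the SC (self-critical) stage a few times with the same config and look at the spread of final CIDEr scores. A sketch with purely illustrative numbers (not actual measurements from this repo):

```python
import statistics

# Hypothetical final CIDEr scores (0-100 scale) from repeated SC runs
# with identical hyperparameters -- illustrative values only.
cider_runs = [25.7, 27.0, 30.6, 28.4]

mean = statistics.mean(cider_runs)
spread = max(cider_runs) - min(cider_runs)
print(f"mean CIDEr {mean:.3f}, run-to-run spread {spread:.2f}")
```

If the spread across identical runs is on the order of the gap to the paper's number, the discrepancy may be within normal training noise rather than a bug.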

Due to the sensitivity of the RL training, I've been developing another version of this repo based on a different approach to paragraph captioning. It should be much more stable while still giving good results.

I'll release it soon (have to go through some conference submission stuff first). Sorry to keep you waiting in the meantime!