krasserm / fairseq-image-captioning

Transformer-based image captioning extension for pytorch/fairseq
Apache License 2.0

Performance reproduce #20

Closed Beanocean closed 4 years ago

Beanocean commented 4 years ago

After following the instructions in the Readme.md, the results I reproduced are as follows (I ran the experiment 3 times, and all runs produced the same performance):

Scores:
=======
Bleu_1: 0.735
Bleu_2: 0.573
Bleu_3: 0.440
Bleu_4: 0.338
METEOR: 0.281
ROUGE_L: 0.558
CIDEr: 1.116
SPICE: 0.211

There is a gap in the BLEU and CIDEr metrics. I conducted the experiments on two V100 GPUs.
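
For reference, the metric block above has the shape of the standard COCO caption evaluation output (coco-caption / pycocoevalcap). A minimal sketch of computing the same metric suite, assuming a predictions file in the usual COCO results format; the file paths are placeholders, not the repository's actual output locations:

```python
# Hedged sketch: score generated captions with the standard COCO caption evaluator.
# Assumes pycocotools and pycocoevalcap are installed; paths are placeholders.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("annotations/captions_val2014.json")      # reference captions
coco_res = coco.loadRes("predictions.json")           # generated captions, COCO results format
coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()   # score only the captioned images
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.3f}")                    # Bleu_1..4, METEOR, ROUGE_L, CIDEr, SPICE
```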

krasserm commented 4 years ago

Can you please share the results from both CE training and self-critical sequence training (incl. the corresponding checkpoint numbers)?

krasserm commented 4 years ago

Or are the numbers you reported from CE training only?

Beanocean commented 4 years ago

Or are the numbers you reported from CE training only?

These numbers are from CE training only; the best checkpoint was generated at epoch 23. I will report the sequence-level results after the training finishes.

krasserm commented 4 years ago

OK, in this case there is only a small gap, but I'll redo a CE training run and see if I can reproduce what you reported. Initially, I thought you had reported numbers from SCST, which would have been too large a gap.

krasserm commented 4 years ago

Here's what I get from the best checkpoint (20):

Scores:
=======
Bleu_1: 0.740
Bleu_2: 0.578
Bleu_3: 0.445
Bleu_4: 0.343
METEOR: 0.279
ROUGE_L: 0.558
CIDEr: 1.117
SPICE: 0.208

The BLEU scores are close to those in the docs; CIDEr is slightly lower. However, the CIDEr of checkpoint 21 is 1.122. There is variance across training runs, so that might explain the difference from the docs, but I'd need to run a statistical test to check whether it is significant. Compared to what you can achieve with SCST, these differences are still quite small. I'll wait for your SCST results before running my own ...
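
A minimal sketch of such a test, e.g. a paired bootstrap over per-image CIDEr scores (the coco-caption CIDEr scorer also returns per-image scores alongside the corpus score); the function and the score arrays are hypothetical, not part of the repository:

```python
# Hedged sketch: paired bootstrap over per-image CIDEr scores from two checkpoints.
# Estimates how often the observed mean difference could arise from resampling noise.
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of resamples in which checkpoint B does NOT beat checkpoint A."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    assert a.shape == b.shape                  # same validation images, same order
    n = len(a)
    not_better = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)       # resample image indices with replacement
        if b[idx].mean() <= a[idx].mean():
            not_better += 1
    return not_better / n_resamples            # small value => B reliably better than A

# Usage (arrays are hypothetical per-image CIDEr scores for checkpoints 20 and 21):
# p = paired_bootstrap(cider_per_image_ckpt20, cider_per_image_ckpt21)
```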

Beanocean commented 4 years ago

@krasserm Thanks for your help. I have started the sequence-level training, though it's really time-consuming. I will report the results asap.

krasserm commented 4 years ago

You could easily try increasing --max-sentences, e.g. from 5 to 10, to see if you get even better results. I set it to quite a low value because I ran my initial tests on smaller GPUs with "only" 8 GB of memory each.

Beanocean commented 4 years ago

After increasing the batch size to 10 sentences, the results I obtained are (I resumed the sequence-level training from checkpoint 20):

Bleu_1: 0.794
Bleu_2: 0.647
Bleu_3: 0.506
Bleu_4: 0.389
METEOR: 0.282
ROUGE_L: 0.584
CIDEr: 1.253

krasserm commented 4 years ago

Thanks for running SCST with a larger batch size. It seems it doesn't improve much over the documented --max-sentences 5 results (i.e. it is close to the numbers in the docs). Do you also have results for --max-sentences 5?

Beanocean commented 4 years ago

Hi @krasserm, the results for --max-sentences 5 are:

Bleu_2: 0.634
Bleu_3: 0.495
Bleu_4: 0.380
METEOR: 0.276
ROUGE_L: 0.581
CIDEr: 1.221