Closed: Beanocean closed this issue 4 years ago
Can you please share the results from both CE training and self-critical sequence training (incl. the corresponding checkpoint numbers)?
Or are the numbers you reported from CE training only?
The best checkpoint was generated at epoch 23. I will report the sequence-level results after the training process finishes.
OK, in that case there is only a small gap, but I'll redo a CE training run and see if I can reproduce what you reported. Initially, I thought you reported numbers from SCST, which would have been too large a gap.
Here's what I get from the best checkpoint (20):
Scores:
=======
Bleu_1: 0.740
Bleu_2: 0.578
Bleu_3: 0.445
Bleu_4: 0.343
METEOR: 0.279
ROUGE_L: 0.558
CIDEr: 1.117
SPICE: 0.208
BLEU scores are close to those in the docs; CIDEr is slightly lower. However, the CIDEr of checkpoint 21 is 1.122. There's variance across training runs, so that might explain the difference from the docs, but I'd need to do a statistical test to check whether it is significant. Compared to what you can achieve with SCST, these differences are still quite small. Waiting for your SCST results before running my own ...
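A quick way to check whether such run-to-run differences are significant is a simple permutation test over per-run scores. A minimal sketch; the score lists below are hypothetical placeholders, not real measurements from this thread:

```python
import random

def permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return count / n_resamples

# Hypothetical per-run CIDEr scores from repeated trainings (placeholders).
run_a = [1.117, 1.122, 1.110]
run_b = [1.125, 1.131, 1.120]
p = permutation_test(run_a, run_b)
print(f"p-value: {p:.3f}")
```

With only a handful of runs per setting, the test has little power, so a non-significant result would not be surprising here.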
@krasserm Thanks for your help. I have started the sequence-level training, but it's really time-consuming. I will report the results asap.
You could easily try increasing --max-sentences, e.g. from 5 to 10, to see if you get even better results. I set it to quite a low value because I ran initial tests on smaller GPUs with "only" 8 GB of memory each.
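For context, resuming SCST from a CE checkpoint with a larger batch might look roughly like this. This is only a sketch: the script name, checkpoint path, and criterion flag are assumptions, and only --max-sentences comes from this thread:

```shell
# Hypothetical resume command; adjust paths and flags to the repo's actual CLI.
python train.py \
  --restore-file checkpoints/checkpoint20.pt \
  --max-sentences 10 \
  --criterion self_critical_sequence_training
```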
After increasing the batch size to 10 sentences, the results I obtained are (I resumed the sequence-level training from checkpoint 20):
Bleu_1: 0.794
Bleu_2: 0.647
Bleu_3: 0.506
Bleu_4: 0.389
METEOR: 0.282
ROUGE_L: 0.584
CIDEr: 1.253
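For reference, the relative gains of this SCST run over the CE checkpoint-20 scores posted earlier can be computed directly. A small sketch using only the numbers from this thread:

```python
# Metric values copied from this thread: CE checkpoint 20 vs. SCST
# resumed from checkpoint 20 with --max-sentences 10.
ce = {"Bleu_4": 0.343, "METEOR": 0.279, "ROUGE_L": 0.558, "CIDEr": 1.117}
scst = {"Bleu_4": 0.389, "METEOR": 0.282, "ROUGE_L": 0.584, "CIDEr": 1.253}

for metric in ce:
    gain = 100 * (scst[metric] - ce[metric]) / ce[metric]
    print(f"{metric}: +{gain:.1f}%")
```

CIDEr improves by roughly 12% and Bleu_4 by roughly 13%, which is much larger than the checkpoint-to-checkpoint variance discussed above.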
Thanks for running SCST with a larger batch size; it seems it doesn't improve much over --max-sentences 5 (i.e. it is close to the documented numbers). Do you also have results for --max-sentences 5?
Hi @krasserm, the results for --max-sentences 5 are:
Bleu_2: 0.634
Bleu_3: 0.495
Bleu_4: 0.380
METEOR: 0.276
ROUGE_L: 0.581
CIDEr: 1.221
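Putting the two SCST runs side by side (numbers from this thread; Bleu_1 for the --max-sentences 5 run wasn't posted), the batch-size effect turns out to be small. A quick sketch:

```python
# SCST results from this thread: --max-sentences 5 vs. --max-sentences 10.
ms5 = {"Bleu_4": 0.380, "METEOR": 0.276, "ROUGE_L": 0.581, "CIDEr": 1.221}
ms10 = {"Bleu_4": 0.389, "METEOR": 0.282, "ROUGE_L": 0.584, "CIDEr": 1.253}

for metric in ms5:
    print(f"{metric}: {ms10[metric] - ms5[metric]:+.3f}")
```

The largest absolute delta is 0.032 on CIDEr, so doubling the batch helps only marginally compared to the CE-to-SCST jump.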
After following the instructions in the Readme.md, the results I reproduced are (I ran the experiment 3 times, and all runs produced the same performance):
There is a gap in the BLEU and CIDEr metrics. I conducted the experiments on two V100 GPUs.