Yes, the metrics need to be multiplied by 100. The results you're getting seem quite low, though, so I don't think you're using their pre-trained model correctly (if that's your intention). The results I get (on the captioning test set):
{
"Bleu_1": 0.8199021730414398,
"Bleu_2": 0.6724468061157247,
"Bleu_3": 0.5307472138105488,
"Bleu_4": 0.4096071309909314,
"METEOR": 0.3107534987668901,
"ROUGE_L": 0.6094049152062189,
"CIDEr": 1.4086486963697127,
"SPICE": 0.25164499355739517
}
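If you want to compare against the numbers reported in the repository, multiplying by 100 is enough. Here is a minimal sketch in plain Python, using the values from the dictionary above (it assumes, as stated, that the reported numbers are simply the raw scores times 100, CIDEr included; the rounding is only for display):

# Raw scores as produced by the evaluation, copied from the output above.
raw_metrics = {
    "Bleu_1": 0.8199021730414398,
    "Bleu_4": 0.4096071309909314,
    "CIDEr": 1.4086486963697127,
    "SPICE": 0.25164499355739517,
}

# Rescale to the 0-100 range used in the reported results.
scaled = {name: round(value * 100, 1) for name, value in raw_metrics.items()}
print(scaled)  # {'Bleu_1': 82.0, 'Bleu_4': 41.0, 'CIDEr': 140.9, 'SPICE': 25.2}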
I am training on a new Image Captioning set, so I expect somewhat lower values. Thank you!
I am using the Image Captioning downstream task with the file run_captioning.py. When evaluating a model, either during training (using --evaluate_during_training) or in evaluation only (using --do_eval), the program calculates the metrics Bleu_1, Bleu_2, Bleu_3, Bleu_4, ROUGE_L, CIDEr, and SPICE (I removed METEOR because it was causing trouble). This is an example of a results output:
Do the metrics need to be multiplied, for example by 100, to be on the same scale as the results reported in this repository?
The results in this repository are:
Note how the Bleu_4, CIDEr, and SPICE scores are on completely different scales.
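For context, raw caption metrics on this scale are what the standard coco-caption scorers return: BLEU, ROUGE_L, and SPICE live in [0, 1], while CIDEr ranges up to roughly 10, and reported tables usually multiply everything by 100. Below is a minimal sketch using the pycocoevalcap package (an assumption about the underlying tooling, not necessarily the exact evaluation path in run_captioning.py; METEOR and SPICE are omitted since they need the Java scorers):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Toy data: image id -> list of tokenized caption strings.
gts = {
    "img1": ["a dog runs across the grass", "a brown dog running on a lawn"],
    "img2": ["a man rides a bicycle down the street"],
}
res = {
    "img1": ["a dog is running on the grass"],
    "img2": ["a man riding a bike on a street"],
}

scorers = [
    (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
    (Rouge(), ["ROUGE_L"]),
    (Cider(), ["CIDEr"]),
]

results = {}
for scorer, names in scorers:
    score, _ = scorer.compute_score(gts, res)          # raw 0-1 scale (CIDEr up to ~10)
    score = score if isinstance(score, list) else [score]
    results.update(dict(zip(names, score)))

print({name: value * 100 for name, value in results.items()})  # repository scale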