microsoft / Oscar

Oscar and VinVL

How should I interpret the Image Captioning metrics? #96

Closed. EByrdS closed this issue 3 years ago.

EByrdS commented 3 years ago

I am using the Image Captioning downstream task with the file run_captioning.py.

When evaluating a model, either during training (using --evaluate_during_training) or in a standalone evaluation (using --do_eval), the program computes the metrics Bleu_1, Bleu_2, Bleu_3, Bleu_4, ROUGE_L, CIDEr, and SPICE (I removed METEOR because it was causing trouble).

This is an example of a results output:

{
    "Bleu_1": 0.24398833981673768,
    "Bleu_2": 0.10863639161535094,
    "Bleu_3": 0.05178606365735881,
    "Bleu_4": 0.024236911826604902,
    "ROUGE_L": 0.22220575556650995,
    "CIDEr": 0.1724441143646292,
    "SPICE": 0.10137202845604548
}
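
For context, scores like these come from the COCO caption evaluation toolkit (pycocoevalcap), which the Oscar evaluation code builds on. Below is a minimal sketch, assuming that package is installed, using made-up toy captions (METEOR and SPICE have separate Java-based scorers and are omitted); it shows why the values are fractions rather than percentages:

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map an image id to a list of tokenized, lower-cased captions.
gts = {0: ["a man riding a horse on the beach"],      # reference captions
       1: ["two dogs playing in the snow"]}
res = {0: ["a person rides a horse near the ocean"],  # generated captions
       1: ["dogs play in the snow"]}

bleu, _ = Bleu(4).compute_score(gts, res)     # list of 4 floats: Bleu_1 .. Bleu_4
rouge_l, _ = Rouge().compute_score(gts, res)  # single float
cider, _ = Cider().compute_score(gts, res)    # single float

# All scores are raw fractions (CIDEr can exceed 1.0), not percentages.
print({"Bleu_4": bleu[3], "ROUGE_L": rouge_l, "CIDEr": cider})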

Do the metrics need to be multiplied, for example by 100, to be on the same scale as the results reported in this repository?

The results in this repository are:

Model     B@4 (Bleu_4)   M (METEOR)   C (CIDEr)   S (SPICE)
Oscar_B   40.5           29.7         137.6       22.8
Oscar_L   41.7           30.6         140.0       24.5

Note how the Bleu_4, CIDEr, and SPICE scores are on completely different scales.

nihirv commented 3 years ago

Yes, the metrics need to be multiplied by 100. That said, the results you're getting seem quite low, so I don't think you're using their pre-trained model correctly (if that is your intention). These are the results I get on the captioning test set:

{
    "Bleu_1": 0.8199021730414398,
    "Bleu_2": 0.6724468061157247,
    "Bleu_3": 0.5307472138105488,
    "Bleu_4": 0.4096071309909314,
    "METEOR": 0.3107534987668901,
    "ROUGE_L": 0.6094049152062189,
    "CIDEr": 1.4086486963697127,
    "SPICE": 0.25164499355739517
}
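
As a quick sanity check of the factor of 100 (plain Python, values copied from the dict above):

raw = {"Bleu_4": 0.4096071309909314,
       "METEOR": 0.3107534987668901,
       "CIDEr": 1.4086486963697127,
       "SPICE": 0.25164499355739517}

# Multiply by 100 and round to match the scale of the README table.
print({k: round(v * 100, 1) for k, v in raw.items()})
# {'Bleu_4': 41.0, 'METEOR': 31.1, 'CIDEr': 140.9, 'SPICE': 25.2}
# i.e. in the same ballpark as the B@4 / M / C / S columns reported in the repository.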
EByrdS commented 3 years ago

I am training on a new image captioning dataset, so I expect somewhat lower values. Thank you!