salesforce / densecap

BSD 3-Clause "New" or "Revised" License

Reported Scores #16

Closed PumpingAcorns closed 4 years ago

PumpingAcorns commented 5 years ago

Hello, I was wondering whether the scores reported in Table 1 of your paper are averaged across all tIoU thresholds or measured only at tIoU 0.3. I am unable to attain scores close to those values. Is it possible that the model checkpoint provided in the repo is an earlier checkpoint than the one used for the reported scores?

Thanks for your time

LuoweiZhou commented 5 years ago

Hi @PumpingAcorns, our eval code is the official evaluation code from the ActivityNet 2017 Challenge. It received a few major bug fixes ahead of the ActivityNet 2018 Challenge (for example, here). Basically, in the old language eval code, proposals that do not overlap with any GT segments are not penalized. You can also find this info in our repo.
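
To make the effect concrete, here is a minimal sketch (my own illustration, not the actual densevid_eval code; the function and variable names are assumptions) of why counting unmatched proposals as zero lowers the average:

```python
# Illustration only: how skipping vs. zero-scoring unmatched proposals
# changes the averaged caption score. Not the real densevid_eval logic.

def old_style_average(per_proposal):
    # per_proposal: list of (matched_gt: bool, caption_score: float)
    matched = [s for m, s in per_proposal if m]
    return sum(matched) / len(matched) if matched else 0.0  # unmatched ignored

def fixed_average(per_proposal):
    scores = [s if m else 0.0 for m, s in per_proposal]      # unmatched count as 0
    return sum(scores) / len(scores) if scores else 0.0

proposals = [(True, 0.25), (True, 0.30), (False, 0.0), (False, 0.0)]
print(old_style_average(proposals))  # 0.275  (old behaviour, higher)
print(fixed_average(proposals))      # 0.1375 (fixed behaviour, lower)
```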

Unispac commented 5 years ago

Hi @LuoweiZhou, I seem to be running into the same problem. So do you mean that under the new evaluation standard the scores will be much lower, because proposals that do not overlap with any GT segments are now penalized?

Unispac commented 5 years ago

Calculated tIoU: 0.3, Bleu_1: 0.196
Calculated tIoU: 0.3, Bleu_2: 0.094
Calculated tIoU: 0.3, Bleu_3: 0.048
Calculated tIoU: 0.3, Bleu_4: 0.024

These seem too low compared with the reported scores.

LuoweiZhou commented 5 years ago

@Unispac Yes. For further assistance regarding the old/new evaluation metrics, please refer to the densevid_eval repo. I'd recommend including both, as the old one emphasizes caption quality while the new one considers both localization quality and caption quality.

Unispac commented 5 years ago

@LuoweiZhou I notice that the model gives too many proposals for a single video, and all of their scores are very high. If the model were used in a real application, wouldn't there be many redundant sentences? It seems hard to set a threshold that keeps the number of sentences low, because so many proposals get a high score.

LuoweiZhou commented 5 years ago

This is due to the nature of the event/action proposal problem, which favors recall more than precision. Details on the evaluation metric are here. It would be better to adopt a metric that balances recall and precision.
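
For example, here is a rough sketch (my own code, not from this repo; the segment format and helper names are assumptions) of balancing the two with an F1 score at a single tIoU threshold:

```python
# Rough illustration: precision, recall, and F1 over temporal proposals
# at one tIoU threshold. Not the official evaluation code.

def tiou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(proposals, gt_segments, threshold=0.5):
    # A proposal counts as correct if it overlaps some GT segment above the
    # threshold; a GT segment counts as recalled if some proposal covers it.
    hit_props = sum(any(tiou(p, g) >= threshold for g in gt_segments) for p in proposals)
    hit_gts = sum(any(tiou(p, g) >= threshold for p in proposals) for g in gt_segments)
    precision = hit_props / len(proposals) if proposals else 0.0
    recall = hit_gts / len(gt_segments) if gt_segments else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Redundant high-confidence proposals keep recall at 1.0 but pull precision
# (and therefore F1) down.
proposals = [(0, 10), (2, 12), (5, 15), (20, 30), (40, 50), (41, 49)]
gt_segments = [(0, 11), (40, 50)]
print(precision_recall_f1(proposals, gt_segments))  # ~ (0.67, 1.0, 0.8)
```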

awkrail commented 4 years ago

Hi @LuoweiZhou, I ran the eval code from the ActivityNet 2017 Challenge using the pre-trained model that you released. I'd like to evaluate on the YouCook2 dataset rather than the ActivityNet Captions dataset, so the evaluation command is as follows:

python tools/densevid_eval/evaluate.py -s results/densecap_validation_yc2-2L-gt-mask-19.json -r data/yc2/val_yc2.json

I got the following result:

--------------------------------------------------------------------------------
tIoU:  0.3
--------------------------------------------------------------------------------
| CIDEr: 16.6417
| Bleu_4: 1.4345
| Bleu_3: 4.3577
| Bleu_2: 10.7241
| Bleu_1: 24.1527
| Precision: 63.3868
| ROUGE_L: 24.9169
| METEOR: 9.1623
| Recall: 77.6682
--------------------------------------------------------------------------------
tIoU:  0.5
--------------------------------------------------------------------------------
| CIDEr: 18.5657
| Bleu_4: 1.3480
| Bleu_3: 4.1988
| Bleu_2: 10.8403
| Bleu_1: 24.2929
| Precision: 29.9825
| ROUGE_L: 25.0485
| METEOR: 9.2756
| Recall: 69.6433
--------------------------------------------------------------------------------
tIoU:  0.7
--------------------------------------------------------------------------------
| CIDEr: 21.6450
| Bleu_4: 1.1583
| Bleu_3: 3.6553
| Bleu_2: 10.4516
| Bleu_1: 23.8718
| Precision: 9.2781
| ROUGE_L: 24.2764
| METEOR: 9.1550
| Recall: 58.7672
--------------------------------------------------------------------------------
tIoU:  0.9
--------------------------------------------------------------------------------
| CIDEr: 33.8437
| Bleu_4: 0.6993
| Bleu_3: 2.7428
| Bleu_2: 8.0622
| Bleu_1: 19.4044
| Precision: 0.8960
| ROUGE_L: 17.6077
| METEOR: 7.3083
| Recall: 33.5861
--------------------------------------------------------------------------------
Average across all tIoUs
--------------------------------------------------------------------------------
| CIDEr: 22.6740
| Bleu_4: 1.1600
| Bleu_3: 3.7386
| Bleu_2: 10.0195
| Bleu_1: 22.9305
| Precision: 25.8858
| ROUGE_L: 22.9624
| METEOR: 8.7253
| Recall: 59.9162
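
For reference, the "Average across all tIoUs" block appears to be the plain mean over the four tIoU thresholds. A quick sanity check on the METEOR values above (my own sketch, not part of the eval script):

```python
# Mean of the per-threshold METEOR values printed above.
meteor_per_tiou = {0.3: 9.1623, 0.5: 9.2756, 0.7: 9.1550, 0.9: 7.3083}
print(round(sum(meteor_per_tiou.values()) / len(meteor_per_tiou), 4))  # 8.7253
```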

However, the result is different from the reported scores in Table 3 of your paper. Could you tell me how to reproduce those scores? Thanks.

(Attached screenshot, 2020-07-18 16:43:12)

LuoweiZhou commented 4 years ago

Your results correspond to the last row, last two columns (0.30 and 6.58). The model has been improved since our initial paper submission. You can use either one as your reported baseline result; just note whether it is from the latest codebase (preferred) or from the original paper.

awkrail commented 4 years ago

OK, thank you for your reply.

lucky-23 commented 4 years ago

Hi, I ran the evaluation code from the ActivityNet 2017 Challenge using the pre-trained model that you released, on the ActivityNet Captions dataset. However, the scores are still much lower than those reported in the paper. You said that the old evaluation code does not penalize proposals that do not overlap with GT segments, but the scores are still very low. Can you please explain why this is happening?

LuoweiZhou commented 4 years ago

Marked as a duplicate of this issue: https://github.com/salesforce/densecap/issues/41