Closed — @PumpingAcorns closed this issue 4 years ago
Hi @PumpingAcorns, our eval code is the official evaluation code from the ActivityNet 2017 Challenge. A few major bugs were fixed for the ActivityNet 2018 Challenge (for example, here). In short, the old language eval code did not penalize proposals that do not overlap with any GT segment. You can also find this info in our repo.
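To make the difference concrete, here is a minimal sketch (my own illustration, not the official eval code; the function and variable names are hypothetical) of how penalizing unmatched proposals changes the aggregate score:

```python
def tiou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return inter / union if union > 0 else 0.0

def aggregate(proposals, gt_segments, caption_score,
              threshold=0.5, penalize_unmatched=True):
    """Average a per-proposal caption metric (e.g. METEOR) over proposals.

    Old behaviour (penalize_unmatched=False): average only over proposals
    that overlap some GT segment above the tIoU threshold.
    New behaviour (penalize_unmatched=True): unmatched proposals count
    as 0, dragging the average down.
    """
    scores = []
    for i, prop in enumerate(proposals):
        matched = any(tiou(prop, gt) >= threshold for gt in gt_segments)
        if matched:
            scores.append(caption_score[i])
        elif penalize_unmatched:
            scores.append(0.0)  # unmatched proposal is now penalized
    return sum(scores) / len(scores) if scores else 0.0
```

For example, with two proposals each scoring 10 on the caption metric but only one overlapping a GT segment, the old scheme reports 10 while the new scheme reports 5, which is why scores drop sharply under the new metric.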
Hi @LuoweiZhou, I seem to have hit the same problem. So, do you mean that under the new evaluation standard the scores will be very low because proposals that do not overlap with GT segments are penalized?
They seem too low compared with the reported scores.
@Unispac Yes. For further assistance regarding the old/new evaluation metrics, please refer to the repo densevid_eval. I'd recommend including both, as the old one emphasizes caption quality and the new one considers both localization quality and caption quality.
@LuoweiZhou I notice that the model generates too many proposals for a single video, and all of their scores are very high. If the model were used in a real application, wouldn't there be many redundant sentences? It seems hard to set a threshold that keeps the number of sentences low, because so many proposals get a high score...
This is due to the nature of the event/action proposal problem, which favors recall over precision. Details on the evaluation metric are here. It would be better to adopt a metric that balances recall and precision.
Hi @LuoweiZhou, I ran the eval code from the ActivityNet 2017 Challenge using the pre-trained model that you released. I'd like to evaluate on the YouCook2 dataset rather than the ActivityNet Captions dataset, so the evaluation command is as follows:
python tools/densevid_eval/evaluate.py -s results/densecap_validation_yc2-2L-gt-mask-19.json -r data/yc2/val_yc2.json
I got the result as:
--------------------------------------------------------------------------------
tIoU: 0.3
--------------------------------------------------------------------------------
| CIDEr: 16.6417
| Bleu_4: 1.4345
| Bleu_3: 4.3577
| Bleu_2: 10.7241
| Bleu_1: 24.1527
| Precision: 63.3868
| ROUGE_L: 24.9169
| METEOR: 9.1623
| Recall: 77.6682
--------------------------------------------------------------------------------
tIoU: 0.5
--------------------------------------------------------------------------------
| CIDEr: 18.5657
| Bleu_4: 1.3480
| Bleu_3: 4.1988
| Bleu_2: 10.8403
| Bleu_1: 24.2929
| Precision: 29.9825
| ROUGE_L: 25.0485
| METEOR: 9.2756
| Recall: 69.6433
--------------------------------------------------------------------------------
tIoU: 0.7
--------------------------------------------------------------------------------
| CIDEr: 21.6450
| Bleu_4: 1.1583
| Bleu_3: 3.6553
| Bleu_2: 10.4516
| Bleu_1: 23.8718
| Precision: 9.2781
| ROUGE_L: 24.2764
| METEOR: 9.1550
| Recall: 58.7672
--------------------------------------------------------------------------------
tIoU: 0.9
--------------------------------------------------------------------------------
| CIDEr: 33.8437
| Bleu_4: 0.6993
| Bleu_3: 2.7428
| Bleu_2: 8.0622
| Bleu_1: 19.4044
| Precision: 0.8960
| ROUGE_L: 17.6077
| METEOR: 7.3083
| Recall: 33.5861
--------------------------------------------------------------------------------
Average across all tIoUs
--------------------------------------------------------------------------------
| CIDEr: 22.6740
| Bleu_4: 1.1600
| Bleu_3: 3.7386
| Bleu_2: 10.0195
| Bleu_1: 22.9305
| Precision: 25.8858
| ROUGE_L: 22.9624
| METEOR: 8.7253
| Recall: 59.9162
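(As a side note, the "Average across all tIoUs" block above is simply the arithmetic mean of the four per-threshold scores; a quick check with the METEOR values:)

```python
# Sanity check (my own, not part of the eval script): the averaged row
# is the arithmetic mean of the per-tIoU scores printed above.
meteor = {0.3: 9.1623, 0.5: 9.2756, 0.7: 9.1550, 0.9: 7.3083}
avg = sum(meteor.values()) / len(meteor)
print(round(avg, 4))  # 8.7253, matching the averaged METEOR row
```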
However, these results differ from the scores reported in Table 3 of your paper. Could you tell me how to reproduce the reported scores? Thanks.
Your results correspond to the last row of the last two columns (0.30 and 6.58). The model has been improved since our initial paper submission. You can use either one as your reported baseline results; just note whether it comes from the latest codebase (preferred) or from the original paper.
OK, thank you for your reply.
Hi, I ran the evaluation code from the ActivityNet 2017 Challenge using the pre-trained model that you released, on the ActivityNet dataset. However, the scores are still much lower than those given in the paper. You said that the old evaluation code does not penalize proposals that do not overlap with GT segments, but the scores are still very low. Can you please explain why this is happening?
Marked as a duplicate of this issue: https://github.com/salesforce/densecap/issues/41
Hello, I was wondering whether the scores reported in Table 1 of your paper are averaged across all tIoU thresholds or taken at tIoU 0.3. I am unable to attain scores close to those values; is it possible that the model checkpoint provided in the repo is an earlier checkpoint than the one used for the reported scores?
Thanks for your time