v-iashin / BMT

Source code for "Bi-modal Transformer for Dense Video Captioning" (BMVC 2020)
https://v-iashin.github.io/bmt
MIT License

Final results #52

Closed AIENG2020 closed 1 year ago

AIENG2020 commented 1 year ago

Hi, v-iashin! After finishing the evaluation, I got the following results using the pre-trained models (best_cap_model.pt & best_prop_model.pt) from this repository:

```
0411021941: learned_props 1by1 26 @ 0: 100%|██████████| 6998/6998 [6:58:57<00:00, 3.59s/it]
PTBTokenizer tokenized 12044821 tokens at 3094732.62 tokens per second.
PTBTokenizer tokenized 11638061 tokens at 3382104.47 tokens per second.
PTBTokenizer tokenized 12044821 tokens at 3013699.36 tokens per second.
...
PTBTokenizer tokenized 1499705 tokens at 1944330.17 tokens per second.
./captioning_results_learned_props_e26.json
{0.3: {'Bleu_1': 0.21855237245090708, 'Bleu_2': 0.11829910025778011, 'Bleu_3': 0.06527576698633347,
       'Bleu_4': 0.03286126222797009, 'METEOR': 0.11318173026044272, 'ROUGE_L': 0.2214381639467888,
       'CIDEr': 0.12540909027725913, 'Recall': 0.7653550052445841, 'Precision': 0.8432352890940963},
 0.5: {'Bleu_1': 0.1797576627363622, 'Bleu_2': 0.09703088725902112, 'Bleu_3': 0.05261611349091552,
       'Bleu_4': 0.025653078928295117, 'METEOR': 0.10449090865385391, 'ROUGE_L': 0.17540336633318065,
       'CIDEr': 0.12255632406512765, 'Recall': 0.6267787136401878, 'Precision': 0.574749139042341},
 0.7: {'Bleu_1': 0.10613642314342578, 'Bleu_2': 0.057339390086598954, 'Bleu_3': 0.030490115612125883,
       'Bleu_4': 0.014518221261694548, 'METEOR': 0.07997454689235316, 'ROUGE_L': 0.09942682704857322,
       'CIDEr': 0.10579470701931824, 'Recall': 0.49080783704361436, 'Precision': 0.2893747835081578},
 0.9: {'Bleu_1': 0.028548101221267407, 'Bleu_2': 0.01514513993899966, 'Bleu_3': 0.007702772567403976,
       'Bleu_4': 0.003440348885064907, 'METEOR': 0.03290084691301671, 'ROUGE_L': 0.026592904083137726,
       'CIDEr': 0.048605469972506386, 'Recall': 0.3467758615092933, 'Precision': 0.07477945163071767},
 'Average across tIoUs':
      {'Bleu_1': 0.1332486398879906, 'Bleu_2': 0.07195362938559996, 'Bleu_3': 0.03902119216419472,
       'Bleu_4': 0.019118227825756166, 'METEOR': 0.08263700817991662, 'ROUGE_L': 0.1307153153529201,
       'CIDEr': 0.10059139783355285, 'Recall': 0.55742935435942, 'Precision': 0.44553466581882817}}
```
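For reference, the `'Average across tIoUs'` entry appears to be the plain mean of each metric over the four tIoU thresholds; a minimal sanity check (not part of the BMT repo) using the METEOR values from the output above:

```python
# Hypothetical check: per-threshold METEOR scores copied from the
# evaluation output above, keyed by tIoU threshold.
meteor_at_tiou = {
    0.3: 0.11318173026044272,
    0.5: 0.10449090865385391,
    0.7: 0.07997454689235316,
    0.9: 0.03290084691301671,
}

# Unweighted mean over the four thresholds.
avg_meteor = sum(meteor_at_tiou.values()) / len(meteor_at_tiou)
print(avg_meteor)  # matches the 'Average across tIoUs' METEOR entry
```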

Should these scores (e.g. BLEU, METEOR, ...) be multiplied by 100 so that they are roughly on the same scale as the scores in your paper? More generally, are the results in dense video captioning papers reported multiplied by 100? Hoping for your reply!

AIENG2020 commented 1 year ago

I have read some papers and found that all values are reported as percentages.
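In that case the conversion is a one-liner; a minimal sketch (my own helper, not part of the BMT repo), assuming a metrics dict shaped like the `'Average across tIoUs'` entry above:

```python
# Hypothetical helper: rescale fraction-valued metrics to the 0-100
# percentage range used in papers. Values copied from the
# 'Average across tIoUs' entry of the evaluation output above.
avg_metrics = {
    'Bleu_4': 0.019118227825756166,
    'METEOR': 0.08263700817991662,
    'CIDEr': 0.10059139783355285,
}

# Multiply by 100 and round to two decimals for reporting.
as_percent = {name: round(value * 100, 2) for name, value in avg_metrics.items()}
print(as_percent)  # {'Bleu_4': 1.91, 'METEOR': 8.26, 'CIDEr': 10.06}
```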