Hi, v-iashin!
After finishing the evaluation, I got the following results using your pre-trained models (best_cap_model.pt & best_prop_model.pt) from this GitHub repo:
0411021941: learned_props 1by1 26 @ 0: 100%|██████████| 6998/6998 [6:58:57<00:00, 3.59s/it]
PTBTokenizer tokenized 12044821 tokens at 3094732.62 tokens per second.
PTBTokenizer tokenized 11638061 tokens at 3382104.47 tokens per second.
PTBTokenizer tokenized 12044821 tokens at 3013699.36 tokens per second.
...
PTBTokenizer tokenized 1499705 tokens at 1944330.17 tokens per second.
./captioning_results_learned_props_e26.json

Metrics per tIoU threshold (rounded to four decimal places):

| tIoU | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | METEOR | ROUGE_L | CIDEr | Recall | Precision |
|------|--------|--------|--------|--------|--------|---------|--------|--------|-----------|
| 0.3 | 0.2186 | 0.1183 | 0.0653 | 0.0329 | 0.1132 | 0.2214 | 0.1254 | 0.7654 | 0.8432 |
| 0.5 | 0.1798 | 0.0970 | 0.0526 | 0.0257 | 0.1045 | 0.1754 | 0.1226 | 0.6268 | 0.5747 |
| 0.7 | 0.1061 | 0.0573 | 0.0305 | 0.0145 | 0.0800 | 0.0994 | 0.1058 | 0.4908 | 0.2894 |
| 0.9 | 0.0285 | 0.0151 | 0.0077 | 0.0034 | 0.0329 | 0.0266 | 0.0486 | 0.3468 | 0.0748 |
| Average across tIoUs | 0.1332 | 0.0720 | 0.0390 | 0.0191 | 0.0826 | 0.1307 | 0.1006 | 0.5574 | 0.4455 |
Should these scores (e.g., BLEU, METEOR, ...) be multiplied by 100, i.e. rescaled, to be roughly on the same scale as the scores in your paper? And more generally, are the results in dense video captioning papers rescaled by multiplying by 100? Looking forward to your reply!
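In case it clarifies the question, this is the kind of rescaling I mean (a minimal sketch; the values are copied from the 'Average across tIoUs' entries in the output above, and the assumption that the script reports fractions in [0, 1] while the paper reports percentages is mine):

```python
# Minimal sketch: rescale the averaged metrics from the log above by 100.
# Assumption (mine): the evaluation script reports fractions in [0, 1],
# while the paper reports the same metrics as percentages.
avg_metrics = {
    'Bleu_3': 0.03902119216419472,
    'Bleu_4': 0.019118227825756166,
    'METEOR': 0.08263700817991662,
    'ROUGE_L': 0.1307153153529201,
    'CIDEr': 0.10059139783355285,
}  # 'Average across tIoUs' values copied from the output above

for name, value in avg_metrics.items():
    print(f'{name}: {100 * value:.2f}')  # e.g. METEOR: 8.26
```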