microsoft / UniVL

An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
https://arxiv.org/abs/2002.06353
MIT License

About msrvtt retrieval results #17

zhangliang-04 closed this issue 3 years ago

zhangliang-04 commented 3 years ago

I found that the MSRVTT text-to-video retrieval performance under the FT-Joint setting reported in the readme is R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0, but the result in the paper is R@1: 0.206 - R@5: 0.491 - R@10: 0.629 - Median R: 6.0. What accounts for the difference? Additionally, what should the performance of the FT-Align setting be? It seems to be missing from the readme. I tried fine-tuning with the scripts released in the repo, but got a worse score than FT-Joint on MSRVTT.

ArrowLuo commented 3 years ago

Hi @zhangliang-04,

  1. Our paper reports results on ‘Training-7K’, which follows the data splits from (Miech et al., 2019). However, the readme reports results on ‘Training-9K’, which follows the data splits from (Gabeur et al., 2020). You can find both files, MSRVTT_train.7k.csv and MSRVTT_train.9k.csv, in our released msrvtt.zip.
  2. Our FT-Align run (on ‘Training-9K’) used a smaller batch size because of our limited GPUs. As a result, its results on ‘Training-9K’ show no obvious advantage over FT-Joint. In our experience, the fine-tuning hyper-parameters are important, and the best settings for FT-Align may not be the same as those for FT-Joint. You can test on ‘Training-7K’, as reported in our paper.
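
For anyone comparing the numbers quoted in this thread, here is a minimal sketch of how R@K and Median R are commonly computed from a text-to-video similarity matrix. It is not taken from the UniVL code. The function name `retrieval_metrics`, the toy matrix, and the assumption that video `i` is the ground-truth match for text query `i` are all illustrative, and the repo's own evaluation may handle ties or indexing differently.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K and Median R for text-to-video retrieval.

    sim[i][j] is the similarity of text query i to video j.
    Assumption (illustrative): video i is the correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # 1-based rank: one plus the number of videos scored strictly higher
        rank = 1 + sum(1 for score in row if score > target)
        ranks.append(rank)

    n = len(ranks)
    # R@K is the fraction of queries whose correct video ranks in the top K
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    # Median R is the median rank of the correct video (lower is better)
    metrics["Median R"] = median(ranks)
    return metrics


# Toy 3x3 example: query 0 ranks its video first, queries 1 and 2 rank theirs second
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.1, 0.7, 0.4],
]
print(retrieval_metrics(sim, ks=(1, 2)))
```

Because both metrics depend only on the test-set rankings, a gap like the one between the readme and the paper has to come from the model that was trained (for example, on a different training split) and not from the metric definition itself.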