showlab / all-in-one

[CVPR2023] All in One: Exploring Unified Video-Language Pre-training
https://arxiv.org/abs/2203.07303

Details of fine-tuning on MSRVTT-QA #7

Closed zengyan-97 closed 2 years ago

zengyan-97 commented 2 years ago

Hi,

I am wondering about some of your experimental settings for MSRVTT-QA. Could you please clarify them?

1) What's the image resolution, 224x224?

2) How do you deal with open-ended VQA like MSRVTT-QA? The paper only mentions that you converted it to a classification task. Did you choose the top-k answers? If so, what is k?

Thanks!

zengyan-97 commented 2 years ago
  3) Did you also use the validation set for training?

Thanks again!

FingerRec commented 2 years ago
  1. 224
  2. Top-1; K is the size of the vocabulary
  3. No

zengyan-97 commented 2 years ago

Thanks for your reply! For the second question, I am wondering what the size of your vocabulary is. There seem to be 7000+ distinct answers across the train+val+test sets (including 1000+ answers that appear only in the test set). So, did you pick the top-k most frequent answers in the train+val sets when modeling it as a classification task?

FingerRec commented 2 years ago

We only use the train set. Answers that appear in the val/test sets but not in the train set are directly counted as wrong.

This follows previous work such as ClipBERT, for fair comparison.
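The setup described above can be sketched as follows. This is a minimal illustration (not the repository's actual code): the classifier's label space is built from the distinct answers in the train set, and any ground-truth answer outside that vocabulary is scored as incorrect. The function names here are hypothetical.

```python
from collections import Counter

def build_answer_vocab(train_answers):
    """Map each distinct train-set answer to a class index,
    ordered by frequency (most frequent answer gets index 0)."""
    counts = Counter(train_answers)
    return {ans: idx for idx, (ans, _) in enumerate(counts.most_common())}

def qa_accuracy(predicted_indices, gt_answers, vocab):
    """Accuracy under the protocol above: a prediction is correct only
    if the ground-truth answer is in the train vocab AND the predicted
    class index matches it; out-of-vocab answers count as wrong."""
    correct = 0
    for pred_idx, gt in zip(predicted_indices, gt_answers):
        if gt in vocab and vocab[gt] == pred_idx:
            correct += 1
    return correct / len(gt_answers)
```

At inference the model just takes the top-1 (argmax) class over the K train-set answers, so a test question whose answer was never seen during training can never be scored correct.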

zengyan-97 commented 2 years ago

Got it. Thanks!