zengyan-97 closed 2 years ago
Thanks again!
Thanks for your reply! For the second question, I am wondering what the size of your answer vocabulary is. It seems that there are 7000+ different answers in the train+valid+test sets (including 1000+ unseen answers in the test set). So, did you pick the top-k most frequent answers in the train+valid sets when modeling it as a classification task?
Only the train set is used. Answers that appear in the val and test sets but not in train are directly counted as wrong.
This follows previous work such as ClipBERT, for fair comparison.
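For readers following along, here is a minimal sketch of that protocol, not the actual code from this repo. The function names `build_answer_vocab` and `accuracy` are hypothetical, and the sketch assumes every distinct train answer becomes one class (no top-k cutoff is stated in this thread).

```python
from collections import Counter


def build_answer_vocab(train_answers):
    """Map each distinct answer seen in the train split to a class index.

    Hypothetical helper: assumes all train answers are kept as classes.
    """
    counts = Counter(train_answers)
    # Most frequent answers get the lowest indices.
    return {ans: idx for idx, (ans, _) in enumerate(counts.most_common())}


def accuracy(predicted_ids, gold_answers, vocab):
    """Accuracy where a gold answer absent from the train vocab is always wrong."""
    correct = 0
    for pred_id, gold in zip(predicted_ids, gold_answers):
        gold_id = vocab.get(gold)  # None if the answer never occurs in train
        if gold_id is not None and pred_id == gold_id:
            correct += 1
    return correct / len(gold_answers) if gold_answers else 0.0
```

With this setup, a val/test question whose answer is outside the train vocabulary can never be scored as correct, whatever the classifier predicts.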
Got it. Thanks!
I am wondering about some of your experimental settings for MSRVTT-QA. Could you please clarify them?
1) What is the image resolution? 224x224?
2) How do you deal with open-ended VQA like MSRVTT-QA? The paper only mentions that you converted it to a classification task. Did you choose the top-k answers? If so, what is k?