salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method

Reproducing the VQA candidate answers from the dataset and paper #135

Open MagnusOstertag opened 5 months ago

MagnusOstertag commented 5 months ago

Hi, first of all, thanks for the amazing work!

You wrote in the paper: "For a fair comparison with existing methods, we constrain the decoder to only generate from the $3,192$ candidate answers". In the downloadable data, however, the answer_list contains only $3,128$ elements. I first suspected a digit-swap typo (plus an off-by-one), because the paper you cite says: "The number of outputs is determined by the minimum occurrence of the answer in unique questions as nine times in the dataset, which is $3,129$."

When I try to reproduce the answer_list from the provided answers, or directly from VQAv2, I get a different number of candidate answers, and nearly 300 of them differ from the provided list. So how was the answer list actually created? (I count occurrences of unique answers, not unique questions, since counting questions would make the problem ambiguous. I use a threshold of at least 9 occurrences per answer and standardize each answer as in VQAEval. When I include the VisualGenome answers in addition to the VQAv2.0 ones, I get a much higher number of candidate answers.) A minimal sketch of my counting procedure is below.
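For concreteness, this is roughly what I do. The annotation path is a placeholder, `normalize` is a simplified stand-in for the full VQAEval normalization (which also expands contractions and number words), and I count every annotator answer rather than one per question, which is part of the ambiguity I am asking about:

```python
import json
import re
from collections import Counter

# Placeholder path to the raw VQAv2.0 training annotations.
ANNOTATION_FILE = "v2_mscoco_train2014_annotations.json"

# Simplified subset of the VQAEval normalization; the real VQAEval also
# expands contractions and converts number words to digits.
ARTICLES = {"a", "an", "the"}
PUNCT = re.compile(r"[;/\[\]\"{}()=+\\_\-><@`,?!.']")

def normalize(ans: str) -> str:
    ans = PUNCT.sub("", ans.lower().strip())
    return " ".join(w for w in ans.split() if w not in ARTICLES)

with open(ANNOTATION_FILE) as f:
    annotations = json.load(f)["annotations"]

# Count each normalized answer across all annotator answers
# (10 per question in VQAv2.0).
counts = Counter(
    normalize(a["answer"])
    for ann in annotations
    for a in ann["answers"]
)

# Keep answers occurring at least 9 times, as in the cited paper.
answer_list = sorted(ans for ans, n in counts.items() if n >= 9)
print(len(answer_list))
```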

I further noticed that you seem to have excluded 7 questions from VQAv2.0 in vqa_train/val, namely those with question IDs 268735002, 293514000, 147314003, 68003002, 451818000, 362391000, and 196280004. Why is that?

Best, Magnus

MagnusOstertag commented 2 weeks ago

I now understand that the excluded questions have no answers, so excluding them makes perfect sense!
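For anyone else reading this, a quick sketch for verifying it against the raw annotations. The file path is a placeholder, some of these IDs may live in the val split rather than train, and the check simply prints whatever answers those entries carry:

```python
import json

# Question IDs that ALBEF's vqa_train/val files leave out of VQAv2.0.
EXCLUDED_IDS = {268735002, 293514000, 147314003, 68003002,
                451818000, 362391000, 196280004}

# Placeholder path to the raw VQAv2.0 training annotations.
with open("v2_mscoco_train2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

for ann in annotations:
    if ann["question_id"] in EXCLUDED_IDS:
        # Expect an empty or all-blank answers list for these entries.
        print(ann["question_id"], ann["answers"])
```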