Closed: mcogswell closed this 6 years ago
You ignored this line: "In order to be consistent with 'human accuracies', machine accuracies are averaged over all 10 choose 9 sets of human annotators." If you actually average them over all 10 choose 9 sets, you will see that the hardcoded values are correct.
Hmmm. Yup, that makes sense. Thanks for the code.
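For reference, a minimal sketch of that 10 choose 9 averaging (assuming the standard VQA accuracy rule min(#matches / 3, 1); the answer counts and loop here are purely illustrative). It reproduces the hardcoded 0.3 / 0.6 / 0.9 / 1.0 soft scores:

```python
from itertools import combinations

def exact_soft_score(k, num_annotators=10):
    """Exact soft score for an answer given by k of the 10 annotators:
    average min(#matches among the remaining 9 / 3, 1) over all C(10, 9) subsets."""
    votes = [1] * k + [0] * (num_annotators - k)  # 1 = annotator gave this answer
    subsets = list(combinations(range(num_annotators), num_annotators - 1))
    return sum(
        min(sum(votes[i] for i in s) / 3.0, 1.0) for s in subsets
    ) / len(subsets)

for k in range(5):
    print(k, round(exact_soft_score(k), 4))
# 0 -> 0.0, 1 -> 0.3, 2 -> 0.6, 3 -> 0.9, 4 -> 1.0
```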
It looks like the VQA score computed here isn't quite the metric as specified for VQA evaluation; it slightly underestimates the actual VQA score. This pull request fixes that. Note that it requires re-caching the labels by running
tools/compute_softscore.py
again. When I ran this on some models I've been testing, most scores went up by about 0.9.
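For context on why re-caching matters: the reported accuracy is typically just the sum of the cached soft scores picked out by each argmax prediction, so stale labels shift the number directly. A rough sketch of that scoring step (hypothetical tensor and function names; the repo's actual helper may differ):

```python
import torch

def compute_score_with_logits(logits, soft_labels):
    """Sum of the cached soft VQA score assigned to each argmax prediction.

    logits:      (batch, num_answers) model outputs
    soft_labels: (batch, num_answers) cached soft scores from compute_softscore.py
    """
    pred = logits.argmax(dim=1, keepdim=True)                       # predicted answer index
    one_hot = torch.zeros_like(soft_labels).scatter_(1, pred, 1.0)  # mark the prediction
    return (one_hot * soft_labels).sum()                            # add each prediction's soft score
```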