In your code, the image_id, image_h, image_w, num_boxes, boxes, and features were extracted and saved. But in your paper, it seems that only the features are used to represent the image. Do you use the embeddings of the predicted classes or the bounding boxes to train a VQA model?
No, we didn't use the class labels or the bounding boxes. I ran some initial experiments along those lines, but performance didn't change much. We mostly used the boxes just for visualization.
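To illustrate, here is a minimal sketch of how a saved record with those fields might be consumed downstream. The record layout, the 36-region / 2048-dim shapes, and the helper name are illustrative assumptions, not the repo's actual loading code:

```python
import numpy as np

# Hypothetical saved record, mirroring the field names from the question.
record = {
    "image_id": 1,
    "image_h": 480,
    "image_w": 640,
    "num_boxes": 36,
    "boxes": np.zeros((36, 4), dtype=np.float32),  # kept mainly for visualization
    "features": np.random.rand(36, 2048).astype(np.float32),
}

def image_representation(rec):
    """Return the visual input for the VQA model.

    Only the region features are used; class labels and box
    coordinates are ignored, matching the answer above.
    """
    return rec["features"]

v = image_representation(record)
print(v.shape)  # one feature vector per detected region
```

The point is that `boxes` travels alongside `features` in the saved files, but only `features` reaches the model.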