Closed cengzy14 closed 6 years ago
No. According to this paper, the features extracted by Faster R-CNN are used as hard attention over the spatial visual features.
No, we don't use the embedding of the predicted classes. In fact, we tried that but obtained no improvement. Of course, we do use the detected image features.
The image_id, image_h, image_w, num_boxes, boxes, and features fields were extracted and saved. However, it seems that only features is used to represent the image. Do you use the embedding of the predicted classes or the bounding boxes to train a VQA model?
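For readers following the thread, here is a rough sketch of what "only features are used" could look like in code. It is not the repository's actual loader. `RegionRecord`, `visual_input`, and `visual_input_with_boxes` are hypothetical names, and the `(x1, y1, x2, y2)` pixel layout of `boxes` is an assumption. The box-augmented variant is purely illustrative; the replies above only confirm that the detected features are used and that class embeddings did not help.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionRecord:
    """Hypothetical container mirroring the saved fields named in the question."""
    image_id: int
    image_h: int
    image_w: int
    num_boxes: int
    boxes: List[List[float]]     # num_boxes x 4, assumed (x1, y1, x2, y2) in pixels
    features: List[List[float]]  # num_boxes x feature_dim


def visual_input(record: RegionRecord) -> List[List[float]]:
    """Return only the per-region features, ignoring boxes and image size."""
    assert len(record.features) == record.num_boxes
    return record.features


def visual_input_with_boxes(record: RegionRecord) -> List[List[float]]:
    """Illustrative variant: append box corners normalized by image size.

    This is an assumption for comparison, not something the thread says was used.
    """
    out = []
    for feat, (x1, y1, x2, y2) in zip(record.features, record.boxes):
        geom = [
            x1 / record.image_w,
            y1 / record.image_h,
            x2 / record.image_w,
            y2 / record.image_h,
        ]
        out.append(list(feat) + geom)
    return out
```

With a record holding `num_boxes` regions of dimension `d`, `visual_input` yields a `num_boxes x d` list, while the illustrative variant yields `num_boxes x (d + 4)`.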