hengyuan-hu / bottom-up-attention-vqa

An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.
GNU General Public License v3.0

Dimension of spatial features #7

Closed · coldmanck closed 6 years ago

coldmanck commented 6 years ago

Hi @hengyuan-hu

Thank you for your fantastic work. I am trying to adapt your code to image captioning. I followed your code to read out the TSV file and found that the spatial features have shape spatials_features.shape = (82783, 36, 6). I do not know where this feature comes from; could you please explain it to me? I only understand that image_features.shape = (82783, 36, 2048), since each image feature is 2048-d.
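For reference, here is roughly how I read the file. This is only a minimal sketch: the field layout follows the standard bottom-up-attention TSV convention (image_id, image_w, image_h, num_boxes, boxes, features, each array base64-encoded), and the file path is a placeholder:

```python
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)

# Field layout of the bottom-up-attention TSV files (an assumption based on
# the upstream feature-extraction repo; adjust if your file differs).
FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

# Placeholder path to the 36-boxes-per-image trainval features.
with open('trainval_resnet101_faster_rcnn_genome_36.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES)
    for item in reader:
        num_boxes = int(item['num_boxes'])  # 36 boxes per image here
        # Each array field is a base64-encoded float32 buffer.
        boxes = np.frombuffer(base64.b64decode(item['boxes']),
                              dtype=np.float32).reshape((num_boxes, 4))
        features = np.frombuffer(base64.b64decode(item['features']),
                                 dtype=np.float32).reshape((num_boxes, 2048))
        break  # inspect only the first record
```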

Thanks!

xiaoxiao26-zz commented 6 years ago

Hi @coldmanck,

The spatial features are (x1, y1, x2, y2, width, height) for each bounding box, where (x1,y1) is the top left corner of the box and (x2,y2) is the bottom right corner. We were investigating the use of these spatial features for another part of our project, but they are not used in the model in this repo. I hope that helps!
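For illustration, here is a minimal sketch of how such a 6-d vector can be assembled from the raw (x1, y1, x2, y2) boxes. make_spatial_features is a hypothetical helper, and the repo's actual converter may additionally normalize coordinates by the image size:

```python
import numpy as np

def make_spatial_features(boxes):
    """Hypothetical helper: turn (num_boxes, 4) boxes given as
    (x1, y1, x2, y2) into (num_boxes, 6) spatial features
    (x1, y1, x2, y2, width, height)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    width = x2 - x1
    height = y2 - y1
    return np.stack([x1, y1, x2, y2, width, height], axis=1)
```

Stacked over all 82,783 training images with 36 boxes each, this yields the (82783, 36, 6) array you saw.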

coldmanck commented 6 years ago

@xiaoxiao26

Thank you very much! :+1:

ivy94419 commented 6 years ago

@coldmanck Have you successfully adapted this code to the image captioning task? I am new to this field; could you tell me how to modify the code in detail?

coldmanck commented 6 years ago

@ivy94419 In fact, you can now refer to this repo for the bottom-up-attention captioning code.

ivy94419 commented 6 years ago

Yeah, I have found that project, but I thought it only implemented the CVPR 2017 paper "Self-critical sequence training for image captioning". Does it also implement the CVPR 2018 paper "Bottom-up and top-down attention for image captioning and visual question answering"? Which files distinguish the two papers' implementations?