Closed — coldmanck closed this issue 6 years ago
Hi @coldmanck,
The spatial features are (x1, y1, x2, y2, width, height) for each bounding box, where (x1,y1) is the top left corner of the box and (x2,y2) is the bottom right corner. We were investigating the use of these spatial features for another part of our project, but they are not used in the model in this repo. I hope that helps!
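For anyone who wants to build this 6-d layout from raw detector boxes, here is a minimal sketch. The per-image normalization is a common convention rather than something confirmed for this repo, and the function name is hypothetical:

```python
import numpy as np

def spatial_features(boxes, image_w, image_h):
    """Build the 6-d spatial feature (x1, y1, x2, y2, width, height)
    for each bounding box, normalized by the image size.

    boxes: array-like of shape (num_boxes, 4) in pixel coordinates,
           each row (x1, y1, x2, y2) with (x1, y1) the top-left corner
           and (x2, y2) the bottom-right corner.
    """
    boxes = np.asarray(boxes, dtype=np.float32)
    x1 = boxes[:, 0] / image_w
    y1 = boxes[:, 1] / image_h
    x2 = boxes[:, 2] / image_w
    y2 = boxes[:, 3] / image_h
    w = x2 - x1   # normalized box width
    h = y2 - y1   # normalized box height
    return np.stack([x1, y1, x2, y2, w, h], axis=1)

# One box in a 200x400 image -> a (1, 6) spatial feature
feats = spatial_features([[10, 20, 110, 220]], image_w=200, image_h=400)
print(feats.shape)  # (1, 6)
```

Stacking this over 36 boxes per image across 82783 images gives exactly the `(82783, 36, 6)` shape discussed below.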
@xiaoxiao26
Thank you very much! :+1:
@coldmanck have you successfully adapted this code to the image captioning task? I am new to this field; could you tell me how to modify the code in detail?
@ivy94419 Actually, you can now refer to this repo for the bottom-up-attention code for captioning.
Yes, I have found that project, but I thought it only implemented the CVPR 2017 paper "Self-critical Sequence Training for Image Captioning". Does it also implement the CVPR 2018 paper "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering"? Which files distinguish the two implementations?
Hi @hengyuan-hu
Thank you for your fantastic work. I am trying to adapt your code to image captioning. Following your code to read the tsv file, I found that the spatial features have the shape `spatials_features.shape = (82783, 36, 6)`. I really do not know where this feature comes from; could you please explain it to me? I only understand that `image_features.shape = (82783, 36, 2048)` because each image feature is 2048-d. Thanks!
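For reference, a minimal sketch of reading such a tsv file is below. The column layout follows the convention used by the original bottom-up-attention tsv dumps (base64-encoded float32 arrays); the exact field names and the 2048-d feature size are assumptions here, so check them against the actual file:

```python
import base64
import csv
import sys

import numpy as np

# Allow very long base64 fields in the tsv rows
csv.field_size_limit(sys.maxsize)

# Assumed column layout of the bottom-up-attention tsv files
FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

def read_tsv(path, feat_dim=2048):
    """Yield (image_id, boxes, features) per image.

    boxes:    float32 array of shape (num_boxes, 4)
    features: float32 array of shape (num_boxes, feat_dim)
    """
    with open(path) as f:
        reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES)
        for item in reader:
            n = int(item['num_boxes'])
            boxes = np.frombuffer(
                base64.b64decode(item['boxes']),
                dtype=np.float32).reshape(n, 4)
            feats = np.frombuffer(
                base64.b64decode(item['features']),
                dtype=np.float32).reshape(n, feat_dim)
            yield int(item['image_id']), boxes, feats
```

With 36 boxes per image, stacking `feats` over the 82783 training images yields the `(82783, 36, 2048)` array mentioned above, and passing `boxes` plus the image size through a spatial-feature step yields the `(82783, 36, 6)` one.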