microsoft / Oscar


VINVL image captioning features #94

Open EddieKro opened 3 years ago

EddieKro commented 3 years ago

Hello! I have a question about extracting region features for image captioning:

nihirv commented 3 years ago

I was about to open an issue with a similar question... In fact, I'm struggling to see how we can get the 2048/2054-dimensional vectors for captioning.

So it seems that in run_captioning.py#115:

`features = np.frombuffer(base64.b64decode(feat_info['features']), np.float32).reshape((num_boxes, -1))`

`features` will be of dimension 1027. Whereas if we look at the VQA example (run_vqa.py#413):

`feat = np.frombuffer(base64.b64decode(arr[2]), dtype=np.float32).reshape((-1, self.args.img_feature_dim))`

with `self.args.img_feature_dim = 2054`.

With the image-captioning code, we can't reshape to (-1, 2054) because of shape mismatches, although reshaping to (-1, 1027) works fine. But I'm confused as to where the 3 extra dimensions come from (assuming 1024 is the feature dimension).

It would also be good to get clarification on whether the number of feature boxes differs from the number of objects in the image (which comes from X.label.tsv), since the object list from X.label.tsv is a set rather than a list. (In that case, would the bounding boxes only be valid for one instance of each object in the image?)

EDIT: It seems that the pred files generated by running run_captioning.py do contain the 2054-dimensional vectors 👍. To weigh in with my opinion on your problem, OP: maybe the feature vectors we are given have already been processed by a model, and thus we can't trivially recover the spatial positions?
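One plausible explanation, consistent with EddieKro's float32 remark in the next comment: if the serialized features are float16 but the bytes are decoded as float32, the element count is halved, which turns 2054 into exactly 1027. A minimal numpy sketch of that mismatch (all values here are made up):

```python
import base64

import numpy as np

num_boxes, dim = 10, 2054
feats = np.random.rand(num_boxes, dim).astype(np.float16)  # serialized as float16
encoded = base64.b64encode(feats.tobytes())

# Decoding with the wrong dtype halves the apparent dimension: 2054 -> 1027.
wrong = np.frombuffer(base64.b64decode(encoded), np.float32).reshape((num_boxes, -1))
print(wrong.shape)  # (10, 1027)

# Decoding with the matching dtype recovers the full 2054-dim vectors.
right = np.frombuffer(base64.b64decode(encoded), np.float16).reshape((num_boxes, -1))
print(right.shape)  # (10, 2054)
```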

EddieKro commented 3 years ago

I've managed to run inference on custom images by extracting a 2048-dim feature vector for each bbox and concatenating to it the box coordinates divided by the image width and height, plus the box's normalized width and height ([xtl/w, ytl/h, xbr/w, ybr/h, (xbr-xtl)/w, (ybr-ytl)/h]), where w, h are the image's width and height and xtl, ytl, xbr, ybr are the coordinates of the bbox. The resulting captions were good, so I guess I got it right. The key to getting the 2048+6 features is to make sure they are stored as float32.
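A compact sketch of that recipe; the helper name and argument layout are mine, not from the Oscar codebase:

```python
import numpy as np

# Hypothetical helper following the recipe above.
def build_region_feature(feat_2048, box, img_w, img_h):
    xtl, ytl, xbr, ybr = box
    extra = np.array([
        xtl / img_w, ytl / img_h,    # normalized top-left corner
        xbr / img_w, ybr / img_h,    # normalized bottom-right corner
        (xbr - xtl) / img_w,         # normalized box width
        (ybr - ytl) / img_h,         # normalized box height
    ], dtype=np.float32)
    # float32 matters: run_captioning.py decodes the blob with np.float32.
    return np.concatenate([np.asarray(feat_2048, dtype=np.float32), extra])  # (2054,)
```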

nihirv commented 3 years ago

> I've managed to run inference on custom images by extracting a 2048-dim feature vector for each bbox and concatenating to it the box coordinates divided by the image width and height, plus the box's normalized width and height ([xtl/w, ytl/h, xbr/w, ybr/h, (xbr-xtl)/w, (ybr-ytl)/h]), where w, h are the image's width and height and xtl, ytl, xbr, ybr are the coordinates of the bbox. The resulting captions were good, so I guess I got it right. The key to getting the 2048+6 features is to make sure they are stored as float32.

Thank you!!! Very useful information, and very timely. 👍

liutianling commented 3 years ago

@EddieKro Can you give a demo of how to extract features for an input image? Or how to run prediction on an input image? Thanks a lot.

EddieKro commented 3 years ago

@liutianling it's quite a process :)

  1. Extract image features for a folder of images using sg_benchmark as described [here](https://github.com/microsoft/scene_graph_benchmark/issues/7#issuecomment-819357369) (you'll have to create some `.tsv` and `.lineindex` files first and edit the yaml config file). Note that you should create an empty test.label file; otherwise, inference won't work.
  2. sg_benchmark will create a `predictions.tsv` file, from which we need the features, boxes, and the class and confidence for each box.
  3. To run VinVL inference you'll have to create `feature.tsv`, `label.tsv`, and a `.yaml` file using the info from `predictions.tsv`. Note that to add the 6 extra features you need to know the height and width of each image, which are stored in the `hw.tsv` file required by sg_benchmark. Here's the [gist with the example code](https://gist.github.com/EddieKro/903ad08e85d670ff2b140a888d8c67c0) (a rough sketch also follows below).

Note I only managed to run run_captioning.py using COCO; other tasks and datasets may require different inputs.
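Regarding step 3, a rough sketch of the conversion, loosely following the linked gist. The `predictions.tsv` field names used here ("objects", "feature", "rect", "class", "conf") are assumptions; check them against your actual sg_benchmark output:

```python
import base64
import json
import numpy as np

def predictions_to_oscar_rows(image_id, prediction_json, img_w, img_h):
    objects = json.loads(prediction_json)["objects"]
    feats, labels = [], []
    for obj in objects:
        f = np.frombuffer(base64.b64decode(obj["feature"]), dtype=np.float32)
        xtl, ytl, xbr, ybr = obj["rect"]
        # The 6 extra features from EddieKro's formula above.
        extra = np.array([xtl / img_w, ytl / img_h, xbr / img_w, ybr / img_h,
                          (xbr - xtl) / img_w, (ybr - ytl) / img_h], np.float32)
        feats.append(np.concatenate([f, extra]))  # 2048 + 6 = 2054 dims
        labels.append({"class": obj["class"], "conf": obj["conf"], "rect": obj["rect"]})
    feat_col = json.dumps({
        "num_boxes": len(objects),
        "features": base64.b64encode(np.stack(feats).tobytes()).decode("utf-8"),
    })
    # One tab-separated line each for feature.tsv and label.tsv; remember that
    # every .tsv file also needs a matching .lineindex file.
    return f"{image_id}\t{feat_col}", f"{image_id}\t{json.dumps(labels)}"
```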

liutianling commented 3 years ago

@EddieKro Many thanks for your reply and the detailed steps! I will give it a try!

akkapakasaikiran commented 3 years ago

> @liutianling it's quite a process :)
>
> 1. Extract image features for a folder of images using sg_benchmark as described [here](https://github.com/microsoft/scene_graph_benchmark/issues/7#issuecomment-819357369) (you'll have to create some `.tsv` and `.lineindex` files first and edit the yaml config file). Note that you should create an empty test.label file; otherwise, inference won't work.
> 2. sg_benchmark will create a `predictions.tsv` file, from which we need the features, boxes, and the class and confidence for each box.
> 3. To run VinVL inference you'll have to create `feature.tsv`, `label.tsv`, and a `.yaml` file using the info from `predictions.tsv`. Note that to add the 6 extra features you need to know the height and width of each image, which are stored in the `hw.tsv` file required by sg_benchmark. Here's the [gist with the example code](https://gist.github.com/EddieKro/903ad08e85d670ff2b140a888d8c67c0).
>
> Note I only managed to run run_captioning.py using COCO; other tasks and datasets may require different inputs.

I needed to generate input files for run_retrieval.py from a predictions.tsv file output by test_sg_net.py of scene_graph_benchmark (a modification of step 3 above). This is a bit different from run_captioning.py, so I made a gist for it, similar to and based on the one provided by @EddieKro. The gist can be found here. Differences: labels.tsv also contains image_h and image_w and leaves out conf, and features.tsv splits the encoding and num_rows into separate columns instead of using a dictionary. No .yaml file is needed, but an image_id2idx.json file is used. I tested this on a custom dataset.
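To make the difference concrete, a hedged sketch of the two features.tsv row layouts; the column order and key names are assumptions pieced together from the descriptions in this thread:

```python
import base64
import json
import numpy as np

# Illustrative only: made-up features for 5 boxes.
feats = np.zeros((5, 2054), dtype=np.float32)
encoded = base64.b64encode(feats.tobytes()).decode("utf-8")

# Captioning-style features.tsv row: one JSON dict column.
caption_row = "\t".join(["img_001", json.dumps({"num_boxes": 5, "features": encoded})])

# Retrieval-style row per this gist: num_rows and the encoding as separate columns.
retrieval_row = "\t".join(["img_001", str(5), encoded])
```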

DavidInWuhanChina commented 3 years ago

> I needed to generate input files for run_retrieval.py from a predictions.tsv file output by test_sg_net.py of scene_graph_benchmark (a modification of step 3 above). This is a bit different from run_captioning.py, so I made a gist for it, similar to and based on the one provided by @EddieKro. The gist can be found here. Differences: labels.tsv also contains image_h and image_w and leaves out conf, and features.tsv splits the encoding and num_rows into separate columns instead of using a dictionary. No .yaml file is needed, but an image_id2idx.json file is used. I tested this on a custom dataset.

Can you show me the complete inference file?

akkapakasaikiran commented 3 years ago

> Can you show me the complete inference file?

Sorry, I'm not sure I understand what you mean. The inference file I used was oscar/run_retrieval.py.

Jennifer-6 commented 2 years ago

In order to run run_captioning.py, a train.yaml file is needed. The train.yaml file points to the required data (image features, captions, labels). Where is train.yaml, or how do I create it?

akkapakasaikiran commented 2 years ago

> In order to run run_captioning.py, a train.yaml file is needed. The train.yaml file points to the required data (image features, captions, labels). Where is train.yaml, or how do I create it?

Follow this, this, and this, in that order (they link to each other in a chain). You basically have to create the file yourself.
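For orientation, a minimal sketch of what such a yaml might contain, pieced together from the file names used earlier in this thread; the key names are an assumption, so verify them against the dataset loader in your Oscar checkout:

```yaml
# Hypothetical train.yaml; key names are assumptions, not confirmed by the repo.
img: train.feature.tsv        # base64-encoded region features (2054-dim, float32)
label: train.label.tsv        # object tags and boxes per image
caption: train_caption.json   # ground-truth captions
```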

BigHyf commented 2 years ago

> In order to run run_captioning.py, a train.yaml file is needed. The train.yaml file points to the required data (image features, captions, labels). Where is train.yaml, or how do I create it?

@akkapakasaikiran @Jennifer-6 Hello, have you solved this problem? Can you tell me the details about vinvl_x152c4.yaml?

ginlov commented 2 years ago

> @liutianling it's quite a process :)
>
> 1. Extract image features for a folder of images using sg_benchmark as described [here](https://github.com/microsoft/scene_graph_benchmark/issues/7#issuecomment-819357369) (you'll have to create some `.tsv` and `.lineindex` files first and edit the yaml config file). Note that you should create an empty test.label file; otherwise, inference won't work.
> 2. sg_benchmark will create a `predictions.tsv` file, from which we need the features, boxes, and the class and confidence for each box.
> 3. To run VinVL inference you'll have to create `feature.tsv`, `label.tsv`, and a `.yaml` file using the info from `predictions.tsv`. Note that to add the 6 extra features you need to know the height and width of each image, which are stored in the `hw.tsv` file required by sg_benchmark. Here's the [gist with the example code](https://gist.github.com/EddieKro/903ad08e85d670ff2b140a888d8c67c0).
>
> Note I only managed to run run_captioning.py using COCO; other tasks and datasets may require different inputs.

How can I organize caption.json to fine-tune on a new dataset?
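Not an authoritative answer, but COCO-style caption files are typically a JSON list of records pairing an image id with a caption string; a hedged sketch (field names assumed from the COCO format, so check how run_captioning.py actually loads captions):

```json
[
  {"image_id": "img_001", "id": 0, "caption": "a dog running on a beach"},
  {"image_id": "img_001", "id": 1, "caption": "a brown dog plays in the sand"},
  {"image_id": "img_002", "id": 2, "caption": "two people riding bicycles"}
]
```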