microsoft / Oscar

Oscar and VinVL
MIT License

Generating label.tsv and feature.tsv from image #33

Open sameerpande12 opened 4 years ago

sameerpande12 commented 4 years ago

Hi guys, I am trying to generate my own features.tsv and labels.tsv for my dataset, but I am stuck at the following:

  1. I am slightly confused about what exactly these features are. From reading the "Oscar" paper, I understand that each bounding box has a feature vector of the form (v', z), where v' is P-dimensional (2048) and z is 6-dimensional (position). I have difficulty understanding where these 2048 features come from. Initially, I thought they came from the FC layer of Faster R-CNN, but upon checking, the FC layer size in Faster R-CNN is 4096.

  2. The Oscar paper mentions: "Specifically, v and q are generated as follows. Given an image with K regions of objects (normally over-sampled and noisy), Faster R-CNN [28] is used to extract the visual semantics of each region". I am slightly confused about how these K regions are determined. Are these K image regions the bounding boxes output by Faster R-CNN?

I am relatively new to this area. Any help would be appreciated.

shravan1394 commented 4 years ago

The information is kind of dispersed across the issues, so I will summarize it here for anyone looking in the future.

The features are extracted using the bottom-up attention model from https://github.com/peteanderson80/bottom-up-attention. You need to slightly modify tools/generate_tsv.py to get label.tsv and feature.tsv. The following code must be added to that file to create the exact format of feature.tsv:

```python
box_width = boxes[:, 2] - boxes[:, 0]
box_height = boxes[:, 3] - boxes[:, 1]
scaled_width = box_width / image_width
scaled_height = box_height / image_height
scaled_x = boxes[:, 0] / image_width
scaled_y = boxes[:, 1] / image_height
scaled_width = scaled_width[..., np.newaxis]
scaled_height = scaled_height[..., np.newaxis]
scaled_x = scaled_x[..., np.newaxis]
scaled_y = scaled_y[..., np.newaxis]
spatial_features = np.concatenate(
    (scaled_x, scaled_y,
     scaled_x + scaled_width, scaled_y + scaled_height,
     scaled_width, scaled_height), axis=1)
full_features = np.concatenate((features, spatial_features), axis=1)
fea_base64 = base64.b64encode(full_features).decode('utf-8')
fea_info = {'num_boxes': boxes.shape[0], 'feature': fea_base64}
row = [image_key, json.dumps(fea_info)]
```

I am attaching the file that I used for this purpose and to generate label.tsv as well. You might have to change the code depending on your data location and format. tsv_gen.py.zip

I still had some issues with csv.DictWriter writing strings with single quotes while json.loads in run_captioning.py requires double quotes. I made modifications to run_captioning.py to make it work. If you guys have a better solution, let me know.
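For illustration, one possible reader-side workaround (a sketch, not the exact change I made to run_captioning.py) is to parse the single-quoted Python repr with ast.literal_eval and re-serialize it with json.dumps so that json.loads can consume it:

```python
import ast
import json

# str() on a dict produces a single-quoted Python repr,
# which json.loads rejects.
raw = str({'num_boxes': 3, 'feature': 'abc'})

# Parse the repr safely, then re-serialize as proper JSON.
info = ast.literal_eval(raw)
clean = json.dumps(info)  # double-quoted, json.loads-safe

assert json.loads(clean)['num_boxes'] == 3
```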

Finally, to generate label.lineidx and feature.lineidx, make use of the following function:
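(The function itself did not come through in this post. Below is a minimal sketch of such a generator, under the assumption that the .lineidx format expected by Oscar's TSV readers is one starting byte offset per line of the .tsv file.)

```python
def generate_lineidx(tsv_path: str, lineidx_path: str) -> None:
    """Write the starting byte offset of each line of tsv_path
    to lineidx_path, one offset per line.

    Sketch only: assumes the .lineidx format is a plain list of
    byte offsets, which the TSV reader seeks to when fetching rows.
    """
    with open(tsv_path, 'rb') as fin, open(lineidx_path, 'w') as fout:
        offset = 0
        for line in fin:
            fout.write(str(offset) + '\n')
            offset += len(line)
```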

sameerpande12 commented 4 years ago

Thanks!

EByrdS commented 3 years ago

@shravan1394, what is the command line you used to generate the caption after having the right features?

Also, could you share the modifications to run_captioning.py to fix the problem with json loads?

The generated label.lineidx and feature.lineidx need to be in the same folder as custom.feature.tsv and custom.label.tsv, right?

zamanmub commented 2 years ago

> The information is kind of dispersed in the issues, I will summarize it here for anyone looking in the future. [...]

After using this script to generate the feature and label tsv files, and after resolving the issue with single quotes, I received the following error:

JSONDecodeError: Expecting value: line 1 column 14 (char 13) error

I solved it by removing .decode('utf-8') from base64.b64encode(full_features).decode('utf-8') in the bottom-up-attention-based extractor script.
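For anyone debugging this step, a round-trip sketch of the base64 encoding may help confirm that the reader decodes exactly what the extractor encoded (the box count, feature width, and float32 dtype here are illustrative assumptions; match whatever your extractor actually writes):

```python
import base64

import numpy as np

# Hypothetical features for 2 boxes: 2048 visual dims + 6 spatial dims.
feats = np.random.rand(2, 2054).astype(np.float32)

# b64encode reads the array's raw byte buffer; .decode('utf-8') merely
# turns the resulting base64 bytes into a str for the TSV/JSON field.
enc = base64.b64encode(feats)

# The reader must use the same dtype and shape to recover the array.
dec = np.frombuffer(base64.b64decode(enc), dtype=np.float32).reshape(2, 2054)
assert np.allclose(feats, dec)
```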

zamanmub commented 2 years ago

@EByrdS you can convert the single quotes to double quotes following https://github.com/microsoft/Oscar/issues/49#issuecomment-797675905 or https://github.com/microsoft/Oscar/issues/49#issuecomment-966316562

Cuberick-Orion commented 2 years ago

> The information is kind of dispersed in the issues, I will summarize it here for anyone looking in the future. [...]

Thanks for the summary of information here!

To anyone wishing to extract features on custom datasets who stumbled on this thread and is potentially struggling with the Caffe environment, I'd recommend using the Docker environment built for LXMERT.

Follow the instructions to set up the environment, then rewrite the import part of the script following this (at the top of the file).