peri044 / STT

A multi-task model which does image captioning, sentence paraphrasing and cross-modal retrieval.

Preprocessing images #5

Open arunikayadav42 opened 4 years ago

arunikayadav42 commented 4 years ago

Hi @peri044 I want to train the STT network with my own data and need to preprocess the images. Can you please point me to the Python script that can help me do this?

peri044 commented 4 years ago

@arunikayadav42 Check out the extract_image_features.py script in the STT repo. It has the necessary calls for preprocessing input images. The backbone-specific preprocessor implementations are in the preprocessing directory.
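
For reference, backbone-specific preprocessing usually amounts to resizing the image to the backbone's input resolution and normalizing pixel values. Below is a minimal sketch assuming common VGG/Inception-style conventions; the target size, normalization constants and function name are illustrative, not the exact code in extract_image_features.py:

```python
# Minimal sketch of backbone-specific image preprocessing (assumed defaults,
# not the exact logic in extract_image_features.py).
import numpy as np
from PIL import Image

def preprocess_image(path, target_size=(224, 224), mode="vgg"):
    """Load an image, resize it to the backbone's input size, and normalize it."""
    img = Image.open(path).convert("RGB").resize(target_size)
    x = np.asarray(img, dtype=np.float32)

    if mode == "vgg":
        # VGG-style: convert RGB -> BGR and subtract the per-channel means.
        x = x[..., ::-1] - np.array([103.939, 116.779, 123.68], dtype=np.float32)
    else:
        # Inception-style: scale pixel values to [-1, 1].
        x = x / 127.5 - 1.0

    return x[np.newaxis, ...]  # add a batch dimension
```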

arunikayadav42 commented 4 years ago

@peri044 I have a doubt about generating the paraphrases. In the README you mention creating train_enc.txt and train_dec.txt from the captions_train2014.json file. How are those captions then mapped to the corresponding image in the train.npy features from the SCAN repository?

peri044 commented 4 years ago

@arunikayadav42 I don't remember the exact data structure details of the SCAN data, as it has been a while. The way I create the paraphrases (train_enc.txt and train_dec.txt) is here. The gist of the process: each image (with an image ID) has 5 captions, which gives 20 ordered combinations of sentences tied to that image ID. Using the same image ID, you can extract the SCAN features (downloaded from their repository) for the corresponding image and tie them to the caption combinations.
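
A minimal sketch of that pairing logic, assuming the standard COCO captions_train2014.json layout; the output formatting of train_enc.txt / train_dec.txt here is illustrative and may differ from the repo's script:

```python
# Minimal sketch: build encoder/decoder paraphrase pairs from COCO captions.
# Assumes the standard captions_train2014.json layout; the exact formatting of
# train_enc.txt / train_dec.txt in the repo may differ.
import json
from collections import defaultdict
from itertools import permutations

with open("captions_train2014.json") as f:
    anns = json.load(f)["annotations"]

captions_by_image = defaultdict(list)
for ann in anns:
    captions_by_image[ann["image_id"]].append(ann["caption"].strip())

with open("train_enc.txt", "w") as enc, open("train_dec.txt", "w") as dec:
    for image_id, caps in captions_by_image.items():
        # 5 captions per image -> 5 * 4 = 20 ordered (source, paraphrase) pairs,
        # all tied to the same image_id.
        for src, tgt in permutations(caps[:5], 2):
            enc.write(src + "\n")
            dec.write(tgt + "\n")
```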

arunikayadav42 commented 4 years ago

@peri044 So my only question is: when we have the 20 combinations and store them in the TFRecord files, each of these combinations needs to have the corresponding image feature, and all of them get stored in the TFRecord file. Isn't that right?

For instance, if the image ID is coco_train_1, then the feature from the SCAN data for this image ID will be paired with each of the 20 caption combinations for this image, right?

So at this line https://github.com/peri044/STT/blob/master/data/coco_data_loader.py#L105, should it not be (img_idx * 20, img_idx * 20 + 20) instead of (img_idx * 5, img_idx * 5 + 5)?

peri044 commented 4 years ago

Yes. The image feature (for the image id) is replicated for each of the 20 combinations of the captions.

The data loader script you linked is probably not the one I used during my experiments. Currently the data loader scripts are scattered across the data folder, and I don't remember the exact ones I used because of quick experimentation. You can refer to https://github.com/peri044/STT/blob/master/data/coco_extras/coco_feat_stt.py#L50, which writes an image feature for every sentence combination into a TFRecord. All the modules for data loading/TFRecord generation are in the data directory. They aren't well organized on a per-model basis (e.g. stt, stt-att, scan, etc.), but all the components used in the paper's experiments can be found (scattered) in the data directory.
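
A minimal sketch of that idea, replicating the image feature across its 20 caption pairs and writing one TFRecord example per pair; the feature keys, file names, and the assumed [num_images, feat_dim] layout of the SCAN train.npy are illustrative, not the repo's exact serialization:

```python
# Minimal sketch: replicate each image feature across its 20 caption pairs and
# serialize (feature, source caption, target caption) into a TFRecord.
# Feature keys and the SCAN .npy layout are assumptions, not the repo's exact format.
import numpy as np
import tensorflow as tf

scan_feats = np.load("train.npy")          # assumed shape: [num_images, feat_dim]
enc_lines = open("train_enc.txt").read().splitlines()
dec_lines = open("train_dec.txt").read().splitlines()

with tf.io.TFRecordWriter("stt_train.tfrecord") as writer:
    for img_idx in range(scan_feats.shape[0]):
        feat = scan_feats[img_idx]
        # 20 caption pairs per image, so the pair slice is
        # [img_idx * 20, img_idx * 20 + 20).
        for k in range(img_idx * 20, img_idx * 20 + 20):
            example = tf.train.Example(features=tf.train.Features(feature={
                "image_feature": tf.train.Feature(
                    float_list=tf.train.FloatList(value=feat.tolist())),
                "enc_caption": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[enc_lines[k].encode()])),
                "dec_caption": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[dec_lines[k].encode()])),
            }))
            writer.write(example.SerializeToString())
```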