stanfordnlp / mac-network

Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)
Apache License 2.0

About MAC on GQA-like images #36

Closed marcozov closed 5 years ago

marcozov commented 5 years ago

Hello,

I would like to run the model on images that are not in the GQA dataset, but as if they were in GQA (basically I just want to replace some images of the dataset with other images and keep asking the same questions). To run the model on GQA I simply followed the instructions on the GQA branch, which consist of downloading the spatial features and the object features and then merging them.

But how do I extract those features from other images? I saw the extract_features.py script, but I don't fully understand how to use it to extract both spatial and object features. And what about the other parameters (image_height, image_width, model_stage, batch_size)? What should I use to extract features in the same way as the ones you generated and made available for download?

Thanks in advance.

dorarad commented 5 years ago

Hi, it should definitely be possible to extend the network to work on new images, and indeed you'll have to extract features first (either spatial or object-based, or both).

You'll need to know the specification of the images you have, e.g. their height and width. The model stage is the stage of the pretrained PyTorch ResNet you'd like to use to extract the features (4 by default). You can choose any batch size you'd like based on the size of your GPU.

The code goes over all the .png images in the image directory (--input_image_dir): https://github.com/stanfordnlp/mac-network/blob/gqa/extract_features.py#L69, so all you'll need to do is put the additional images in the directory you use for that flag and run it, and it should go smoothly.
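
To illustrate the idea, here is a rough sketch of what that spatial pipeline boils down to (not the actual extract_features.py; the output file name, the ImageNet preprocessing, and the stage-to-layer mapping below are illustrative assumptions):

```python
import glob
import os

import h5py
import numpy as np
import torch
import torchvision
from PIL import Image

# Illustrative sketch only -- see extract_features.py for the real script.
IMAGE_DIR = "images/"      # corresponds to --input_image_dir
HEIGHT, WIDTH = 224, 224   # corresponds to image_height / image_width
MODEL_STAGE = 4            # corresponds to model_stage
BATCH_SIZE = 32            # corresponds to batch_size

device = "cuda" if torch.cuda.is_available() else "cpu"
resnet = torchvision.models.resnet101(pretrained=True).to(device).eval()
# Keep everything up to and including the requested ResNet stage (layer1..layer4).
stages = [resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
          resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4]
extractor = torch.nn.Sequential(*stages[:4 + MODEL_STAGE])

# Standard ImageNet normalization (an assumption here, not taken from the script).
mean = np.array([0.485, 0.456, 0.406]).reshape(1, 3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(1, 3, 1, 1)

def load_batch(paths):
    imgs = []
    for p in paths:
        img = Image.open(p).convert("RGB").resize((WIDTH, HEIGHT))
        imgs.append(np.asarray(img).transpose(2, 0, 1))  # HWC -> CHW
    x = (np.stack(imgs).astype(np.float32) / 255.0 - mean) / std
    return torch.from_numpy(x.astype(np.float32)).to(device)

paths = sorted(glob.glob(os.path.join(IMAGE_DIR, "*.png")))
feats = []
with torch.no_grad():
    for i in range(0, len(paths), BATCH_SIZE):
        feats.append(extractor(load_batch(paths[i:i + BATCH_SIZE])).cpu().numpy())

with h5py.File("features.h5", "w") as f:  # output name is just a placeholder
    f.create_dataset("features", data=np.concatenate(feats))  # (N, 2048, 7, 7) at stage 4
```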

Note that this code extracts only spatial features. For object-based features you'll need to run a separate object detector, as in https://github.com/facebookresearch/Detectron, on your images; once you have the extracted features you'll be able to run MAC on them (similarly to how it works currently on GQA).
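
Just to illustrate how detector output typically gets packed for the model (a generic sketch, not the exact GQA file layout): detectors return a variable number of boxes per image, so the features are usually padded or truncated to a fixed maximum (GQA uses up to 100 objects per image) and stored together with the true object count:

```python
import numpy as np

MAX_OBJECTS = 100   # GQA uses up to 100 objects per image
FEAT_DIM = 2048     # typical bottom-up / Detectron feature dimension

def pad_objects(features, boxes):
    """Pad/truncate per-image detector output to a fixed number of slots.

    features: (num_boxes, FEAT_DIM) array, boxes: (num_boxes, 4) array.
    Returns fixed-size arrays plus the real object count, so the model
    can mask out the padded slots.
    """
    num = min(len(features), MAX_OBJECTS)
    feat_out = np.zeros((MAX_OBJECTS, FEAT_DIM), dtype=np.float32)
    box_out = np.zeros((MAX_OBJECTS, 4), dtype=np.float32)
    feat_out[:num] = features[:num]
    box_out[:num] = boxes[:num]
    return feat_out, box_out, num

# Example: an image with 37 detected objects.
feats, bboxes, n = pad_objects(np.random.rand(37, FEAT_DIM), np.random.rand(37, 4))
print(feats.shape, bboxes.shape, n)  # (100, 2048) (100, 4) 37
```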

Please let me know if you have any other questions! :)

marcozov commented 5 years ago

Thanks for your reply!

The images may have different sizes, as happens in GQA: do those parameters refer to the size the images are resized to? Also, I would like to use exactly the same setup as the one used to produce the feature files available on the website (https://cs.stanford.edu/people/dorarad/gqa/download.html): which object extractor did you use exactly? Was it pre-trained on COCO or ImageNet?

Thank you again.

dorarad commented 5 years ago

No problem! Regarding the image height/width, you're correct: the height and width flags are actually the dimensions after resizing (https://github.com/stanfordnlp/mac-network/blob/master/extract_features.py#L90), not the original ones (I believe resizing all images to a fixed size before extracting features from them is one of the common approaches).

For the object detector: I used https://github.com/peteanderson80/bottom-up-attention trained on all the images/scene graphs in the GQA training set.

marcozov commented 5 years ago

Thanks.

Sorry if I insist, but:

  1. What exact dimensions did you use for the resize?
  2. Do you have the weights saved anywhere, so that I could avoid re-training the model from scratch? If not, there are several degrees of freedom in the procedure: do you have the code that was used for training?
  3. Did you convert the GQA annotations (scene graphs) to the Visual Genome format, or did you use Visual Genome directly? As far as I understood, the GQA scene graphs are taken from Visual Genome: did you just split the latter dataset according to the GQA train/validation split?

Thank you again.

dorarad commented 5 years ago

Hi, happy to answer any questions!

  1. I used 224 (it's also set as the default).
  2. For the spatial features you can get the weights directly from PyTorch (https://github.com/stanfordnlp/mac-network/blob/master/extract_features.py#L34); I used the standard pretrained ResNet provided as part of torchvision. For the object-based features I haven't provided weights for the model yet, but I've gotten several requests for them, so I'm planning to do that. The code I used for training was https://github.com/peteanderson80/bottom-up-attention, trained on the images + scene graphs in the training set of GQA.
  3. The GQA scene graphs are a significantly cleaner version of Visual Genome; in particular, they are defined over a closed ontology (consolidating synonyms and reducing a lot of the noise). After that, they were also split into train/val. All the details can be found in the paper; in particular, page 4 of https://arxiv.org/pdf/1902.09506.pdf is most relevant.

Please let me know if you have further questions!

marcozov commented 5 years ago

Thanks for the answer. I really hope you will make the weights available, because I have already tried training other object detectors on scene graphs without achieving any positive results.

dorarad commented 5 years ago

It might take some time, but in the meantime: I used exactly the same code as https://github.com/peteanderson80/bottom-up-attention, and the only change I made was the training set itself. I changed this list: https://github.com/peteanderson80/bottom-up-attention/blob/master/data/genome/train.txt to include only the IDs of images from the GQA training set (~70k out of the original 110k). Same parameters and everything. I trained for about 5 days on 4 Titan X GPUs in parallel and then extracted the features using https://github.com/peteanderson80/bottom-up-attention/blob/master/tools/generate_tsv.py, which generates a TSV file of all the features. I then saved them as-is, just in an H5 format instead (changing only the file format from TSV to H5 to comply with my code; the features were kept fully identical). Hope this helps in the meantime! I will let you know when the weights are released!
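
For reference, that kind of format conversion is only a few lines. Here is a minimal sketch, assuming the standard generate_tsv.py columns (where boxes and features are base64-encoded float32 arrays) and using a placeholder per-image HDF5 layout rather than the exact layout the GQA code expects:

```python
import base64
import csv

import h5py
import numpy as np

# Columns written by bottom-up-attention's tools/generate_tsv.py; the boxes and
# features fields are base64-encoded float32 arrays.
FIELDNAMES = ["image_id", "image_w", "image_h", "num_boxes", "boxes", "features"]

csv.field_size_limit(2**31 - 1)  # TSV rows are huge because of the base64 blobs

def tsv_to_h5(tsv_path, h5_path, feat_dim=2048):
    """Re-save generate_tsv.py output as one HDF5 group per image, features unchanged."""
    with open(tsv_path) as f, h5py.File(h5_path, "w") as out:
        for row in csv.DictReader(f, delimiter="\t", fieldnames=FIELDNAMES):
            n = int(row["num_boxes"])
            boxes = np.frombuffer(base64.b64decode(row["boxes"]),
                                  dtype=np.float32).reshape(n, 4)
            feats = np.frombuffer(base64.b64decode(row["features"]),
                                  dtype=np.float32).reshape(n, feat_dim)
            grp = out.create_group(str(row["image_id"]))
            grp.create_dataset("features", data=feats)  # (num_boxes, 2048)
            grp.create_dataset("bboxes", data=boxes)    # (num_boxes, 4)
            grp.attrs["image_w"] = int(row["image_w"])
            grp.attrs["image_h"] = int(row["image_h"])

tsv_to_h5("gqa_objects.tsv", "gqa_objects.h5")  # file names are placeholders
```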

marcozov commented 5 years ago

Thank you very much! Last question: what performance do you obtain with bottom-up-attention for object detection on the GQA dataset? I guess you measured the performance on the validation split. Do you have some numbers (for instance, the mAP)?

dorarad commented 5 years ago

Hi, I ran their evaluation script once back in November; I remember the object-detection numbers were quite low, but I don't have precise numbers currently :/ However, I don't think mAP scores are a good indicator of how useful the features are for a VQA task, since there are many closely related objects (say, a table and a desk), and even if the object detector doesn't manage to distinguish between them with high accuracy, that won't necessarily affect the VQA end task.

thaolmk54 commented 3 years ago

Hi @dorarad, thanks for the great repo. Do you have any update on releasing pretrained weights for object detection? It would be great if you could share them.

dorarad commented 3 years ago

Hi Thao, Marco, thanks a lot for the interest. Unfortunately there's no update yet about releasing the weights; I'm having some trouble accessing some of my older files, but I hope to resolve it.
