krasserm / fairseq-image-captioning

Transformer-based image captioning extension for pytorch/fairseq
Apache License 2.0
316 stars 56 forks

Use Faster-RCNN directly #9

Open adrelino opened 4 years ago

adrelino commented 4 years ago

Following up on https://github.com/pytorch/fairseq/issues/759#issuecomment-589498214, it would be great if Faster-RCNN could be used directly, so we could input images instead of pre-computed features from MS-COCO. Regarding the specific Faster-RCNN PyTorch implementation, torchvision only provides it with a resnet-50 backbone, while detectron2 also has a resnet-101 backbone.
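For reference, both variants are available off the shelf; a minimal sketch of loading them (the detectron2 config name below is just the standard model-zoo entry, used here as an example, not anything specific to this project):

```python
import torch
import torchvision
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# torchvision: Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO.
frcnn_r50 = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
frcnn_r50.eval()

# detectron2: Faster R-CNN with a ResNet-101 FPN backbone from the model zoo.
CONFIG = "COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(CONFIG))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(CONFIG)
cfg.MODEL.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
predictor = DefaultPredictor(cfg)  # takes a BGR numpy image, returns detected instances
```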

This is similar to the feature extractor used in bottom-up-attention; however, there are some differences between the original Caffe ResNet-101/faster_rcnn model and the one in detectron2.

The project airsplay/py-bottom-up-attention tries to address these differences. In adrelino/py-bottom-up-attention-extracted I am currently trying to split it apart from detectron2, which it is a fork of. However, I am not sure whether the original pre-trained model weights can actually be loaded (see https://github.com/airsplay/py-bottom-up-attention/issues/1#issue-570188724).

How important do you think it is to re-generate the features using exactly the same weights as in bottom-up-attention?

krasserm commented 4 years ago

Thanks for your efforts on this @adrelino, getting it integrated into this project would be a great addition! I don't think it's critical to have exactly the same weights as long as another pre-trained model gives results similar to bottom-up-attention.

A few months ago, on branch wip-train-inception, I started to work on a modification that integrates an Inception-V3 feature extractor into the captioning model instead of using pre-computed features. Code on this branch jointly trains the feature extractor with the captioning model, but there should also be an option to freeze it during training. Unfortunately, this is based on a rather old version of this project and probably needs some rewriting, but I think a Faster-RCNN could be integrated along a similar path. Open to alternative suggestions, of course.
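To illustrate the freezing option in isolation, here is a generic PyTorch sketch (not the code on wip-train-inception):

```python
import torch
from torchvision.models import inception_v3


def freeze(module: torch.nn.Module) -> None:
    """Exclude a module from gradient updates and fix its batch-norm statistics."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()


feature_extractor = inception_v3(pretrained=True)
freeze(feature_extractor)

# Only parameters that still require gradients (e.g. the captioning head)
# would be passed to the optimizer; here we just confirm the CNN is frozen.
assert not any(p.requires_grad for p in feature_extractor.parameters())
```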

In order to find out whether a Faster-RCNN with (slightly) different weights leads to similar captioning results, we could start by pre-computing features in a pre-processing step and, if the results look good, move on to integrating the Faster-RCNN model directly into the captioning model. WDYT?

adrelino commented 4 years ago

> Thanks for your efforts on this @adrelino, getting it integrated into this project would be a great addition! I don't think it's critical to have exactly the same weights as long as another pre-trained model gives results similar to bottom-up-attention.

Great, because it is quite hard to recreate exactly the same features: https://github.com/airsplay/py-bottom-up-attention/issues/1#issuecomment-591737188. Also, it is now more common to use an FPN backbone instead of a C4 one: https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md#common-settings-for-coco-models.

> A few months ago, on branch wip-train-inception, I started to work on a modification that integrates an Inception-V3 feature extractor into the captioning model instead of using pre-computed features. Code on this branch jointly trains the feature extractor with the captioning model, but there should also be an option to freeze it during training. Unfortunately, this is based on a rather old version of this project and probably needs some rewriting, but I think a Faster-RCNN could be integrated along a similar path. Open to alternative suggestions, of course.

Yes, I saw that branch. It looks like in https://github.com/krasserm/fairseq-image-captioning/blob/b3f206078151939d13e251747cda3956cbc65c04/model/inception.py#L83-L85 you had to copy the forward method of torchvision's Inception up to just before the final pooling layer at https://github.com/pytorch/vision/blob/2f433e0a4233b92627465d8317b40adf10c2ad9d/torchvision/models/inception.py#L169-L172.
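Roughly, the copied part amounts to something like this (a sketch based on torchvision's Inception-V3 module layout, not the actual code on the branch; module names vary slightly across torchvision versions):

```python
import torch
import torch.nn.functional as F
from torchvision.models import inception_v3


class InceptionFeatures(torch.nn.Module):
    """Inception-V3 truncated just before the final average pooling.

    For a (B, 3, 299, 299) input this yields a (B, 2048, 8, 8) feature map.
    Pooling is applied functionally so the sketch works across torchvision versions.
    """

    def __init__(self):
        super().__init__()
        self.inception = inception_v3(pretrained=True)

    def forward(self, x):
        m = self.inception
        x = m.Conv2d_1a_3x3(x)
        x = m.Conv2d_2a_3x3(x)
        x = m.Conv2d_2b_3x3(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = m.Conv2d_3b_1x1(x)
        x = m.Conv2d_4a_3x3(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = m.Mixed_5b(x)
        x = m.Mixed_5c(x)
        x = m.Mixed_5d(x)
        x = m.Mixed_6a(x)
        x = m.Mixed_6b(x)
        x = m.Mixed_6c(x)
        x = m.Mixed_6d(x)
        x = m.Mixed_6e(x)
        x = m.Mixed_7a(x)
        x = m.Mixed_7b(x)
        x = m.Mixed_7c(x)
        return x  # stop before avg-pooling, dropout and the fc classifier
```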

In detectron2, the forward method of the R-CNN is split into multiple function calls that can also be invoked manually one after the other, so much less code needs to be copied to partially execute the model.

> In order to find out whether a Faster-RCNN with (slightly) different weights leads to similar captioning results, we could start by pre-computing features in a pre-processing step and, if the results look good, move on to integrating the Faster-RCNN model directly into the captioning model. WDYT?

Sounds like a good plan.

  1. To pre-compute the features, it is possible to partially execute a model without modifying it (https://detectron2.readthedocs.io/tutorials/models.html#partially-execute-a-model); see the sketch after this list.
  2. To integrate the model directly and be able to train/fine-tune it, one has to modify and register the final roi_head so that it returns the features along with the instances (https://detectron2.readthedocs.io/tutorials/write-models.html#write-models).
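A minimal sketch of option 1, following the partial-execution pattern from the detectron2 tutorial (the image path is a placeholder, and the roi_heads attribute names can differ slightly between detectron2 versions):

```python
import cv2
import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.modeling import build_model
from detectron2.checkpoint import DetectionCheckpointer

CONFIG = "COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(CONFIG))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(CONFIG)
cfg.MODEL.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model = build_model(cfg)
DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)
model.eval()

# Placeholder input image (BGR, as detectron2 expects by default).
img = cv2.imread("example.jpg")
height, width = img.shape[:2]
image = torch.as_tensor(img.astype("float32").transpose(2, 0, 1))
inputs = [{"image": image, "height": height, "width": width}]

with torch.no_grad():
    images = model.preprocess_image(inputs)            # normalization + batching
    features = model.backbone(images.tensor)           # FPN feature maps
    proposals, _ = model.proposal_generator(images, features)

    # Pool per-region features and run them through the box head;
    # older detectron2 versions name `box_in_features` simply `in_features`.
    box_features = model.roi_heads.box_pooler(
        [features[f] for f in model.roi_heads.box_in_features],
        [p.proposal_boxes for p in proposals])
    box_features = model.roi_heads.box_head(box_features)  # (num_proposals, 1024)
```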

I just gathered some experience in doing both and will make a pull request once it is ready. I would suggest just using the detectron2 default R101-FPN trained on COCO train2017 and evaluated on val2017 as a start: https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md#common-settings-for-coco-models.

krasserm commented 4 years ago

Sounds great! :+1: on using the detectron2 model with the R101-FPN backbone, but I'm not sure if pre-training on MS-COCO train2017 will work in combination with this project. Here, Karpathy splits are used, which re-partition the MS-COCO train2014/val2014 dataset into a train/valid/test set. If the MS-COCO train2017 set is a subset of the Karpathy train set then we can use the pre-trained Faster-RCNN as-is; otherwise, we would leak examples that are used for captioning model validation and testing into the Faster-RCNN training set. Do you have a rough estimate of the effort of re-training the Faster-RCNN on the Karpathy train set?
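One way to check the subset question empirically would be to intersect the image ids of COCO train2017 with the Karpathy valid/test splits; a rough sketch, assuming Karpathy's dataset_coco.json and the COCO 2017 annotations are available locally (the paths are placeholders):

```python
import json

# Placeholder paths: COCO 2017 annotations and Karpathy's dataset_coco.json split file.
with open("annotations/instances_train2017.json") as f:
    train2017_ids = {img["id"] for img in json.load(f)["images"]}

with open("dataset_coco.json") as f:
    karpathy_images = json.load(f)["images"]

# Karpathy entries store the original COCO image id in "cocoid" and the split in "split".
heldout_ids = {img["cocoid"] for img in karpathy_images
               if img["split"] in ("val", "test")}

overlap = train2017_ids & heldout_ids
print(f"{len(overlap)} of {len(heldout_ids)} Karpathy valid/test images "
      f"are in the COCO train2017 detection training set")
```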

ruotianluo commented 4 years ago

I think https://gitlab.com/vedanuj/vqa-maskrcnn-benchmark is what you are looking for. Also, the bottom-up features are trained on Visual Genome with an attribute head. From talking with other people, the attribute head is important: if you train a Faster-RCNN on COCO only, it won't work as well, as I was told (disclaimer here).

krasserm commented 4 years ago

Yes, exactly! Thanks for the pointer @ruotianluo, and your hint regarding the importance of training with the attribute head.

Kyubyong commented 4 years ago

Hi @krasserm, thanks for this project! I wonder if you have any plans to apply the Faster-RCNN directly so we can input an image rather than its pre-computed features.

krasserm commented 4 years ago

@Kyubyong not sure about @adrelino's plans/progress here. I'm still interested in this feature but have other priorities at the moment. I think what @ruotianluo describes is the way to go forward.