krasserm / fairseq-image-captioning

Transformer-based image captioning extension for pytorch/fairseq
Apache License 2.0

What about using a ResNet-152 to extract features instead of the Faster R-CNN? #17

Open Kyubyong opened 4 years ago

Kyubyong commented 4 years ago

Hi, I wonder if we can use features extracted from a ResNet-152 model instead of the Faster R-CNN, since the former is easier to implement.
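
For context, grid-feature extraction from a ResNet-152 is indeed only a few lines with torchvision. Below is a minimal sketch, not part of this repo; the image path, the 224×224 input size, and the flattening of the 7×7 grid into a sequence are illustrative assumptions:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load an ImageNet-pretrained ResNet-152 and drop the average-pool and
# classification layers, keeping everything up to the last conv block.
resnet = models.resnet152(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2])
feature_extractor.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")   # illustrative path
batch = preprocess(image).unsqueeze(0)             # (1, 3, 224, 224)

with torch.no_grad():
    fmap = feature_extractor(batch)                # (1, 2048, 7, 7)
    # Flatten the spatial grid into a sequence of 49 feature vectors,
    # analogous to the grid features a captioning model would consume.
    features = fmap.flatten(2).transpose(1, 2)     # (1, 49, 2048)
```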

Kyubyong commented 4 years ago

Oh, never mind. Now I see there is an option for InceptionNet. Then were the pretrained models (checkpoint 20, 24) trained using the `--features grid` option or the `--features obj` option?

krasserm commented 4 years ago

They were trained as described in https://github.com/krasserm/fairseq-image-captioning/blob/master/README.md#training

Kyubyong commented 4 years ago

Thanks for your confirmation. Have you checked the performance of the pretrained model provided in https://github.com/krasserm/fairseq-image-captioning/tree/wip-train-inception? I'm curious how good the grid-based model is compared to the object-based model.

krasserm commented 4 years ago

The object-based model is significantly better, but when I trained the grid-based model a long time ago I didn't really tune the hyper-parameters. So it may be worth re-training it with hyper-parameters similar to those used for object-based training (learning rate, warmup, ...), at least as a starting point. On the other hand, most image captioning papers report object-based approaches to be superior to grid-based approaches.
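
For reference, fairseq's standard `inverse_sqrt` learning-rate scheduler behaves as sketched below (whether the grid-based run used this scheduler is an assumption here); it shows what tuning the learning rate and warmup concretely changes, and all numbers are illustrative:

```python
def inverse_sqrt_lr(step, base_lr=3e-4, warmup_updates=8000, warmup_init_lr=1e-7):
    """Linear warmup to base_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_updates:
        # linear warmup from warmup_init_lr up to base_lr
        return warmup_init_lr + (base_lr - warmup_init_lr) * step / warmup_updates
    # decay chosen so that lr == base_lr exactly at the end of warmup
    return base_lr * (warmup_updates ** 0.5) / (step ** 0.5)

# e.g. lr mid-warmup, at the warmup peak, and after 4x more updates
print(inverse_sqrt_lr(4000), inverse_sqrt_lr(8000), inverse_sqrt_lr(32000))
```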