microsoft / Oscar

Oscar and VinVL

Pre-training for image captioning #14

Closed fawazsammani closed 4 years ago

fawazsammani commented 4 years ago

Hello, and congrats on your brilliant work! I'd like to ask: for image captioning, you mention in the appendix:

we directly fine-tune Oscar for image captioning on COCO without additional pre-training on Conceptual Captions

Does that mean you only use the COCO dataset for pre-training, and not the rest (SBU, Flickr, GQA)? And is the CIDEr score of 1.4 achieved after fine-tuning the model pre-trained only on COCO?

xiyinmsu commented 4 years ago

Hi, we used the Oscar pre-trained model (pre-trained on 6.5M image-text pairs) and fine-tuned it on COCO for image captioning.

Image captioning uses a seq2seq attention mask, which is different from the full attention mask used in Oscar pre-training. If we conducted additional pre-training on CC (Conceptual Captions) with the seq2seq attention mask, it might further improve performance. However, we directly fine-tune Oscar without additional pre-training, which shows the generalization ability of Oscar for both understanding and generation tasks.

Hope this answers your question.
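For anyone wondering how the two masks differ, here is a minimal sketch. It assumes a simplified sequence layout of [caption tokens | object tags | region features] with made-up sizes, and is meant to illustrate the idea rather than reproduce Oscar's actual implementation:

```python
import torch

# Illustrative only: sizes and sequence layout are assumptions, not Oscar's config.
# Assume the input sequence is [caption tokens | object tags | region features].
num_caption, num_tags, num_regions = 4, 3, 5
seq_len = num_caption + num_tags + num_regions

# Full (bidirectional) attention mask used in pre-training:
# every position may attend to every other position.
full_mask = torch.ones(seq_len, seq_len)

# Seq2seq attention mask used for captioning fine-tuning:
# - caption tokens attend causally to earlier caption tokens,
#   and to all tags and region features;
# - tags and region features attend to each other, but not to caption tokens.
seq2seq_mask = torch.zeros(seq_len, seq_len)
seq2seq_mask[:num_caption, :num_caption] = torch.tril(
    torch.ones(num_caption, num_caption)
)
seq2seq_mask[:num_caption, num_caption:] = 1.0
seq2seq_mask[num_caption:, num_caption:] = 1.0

print(full_mask)
print(seq2seq_mask)
```

The key point is the lower-triangular block over the caption tokens: during generation, each caption token can only condition on previously generated words, while the tags and image regions remain fully visible.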

fawazsammani commented 4 years ago

@xiyinmsu thanks for your reply. I see...so the big improvement in image captioning scores may be due to the additional datasets (other than COCO) used to pre-train the model?

xiyinmsu commented 4 years ago

The large-scale pre-training data and the pre-training scheme (adding object tags, the loss functions) are both helpful for the performance improvement.
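For context, the pre-training objective described in the Oscar paper combines a masked token loss over the word/tag tokens with a contrastive loss on the (word, tag, region) triples:

$$\mathcal{L}_{\text{Pre-training}} = \mathcal{L}_{\text{MTL}} + \mathcal{L}_{\text{C}}$$

where $\mathcal{L}_{\text{MTL}}$ masks and predicts tokens in the word/tag sequence conditioned on the remaining context and the image regions, and $\mathcal{L}_{\text{C}}$ trains a binary classifier on the [CLS] output to detect whether the tag sequence has been replaced by a mismatched ("polluted") one.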

fawazsammani commented 4 years ago

@xiyinmsu thanks for your reply. I'm just asking in order to know the performance when trained only on COCO, so it can be compared with image captioning models that don't use vision-language pre-training (and only use the COCO dataset).

Again, beautiful work!