Closed fawazsammani closed 4 years ago
Hi, we used the Oscar pre-trained model (on 6.5M image-text pairs) to fine-tune on COCO for image captioning.
Image captioning uses a seq2seq attention mask, which is different from the full attention mask used in Oscar pre-training. If we conducted additional pre-training on CC with the seq2seq attention mask, it might further improve performance. However, we directly fine-tune Oscar without additional pre-training, which shows the generalization ability of Oscar for both understanding tasks and generation tasks.
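For anyone unfamiliar with the two mask types: below is a minimal sketch of the difference (my own illustration in numpy, not Oscar's actual implementation; the split into "context" tokens such as image regions/tags and "caption" tokens is assumed).

```python
import numpy as np

def full_attention_mask(n_tokens):
    # Bidirectional mask: every token attends to every other token,
    # as in understanding-style pre-training.
    return np.ones((n_tokens, n_tokens), dtype=np.int64)

def seq2seq_attention_mask(n_context, n_caption):
    # Seq2seq mask for generation: all tokens attend to the context
    # (image regions + tags), while caption tokens additionally attend
    # only to earlier caption tokens (causal); context tokens never
    # attend to caption tokens.
    n = n_context + n_caption
    mask = np.zeros((n, n), dtype=np.int64)
    mask[:, :n_context] = 1  # every row can see the context block
    causal = np.tril(np.ones((n_caption, n_caption), dtype=np.int64))
    mask[n_context:, n_context:] = causal  # causal within the caption
    return mask

m = seq2seq_attention_mask(2, 3)
```

Here `m[i, j] == 1` means token i may attend to token j; fine-tuning with this mask while the pre-training used a full mask is exactly the mismatch discussed above.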
Hope this answers your question.
@xiyinmsu thanks for your reply. I see...so the big improvement in image captioning scores may be due to the additional datasets (other than COCO) used to pre-train the model?
Large-scale pre-training data and the pre-training scheme (adding tags, loss functions) both contribute to the performance improvement.
@xiyinmsu thanks for your reply. I'm asking about the performance when trained only on COCO, so it can be compared with image captioning models that don't use VL pre-training (and use only the COCO dataset).
Again, beautiful work!
Hello, and congrats on your brilliant work! I'd like to ask: for image captioning, you mention in the appendix:
Does that mean you only use the COCO dataset for pre-training, and not the rest (SBU, Flickr, GQA)? And is the CIDEr score of 1.4 achieved after fine-tuning the COCO-only pre-trained model?