microsoft / Oscar

Oscar and VinVL

Some doubts about the contrastive loss and the output of BertImgForPreTraining #190

Open SZhanZ opened 2 years ago

SZhanZ commented 2 years ago

Hi Oscar Team,

I have read your excellent paper several times and am interested in the 'contrastive loss' mentioned in it, but I can't find it in the source code. (1) Specifically, I noticed that the model used in run_oscarplus_pretrained.py is BertImgForPreTraining, so I assume this is the model class used for pretraining. However, the code of this class looks very similar to BERT's (it gets sequence_output and pooled_output from the encoder, then passes them through BertPreTrainingHeads to get prediction_scores and seq_relationship_score). The only difference seems to be that BertImgForPreTraining supports image input while BERT does not.

[image]

Because BERT only has the masked token loss and the two classes are so similar, I can't find where the contrastive loss is implemented.
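To make point (1) concrete, here is roughly how I read the forward pass, as a simplified PyTorch sketch of my own understanding (not the repo code; `BertImgForPreTrainingSketch` is my name, and the encoder and heads are passed in as stand-ins for the real `BertImgModel` and `BertPreTrainingHeads` classes):

```python
import torch.nn as nn

# Simplified sketch of how I read BertImgForPreTraining's forward pass.
# This is NOT the repo code, just my mental model of it.
class BertImgForPreTrainingSketch(nn.Module):
    def __init__(self, bert_img_model, pretraining_heads):
        super().__init__()
        self.bert = bert_img_model    # encoder over word tokens, object tags, region features
        self.cls = pretraining_heads  # MLM head + sequence-relationship head, as in BERT

    def forward(self, input_ids, img_feats, attention_mask=None):
        # Encoder consumes the text and the image regions together
        sequence_output, pooled_output = self.bert(
            input_ids, attention_mask=attention_mask, img_feats=img_feats)
        # Same two outputs as BERT: token-level vocab logits and a 2-way [CLS] score
        prediction_scores, seq_relationship_score = self.cls(
            sequence_output, pooled_output)
        return prediction_scores, seq_relationship_score
```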

(2) If the output of BertImgForPreTraining is just like BERT's, it seems it could only handle language tasks, yet it is a VLP model class. Since its pretraining judges whether the object tags have been changed in order to optimize the contrastive loss, I think its output should reflect image-text alignment to some degree. I want to know which output, or which model class, I should use to measure that.

[images]
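To check that I understand what "object tags are changed" means, here is a toy sketch of how I imagine a polluted pair could be constructed (purely illustrative; `maybe_pollute`, `tag_pool` and the dict keys are hypothetical names, not taken from the repo's data pipeline):

```python
import random

def maybe_pollute(example, tag_pool, pollute_prob=0.5):
    """Toy illustration only: with some probability, swap an example's object
    tags for tags sampled from a different image, and record a binary label.
    `example` is a dict with 'caption', 'tags', 'img_feats'; `tag_pool` is a
    list of tag sequences taken from other images (all hypothetical names)."""
    example = dict(example)
    if random.random() < pollute_prob:
        example['tags'] = random.choice(tag_pool)  # tags no longer match the image
        example['is_polluted'] = 1
    else:
        example['is_polluted'] = 0
    return example
```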

In the paper you mention 'apply a fully-connected (FC) layer on the top of [CLS] as a binary classifier to predict whether the pair is polluted'. The only binary classifier I can find is in ImageBertForSequenceClassification, but that class is used for Image-Text Retrieval and NLVR rather than pretraining, which puzzles me a lot.
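For reference, my reading of that sentence is something like the following toy head (my own sketch, assuming a 2-way cross-entropy over the pooled [CLS] vector; `PollutedPairHead` and `is_polluted_labels` are names I made up, not identifiers from the repo):

```python
import torch.nn as nn

# Toy version of "an FC layer on top of [CLS] as a binary classifier
# to predict whether the pair is polluted" (my sketch, not Oscar code).
class PollutedPairHead(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # 0 = matched, 1 = polluted

    def forward(self, pooled_cls, is_polluted_labels=None):
        logits = self.classifier(pooled_cls)  # shape (batch, 2)
        if is_polluted_labels is None:
            return logits
        loss = nn.CrossEntropyLoss()(logits, is_polluted_labels)
        return logits, loss
```

Is the binary classifier described in the paper something like this, and if so, where does it live in the pretraining code?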

SZhanZ commented 2 years ago

I want to know where the contrastive loss is implemented and how to demonstrate the image-text alignment ability of the pretrained model.

Thanks in advance~