In CLIP-ViP, what are the results of OFA captions + HD-VILA-10M?

microsoft / XPretrain

Multi-modality pre-training

In CLIP-ViP, what are the results of OFA captions + HD-VILA-10M? #10

Closed SCZwangxiao closed 1 year ago

SCZwangxiao commented 1 year ago

Hi, thank you so much for the project. I wonder what the results of OFA captions + HD-VILA-10M are?

HellwayXue commented 1 year ago

We did not run this experiment. If the captions are generated only for the 10M data, the combination does not really address the imbalance between CLIP's training data and the post-pretraining data. We would therefore expect an "overfitting" phenomenon similar to Fig. 1.