Closed by SCZwangxiao 1 year ago
We did not conduct this experiment. If the captions are generated from the 10M data, the combination does not really tackle the imbalance between CLIP's training data and the post-pretraining data. As a result, we would expect an "overfitting" phenomenon similar to the one in Fig. 1.
Hi, thank you so much for the project. I wonder what the results of OFA captions + HD-VILA-10M are?