jayleicn / singularity

[ACL 2023] Official PyTorch code for Singularity model in "Revealing Single Frame Bias for Video-and-Language Learning"
https://arxiv.org/abs/2206.03428
MIT License

Could there be an unfair comparison? #21

Closed SCZwangxiao closed 1 year ago

SCZwangxiao commented 1 year ago

Thanks for the great work and open source. "Single frame bias" is an interesting phenomenon.

However, I am quite confused by the experimental settings on the training datasets. Concretely, in the paper you claimed:

a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training.

For a fair comparison, you should have applied the same video datasets to all models, with your model using the single-frame training strategy while the others remain unchanged. In your paper, the baselines are trained on diverse datasets, and many of them do not use high-quality image-text data.

Also, #PT may not be a fair measure of data abundance, because pairs in image-text datasets can be regarded as independent, while those in video-text datasets cannot.
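(For context, the single-frame strategy discussed here samples one random frame per video at each training step and only aggregates multiple frames at inference time. A minimal illustrative sketch, with hypothetical function names rather than the repository's actual code:)

```python
import random
import torch

def sample_training_frame(video_frames: torch.Tensor) -> torch.Tensor:
    """Single-frame training: randomly pick one frame from a (T, C, H, W) clip.

    Illustrative only -- not the repository's actual sampling code.
    """
    t = random.randrange(video_frames.shape[0])
    return video_frames[t]  # (C, H, W)

def sample_inference_frames(video_frames: torch.Tensor, num_frames: int = 4) -> torch.Tensor:
    """Multi-frame inference: uniformly sample several frames, whose
    predictions the model can then aggregate (e.g., by mean pooling)."""
    num_total = video_frames.shape[0]
    idx = torch.linspace(0, num_total - 1, num_frames).long()
    return video_frames[idx]  # (num_frames, C, H, W)
```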

jayleicn commented 1 year ago

Hi @SCZwangxiao,

Thanks for your question!

For a fair comparison, you should have applied the same video datasets to all models, with your model using the single-frame training strategy while the others remain unchanged. In your paper, the baselines are trained on diverse datasets, and many of them do not use high-quality image-text data.

Yes, ideally, for a fair comparison, we would need to evaluate all methods with the same pre-training datasets, but this is often not feasible, as many methods do not release their code and/or training configs. In addition, reproduction itself is quite difficult even with the original code. To provide a fair comparison, we include the 5M model, which uses the same pre-training datasets as the previous approaches Frozen and AlignPrompt, and we observe that our approach works much better than these two fairly compared baselines. See Table 1 of the paper for details.

Also, #PT may not be a fair measure of data abundance, because pairs in image-text datasets can be regarded as independent, while those in video-text datasets cannot.

Yes, we agree that #video-text and #image-text pairs are not directly comparable, but they should roughly reflect the scale of the data used. We have listed all the data sources behind #PT for readers who are interested in a more detailed comparison. You are very welcome to leave any suggestions for further improving the tables. Thanks!

Best, Jie