Hi @SCZwangxiao,
Thanks for your question!
For a fair comparison, you should apply the same video datasets to all models, with your model using the single-frame training strategy while the others remain unchanged. In your paper, the baselines' pre-training datasets are diverse, and many baselines do not use high-quality image-text datasets.
Yes, ideally, for a fair comparison, we would compare all methods under the same pre-training datasets, but this is often not feasible as many methods do not release their code and/or training configs. In addition, reproduction itself is quite difficult even with the original code. To provide a fair comparison, we include the 5M model, which uses the same pre-training datasets as the previous approaches Frozen and AlignPrompt; our approach works much better than these two fairly compared baselines. Details are in Table 1 of the paper.
Also, #PT may not be fair enough to evaluate data abundance, because pairs in image-text datasets can be regarded as independent, while those in video-text datasets are not.
Yes, we agree that #video-text and #image-text are not directly comparable, but they should roughly reflect the data scale used. We have listed all the data sources behind #PT for readers who are interested in a more detailed comparison. You are very welcome to leave any suggestions for further improving the tables. Thanks!
Best, Jie
Thanks for the great work and for open-sourcing it. "Single frame bias" is an interesting phenomenon.
However, I am quite confused by the experiment settings on the training datasets. Concretely, regarding what you claimed in the paper:
For a fair comparison, you should apply the same video datasets to all models, with your model using the single-frame training strategy while the others remain unchanged. In your paper, the baselines' pre-training datasets are diverse, and many baselines do not use high-quality image-text datasets.
Also, #PT may not be fair enough to evaluate data abundance, because pairs in image-text datasets can be regarded as independent, while those in video-text datasets are not.