Closed · Eniac-Xie closed this issue 3 years ago

Thanks for releasing your code!

I am new to the text-video retrieval task, and I wonder why the retrieval results of ClipBERT are much lower than those reported in the paper "Support-set bottlenecks for video-text representation learning" (even lower than the other related works cited in that paper). Are any experiment settings different? Thank you.
Hi @Eniac-Xie,
Thanks for your question. To answer it, I carefully read the paper "Support-set bottlenecks for video-text representation learning" (referred to as [1]) and present some of the differences I found below.
There are several differences in the experiment settings (a simplified sketch contrasting them follows this list):

- Backbone models: [1] builds on stronger backbones trained on larger-scale datasets, while ClipBERT trains its visual backbone end-to-end with sparse sampling.
- Training strategies and objectives: the two works use different training strategies and objective functions.
- Large-scale pretraining: [1] additionally benefits from large-scale video-text pretraining.
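To make the contrast concrete, here is a minimal sketch of the two pipeline styles. It is purely illustrative and not code from ClipBERT, [1], or [3]; all module names, dimensions, and shapes are made up.

```python
# Illustrative only -- not the actual code of any of the papers discussed here.
import torch
import torch.nn as nn

# Setting A: feature-based pipeline. A strong video backbone is run offline, and the
# retrieval model is trained on top of the pre-extracted features, so the backbone
# itself never receives gradients from the retrieval objective.
pre_extracted = torch.randn(8, 32, 2048)      # (batch, num_segments, feature_dim)
feature_head = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 256))
video_emb_a = feature_head(pre_extracted).mean(dim=1)     # (batch, 256)

# Setting B: ClipBERT-style sparse sampling. Only a few frames per video are used,
# but they are encoded from raw pixels, so the visual backbone is trained end-to-end.
frames = torch.randn(8, 2, 3, 224, 224)       # (batch, num_sampled_frames, C, H, W)
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 256),
)
b, t = frames.shape[:2]
video_emb_b = backbone(frames.flatten(0, 1)).view(b, t, -1).mean(dim=1)  # (batch, 256)
```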
In summary, there are many differences between ClipBERT and [1,3], so it is difficult to compare them fairly. Meanwhile, their individual contributions are mostly orthogonal, which means we can always bring the ideas from these papers together to build a stronger model. For example, as noted above, one could combine the better objective functions, large-scale video-text pretraining, and stronger backbones used in [1] with ClipBERT's end-to-end sparse sampling strategy.
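As a rough illustration of that combination (not code from either paper), the sketch below pairs an end-to-end video encoder over sparsely sampled frames with a generic symmetric contrastive (InfoNCE-style) video-text loss, one common choice for this kind of retrieval objective; the actual objectives in [1,3] differ in detail, and the function name, dimensions, and temperature here are all hypothetical.

```python
# Illustrative only: a symmetric contrastive video-text loss applied to embeddings
# produced by an end-to-end, sparsely sampled video encoder.
import torch
import torch.nn.functional as F

def contrastive_video_text_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) embeddings of matched video-text pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; average the two retrieval directions
    # (text-to-video and video-to-text).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: plug in any video/text encoders; random tensors stand in for their outputs here.
video_emb = torch.randn(8, 256, requires_grad=True)
text_emb = torch.randn(8, 256, requires_grad=True)
loss = contrastive_video_text_loss(video_emb, text_emb)
loss.backward()
```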
Hope this answers your question and inspires readers to further improve the ClipBERT model.
Best, Jie
References:
[1] Patrick, M., Huang, P.Y., Asano, Y., Metze, F., Hauptmann, A., Henriques, J. and Vedaldi, A., 2020. Support-set bottlenecks for video-text representation learning. ICLR.
[2] Yu, Y., Kim, J. and Kim, G., 2018. A joint sequence fusion model for video question answering and retrieval. ECCV.
[3] Gabeur, V., Sun, C., Alahari, K. and Schmid, C., 2020. Multi-modal transformer for video retrieval. ECCV.
Got it, thank you!