jayleicn / ClipBERT

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
https://arxiv.org/abs/2102.06183
MIT License

Question about the text-video retrieval performance #1

Closed Eniac-Xie closed 3 years ago

Eniac-Xie commented 3 years ago

Thanks for your released code!

I am new to the text-video retrieval task, and I wonder why ClipBERT's retrieval results are much lower than those reported in the paper "Support-set bottlenecks for video-text representation learning" (even lower than the other related works cited in that paper). Are there any differences in the experiment settings? Thank you.
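For context, text-to-video retrieval results in these papers are typically reported as Recall@K and median rank computed from a text-video similarity matrix. Below is a minimal NumPy sketch of that scoring under the common assumption that text i is paired with video i; it is an illustrative example, not ClipBERT's actual evaluation code.

```python
import numpy as np

def text_to_video_retrieval_metrics(sim):
    """Recall@K and median rank from a (num_texts, num_videos) similarity matrix.

    Assumes the ground-truth video for text i is video i (diagonal pairing);
    a generic illustration, not ClipBERT's evaluation code.
    """
    order = np.argsort(-sim, axis=1)            # videos sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.where(order == gt)[1]            # 0-based rank of the correct video per text
    return {
        "R@1": 100.0 * np.mean(ranks < 1),
        "R@5": 100.0 * np.mean(ranks < 5),
        "R@10": 100.0 * np.mean(ranks < 10),
        "MedR": float(np.median(ranks) + 1),    # 1-based median rank
    }

# Example with a random 100x100 similarity matrix.
print(text_to_video_retrieval_metrics(np.random.rand(100, 100)))
```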

jayleicn commented 3 years ago

Hi @Eniac-Xie,

Thanks for your question. To answer it, I carefully read the paper "Support-set bottlenecks for video-text representation learning" (referred to as [1]) and present some of the differences I found below.

Experiment settings:

Backbone models - [1] uses stronger backbones trained on larger-scale datasets than the ones used in ClipBERT.

Training strategies and objectives - the two models are trained with different objectives (a generic contrastive objective of the kind used by dual-encoder retrieval models is sketched after this list).

Large-scale pretraining - [1] benefits from large-scale video-text pretraining, while ClipBERT is pretrained on image-text data.
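To make the objectives point concrete: roughly speaking, dual-encoder methods such as [1] and [3] encode text and video separately and train with contrastive-style losses (in [1], combined with additional support-set objectives), whereas ClipBERT jointly encodes each text-video pair with a cross-modal transformer. The snippet below sketches only a generic symmetric InfoNCE loss as a simplified stand-in; the function name and temperature value are illustrative, not taken from [1].

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/video embeddings.

    text_emb, video_emb: (batch, dim) tensors; row i of each forms a positive pair.
    Simplified illustration of a dual-encoder objective, not the loss used in [1].
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)

    logits = text_emb @ video_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    # Text-to-video and video-to-text cross-entropy, averaged.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2v + loss_v2t)
```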

In summary, there are many differences between ClipBERT and [1,3], so it is difficult to compare them fairly. Meanwhile, their individual contributions are mostly orthogonal, which means the ideas from these papers can be combined to build a stronger model: for example, combining the better objective functions, large-scale video-text pretraining, and stronger backbones used in [1] with ClipBERT's end-to-end sparse sampling strategy.
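As a rough illustration of the sparse sampling idea mentioned above: instead of processing a densely sampled video, ClipBERT scores a small number of randomly sampled short clips and aggregates the per-clip predictions. The sketch below is a simplification; `clip_scorer`, the clip counts, and mean aggregation are placeholders rather than the exact choices in the released code.

```python
import torch

def sparse_sample_clips(video_frames, num_clips=2, frames_per_clip=2):
    """Randomly pick a few short clips from a video tensor of shape (T, C, H, W).

    Clip counts here are placeholders, not the values from the released configs.
    """
    T = video_frames.size(0)
    starts = torch.randint(0, max(T - frames_per_clip, 1), (num_clips,)).tolist()
    return [video_frames[s:s + frames_per_clip] for s in starts]

def score_pair(clip_scorer, video_frames, text_tokens, num_clips=2):
    """Score a (video, text) pair by averaging per-clip predictions.

    `clip_scorer` is a hypothetical module mapping (clip, text) -> logits.
    """
    clips = sparse_sample_clips(video_frames, num_clips=num_clips)
    clip_logits = torch.stack([clip_scorer(clip, text_tokens) for clip in clips])
    return clip_logits.mean(dim=0)   # aggregate the sparse clip-level predictions
```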

Hope this answers your question and inspires readers to further improve the ClipBERT model.

Best, Jie

References:
[1] Patrick, M., Huang, P.Y., Asano, Y., Metze, F., Hauptmann, A., Henriques, J. and Vedaldi, A., 2020. Support-set bottlenecks for video-text representation learning. ICLR.
[2] Yu, Y., Kim, J. and Kim, G., 2018. A joint sequence fusion model for video question answering and retrieval. ECCV.
[3] Gabeur, V., Sun, C., Alahari, K. and Schmid, C., 2020. Multi-modal transformer for video retrieval. ECCV.

Eniac-Xie commented 3 years ago

Got it, thank you!