Closed · Eniac-Xie closed this issue 3 years ago

Thanks for releasing your code!

I am new to the text-video retrieval task, and I wonder why the retrieval results of ClipBERT are much lower than those reported in the paper "Support-set bottlenecks for video-text representation learning" (even lower than the other related works cited in that paper). Are any experiment settings different? Thank you.
Hi @Eniac-Xie,
Thanks for your question. To answer it, I carefully read the paper "Support-set bottlenecks for video-text representation learning" (referred to as [1]) and present some of the differences I found below.
There are several differences in the experiment settings (a simplified sketch contrasting them follows this list):

- Backbone models: [1] builds on stronger backbones trained on larger-scale datasets, while ClipBERT trains its visual backbone end-to-end with sparse sampling.
- Training strategies and objectives: the two works use different training strategies and objective functions.
- Large-scale pretraining: [1] additionally benefits from large-scale video-text pretraining.
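To make the contrast concrete, here is a minimal sketch of the two pipeline styles. It is purely illustrative and not code from ClipBERT, [1], or [3]; all module names, dimensions, and shapes are made up.

```python
# Illustrative only -- not the actual code of any of the papers discussed here.
import torch
import torch.nn as nn

# Setting A: feature-based pipeline. A strong video backbone is run offline, and the
# retrieval model is trained on top of the pre-extracted features, so the backbone
# itself never receives gradients from the retrieval objective.
pre_extracted = torch.randn(8, 32, 2048)      # (batch, num_segments, feature_dim)
feature_head = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 256))
video_emb_a = feature_head(pre_extracted).mean(dim=1)     # (batch, 256)

# Setting B: ClipBERT-style sparse sampling. Only a few frames per video are used,
# but they are encoded from raw pixels, so the visual backbone is trained end-to-end.
frames = torch.randn(8, 2, 3, 224, 224)       # (batch, num_sampled_frames, C, H, W)
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 256),
)
b, t = frames.shape[:2]
video_emb_b = backbone(frames.flatten(0, 1)).view(b, t, -1).mean(dim=1)  # (batch, 256)
```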
In summary, there are many differences between ClipBERT and [1,3], so it is difficult to compare them fairly. Meanwhile, their individual contributions are mostly orthogonal, which means we can always bring the ideas from these papers together to build a stronger model. For example, as noted above, one could combine the better objective functions, large-scale video-text pretraining, and stronger backbones used in [1] with ClipBERT's end-to-end sparse sampling strategy.
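As a rough illustration of that combination (not code from either paper), the sketch below pairs an end-to-end video encoder over sparsely sampled frames with a generic symmetric contrastive (InfoNCE-style) video-text loss, one common choice for this kind of retrieval objective; the actual objectives in [1,3] differ in detail, and the function name, dimensions, and temperature here are all hypothetical.

```python
# Illustrative only: a symmetric contrastive video-text loss applied to embeddings
# produced by an end-to-end, sparsely sampled video encoder.
import torch
import torch.nn.functional as F

def contrastive_video_text_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) embeddings of matched video-text pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; average the two retrieval directions
    # (text-to-video and video-to-text).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage: plug in any video/text encoders; random tensors stand in for their outputs here.
video_emb = torch.randn(8, 256, requires_grad=True)
text_emb = torch.randn(8, 256, requires_grad=True)
loss = contrastive_video_text_loss(video_emb, text_emb)
loss.backward()
```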
Hope this answers your question and inspires readers to further improve the ClipBERT model.
Best, Jie
References:
[1] Patrick, M., Huang, P.Y., Asano, Y., Metze, F., Hauptmann, A., Henriques, J. and Vedaldi, A., 2020. Support-set bottlenecks for video-text representation learning. ICLR.
[2] Yu, Y., Kim, J. and Kim, G., 2018. A joint sequence fusion model for video question answering and retrieval. ECCV.
[3] Gabeur, V., Sun, C., Alahari, K. and Schmid, C., 2020. Multi-modal transformer for video retrieval. ECCV.
Got it, thank you!