m-bain / frozen-in-time

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [ICCV'21]
https://arxiv.org/abs/2104.00650
MIT License

Curriculum Learning and Video-Image Joint Training #45

Closed · vateye closed this issue 2 years ago

vateye commented 2 years ago

Hi,

I have a question about the curriculum learning. For the 1-frame pretraining, both the CC3M and WebVid-2M datasets are used. But in the 4-frame fine-tuning stage, did you train jointly on both video and images (4 frames for WebVid-2M and 1 frame for CC3M)? I cannot find any experimental details for "joint image-video training" in the paper.

Thanks in advance.

m-bain commented 2 years ago

Hi, yes, for the 4-frame stage: 4 frames for WebVid-2M and 1 frame for CC3M :)
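
For concreteness, here is a minimal PyTorch-style sketch of this joint scheme, treating CC3M images as single-frame videos and alternating batches between the two datasets. The loader and model interfaces (`webvid_loader`, `cc3m_loader`, `model(frames, text)`) are hypothetical illustrations, not the repo's actual API:

```python
import itertools

def joint_training_epoch(model, optimizer, webvid_loader, cc3m_loader, device="cuda"):
    """Hypothetical sketch: alternate video and image batches in one epoch.

    Assumes webvid_loader yields (video, text) with video of shape
    (B, 4, C, H, W), and cc3m_loader yields (image, text) with image of
    shape (B, C, H, W).
    """
    # Cycle the image loader so the (typically larger) video loader
    # drives the epoch length.
    for (video, vid_text), (image, img_text) in zip(
        webvid_loader, itertools.cycle(cc3m_loader)
    ):
        # An image is treated as a 1-frame video: (B, C, H, W) -> (B, 1, C, H, W).
        image = image.unsqueeze(1)
        for frames, text in ((video, vid_text), (image, img_text)):
            frames = frames.to(device)
            loss = model(frames, text)  # assumed to return a contrastive loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Because the image batch is just a 1-frame video under this view, the same video encoder handles both sources without any architectural change.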

vateye commented 2 years ago

Cool, thanks. By the way, did you run an experiment using only WebVid-2M during the 4-frame stage, or any ablation on joint image-video training?

m-bain commented 2 years ago

We have some results for training on WebVid only and on images only in the recent arXiv version: https://arxiv.org/abs/2104.00650. I didn't try 1-frame (CC3M + WebVid) -> 4-frame (WebVid only). The biggest performance increase on video retrieval comes from having a really strong vision-language representation; it doesn't matter that much whether it is learned from images or video, which is why CLIP-based methods are SOTA now. I think pre-training on more diverse datasets will always give the best results for video retrieval, but maybe it's different for other video tasks.

vateye commented 2 years ago

Thanks!