Hi yes for the 4-frame stage: 4 frames for WebVid 2M and 1 frame for CC3M :)
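For anyone else landing here, below is a minimal sketch of what the joint image-video training in the 4-frame stage could look like: CC3M images are treated as 1-frame videos, WebVid-2M clips are sampled with 4 frames, and batches from the two loaders are alternated. The dataset classes, the `num_frames` argument, and the model's `contrastive_loss` method are hypothetical placeholders for illustration, not the repo's actual API.

```python
# Minimal sketch (not the repo's actual code) of joint image-video curriculum training.
# Dataset classes, loader construction and the `num_frames` argument are assumptions.
import itertools
import torch
from torch.utils.data import DataLoader

def make_loaders(stage, cc3m_dataset_cls, webvid_dataset_cls, batch_size=16):
    """Stage 1: 1 frame for both CC3M and WebVid-2M (images = 1-frame videos).
    Stage 2: 4 frames for WebVid-2M, still 1 frame for CC3M."""
    webvid_frames = 1 if stage == 1 else 4
    cc3m = cc3m_dataset_cls(num_frames=1)            # images as single "frames"
    webvid = webvid_dataset_cls(num_frames=webvid_frames)
    return (DataLoader(cc3m, batch_size=batch_size, shuffle=True),
            DataLoader(webvid, batch_size=batch_size, shuffle=True))

def joint_epoch(model, optimizer, cc3m_loader, webvid_loader, device="cuda"):
    # Alternate image and video batches; each batch is assumed to be
    # (frames, text_tokens), with frames shaped (B, T, C, H, W) where
    # T = 1 for CC3M and 1 or 4 for WebVid-2M depending on the stage.
    for img_batch, vid_batch in zip(cc3m_loader, itertools.cycle(webvid_loader)):
        for frames, text in (img_batch, vid_batch):
            frames, text = frames.to(device), text.to(device)
            video_emb, text_emb = model(frames, text)        # dual-encoder forward
            loss = model.contrastive_loss(video_emb, text_emb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

A design note on the sketch: cycling the (smaller) WebVid-2M loader against one pass over CC3M is just one way to mix the two sources; the actual mixing ratio and whether image and video batches are interleaved or concatenated is a training choice not specified here.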
Cool, thanks. By the way, did you run an experiment that uses only WebVid-2M during the 4-frame stage, or any ablation on "joint image-video training"?
We have some results when training on WebVid only and on images only in the recent arXiv version: https://arxiv.org/abs/2104.00650. I didn't try 1-frame (CC + WebVid) -> 4-frame (WebVid only). The biggest performance gains on video retrieval come from having a really strong vision-language representation; it doesn't matter that much whether it comes from images or video, which is why CLIP-based methods are SOTA now. I think pre-training on more diverse datasets will always give the best results for video retrieval, but maybe it's different for other video tasks.
Thanks!
Hi,
I have a question about the curriculum learning. For the 1-frame pre-training, both the CC3M and WebVid-2M datasets are used. But when moving to the 4-frame stage, did you use both video and image data for joint training (4 frames for WebVid-2M and 1 frame for CC3M)? I cannot find any experimental details on "joint image-video training" in the paper.
Thanks in advance.