Closed. May2333 closed this issue 9 months ago.
We use the pre-trained ckpt and fine-tune it on the downstream tasks in Table 1. For text-to-video retrieval, a larger batch size may lead to better performance.
Here is the log of our re-implementation (PyTorch 1.12 / CUDA 11.3):
fine-tuned ckpt: ep=18 / ep=20
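For context on why batch size matters here, below is a minimal sketch of a contrastive retrieval fine-tuning step (not the repo's actual training code; the embedding shapes and temperature are assumptions). Every other clip in the batch acts as a negative, so a larger batch gives more in-batch negatives per caption.

```python
# Hypothetical sketch of one text-video contrastive step (InfoNCE-style).
# video_emb / text_emb stand in for whatever encoder outputs the repo uses.
import torch
import torch.nn.functional as F

def retrieval_step(video_emb, text_emb, temperature=0.05):
    """video_emb, text_emb: (batch, dim) embeddings from paired clips/captions."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = text_emb @ video_emb.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(sim.size(0), device=sim.device)  # matched pairs on the diagonal
    # Symmetric loss over text-to-video and video-to-text directions.
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2

# With batch_size=256 each caption is contrasted against 255 in-batch negatives;
# with batch_size=32 only 31, which typically weakens the retrieval objective.
```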
Thanks for your reply! I'll try with more epochs!
Hi,
Thank you for your great work! I would like to know: are the models in Table 1 fine-tuned on the downstream datasets (TGIF-Frame or DiDeMo)? What are the detailed settings for Table 1? The paper only says "All variants are pre-trained on WebVid [3] for 5 epochs". I trained your model with spatial-focused image feature targets (5 epochs on WebVid-2.5M, then 10 epochs on DiDeMo) and got an R@1 of 25.23 on DiDeMo (35.4 in your paper). Am I missing something here? Any reply would be helpful!
Best wishes!
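For reference, this is a hedged sketch of how text-to-video R@1/R@5 on DiDeMo is commonly computed: rank all candidate videos by similarity to each caption and count how often the ground-truth video is ranked within the top k. The similarity matrix below is random placeholder data, not features from this repo.

```python
# Recall@k from a text-video similarity matrix (ground truth on the diagonal).
import torch

def recall_at_k(sim, k=1):
    """sim: (num_texts, num_videos); text i is paired with video i."""
    ranks = sim.argsort(dim=-1, descending=True)      # video indices sorted by score
    gt = torch.arange(sim.size(0)).unsqueeze(1)       # ground-truth video per caption
    hits = (ranks[:, :k] == gt).any(dim=-1).float()   # 1 if ground truth in top-k
    return 100.0 * hits.mean().item()

sim = torch.randn(1000, 1000)  # placeholder similarity scores
print(f"R@1 = {recall_at_k(sim, 1):.2f}, R@5 = {recall_at_k(sim, 5):.2f}")
```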