microsoft / XPretrain

Multi-modality pre-training

Reproducing the result of CLIP-ViP performance on MSRVTT #13

Closed justopit closed 1 year ago

justopit commented 1 year ago

I can't reproduce the reported CLIP-ViP performance on MSRVTT. I used the default config file with epoch=100 and bs=16 (or epochs=5 and bs=128, as stated in the paper). The best t2v R@1 and v2t R@1 I get are both 49+.

HellwayXue commented 1 year ago

Hi, may I ask how many GPUs you used in your experiment? We use 8 GPUs with batchsize=16 on each GPU, so the overall batch size is 128. As for epochs, there are 20 sentences for each video, so we set epoch=100 in the config to reach an actual epoch count of 5. Please check these settings or provide your training log.
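The arithmetic in the reply above can be checked with a minimal sketch. The helper names `effective_batch_size` and `effective_epochs` are hypothetical and do not come from the XPretrain code. The sketch assumes that the overall batch size is the per-GPU batch size times the number of GPUs, and that the config epoch value is divided by the number of captions per video to get the actual epoch count, which is one reading of the explanation given here.

```python
# Hypothetical helpers for illustration only; not part of the XPretrain repo.

def effective_batch_size(n_gpu, per_gpu_batch_size):
    """Overall batch size when each GPU processes its own mini-batch."""
    return n_gpu * per_gpu_batch_size


def effective_epochs(config_epochs, captions_per_video):
    """Actual epochs, assuming the config value is scaled by captions per video."""
    return config_epochs / captions_per_video


# Setting described by the maintainer: 8 GPUs, 16 per GPU, epoch=100, 20 captions.
print(effective_batch_size(8, 16))   # 128
print(effective_epochs(100, 20))     # 5.0
```

With a different GPU count, the per-GPU batch size would need to change to keep the overall batch size at 128.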

justopit commented 1 year ago

With an overall batch size of 48 * 3 (N_gpu) and 100 epochs, t2v R@1 is 50+, but v2t R@1 is only 48+. Could the number of GPUs be the reason? This seems strange.

HellwayXue commented 1 year ago

N_gpu may have a small effect, but your numbers are already close to ours. We did not include v2t results in our paper, but I found our training log. The results are:

t2v recall@1: 50.1
t2v recall@5: 74.8
t2v recall@10: 84.6
v2t recall@1: 49.1
v2t recall@5: 76.2
v2t recall@10: 84.0
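For readers unfamiliar with these metrics, here is a minimal sketch of how recall@K is commonly computed in both directions from one text-video similarity matrix. It is not the repository's evaluation code. It assumes one caption per test video, so the correct match for query i is candidate i, and the similarity values are made-up toy numbers.

```python
# Illustrative recall@K; assumes ground truth lies on the diagonal (query i <-> candidate i).

def recall_at_k(sim, k):
    """sim[i][j] is the similarity between query i and candidate j."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the correct candidate = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(sim):
    """Swap query and candidate roles (text-to-video becomes video-to-text)."""
    return [list(col) for col in zip(*sim)]


# Toy matrix: rows are texts, columns are videos (values are invented).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.6, 0.5, 0.7],
]

t2v_r1 = recall_at_k(sim, 1)
v2t_r1 = recall_at_k(transpose(sim), 1)
print(t2v_r1, v2t_r1)
```

In this toy matrix the two directions give different R@1 values even though they share the same similarities, which is why t2v and v2t numbers need not match.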

justopit commented 1 year ago

OK, thanks a lot. I have reproduced the results.