m-bain / frozen-in-time

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [ICCV'21]
https://arxiv.org/abs/2104.00650
MIT License

Can you share some recordings of your experiments #39

Open realTaki opened 2 years ago

realTaki commented 2 years ago

Can you share some recordings of your experiments, such as graphs in neptune.ai or other logs that track how performance and loss change over training steps?

I would like to compare in depth how some configurations (e.g. batch size) affect training convergence. I think this uses a contrastive loss that depends on a similarity matrix, so it may be affected by batch size and converge more slowly with a smaller batch size. Your experiments did not use large batch sizes, so they may not have achieved the best performance yet. I want to try something haha~

m-bain commented 2 years ago

Hi, sure you can see some runs for MSRVTT here: https://app.neptune.ai/m-bain/frozen/experiments?split=tbl&dash=charts&viewId=95e7e8f0-79f1-48a4-9bd5-e1017c21309b

Yeah smaller batch size will take longer to converge -- and intuitively I would think it gives worse performance, since a batch of n pairs only gives n^2 comparisons.

However, I find for these small datasets that a small batch size does really well if you tune the learning rate accordingly, maybe since it's like more augmentation. All my best results are with batch size 8-16. I think during pretraining bigger is better just because training is hard to converge. Let me know how you get on :)

bryant1410 commented 2 years ago

For the sake of sharing results, I have reproduced the pre-training on CC3M+WebVid with 1-frame batch size 512 (instead of 96) and 4-frame batch size 128 (instead of 24). On MSR-VTT (1k-A split) zero-shot I got ~2% absolute improvement in R@1, R@5, and R@10. On MSR-VTT fine-tuning (1k-A split) (can't remember the batch size but probably 128), I got +2% in R@1 while R@5 and R@10 were essentially the same.
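
For anyone following the batch-size discussion in this thread, here is a minimal NumPy sketch of a symmetric contrastive loss over an n x n text-video similarity matrix, to show where the n^2 comparisons come from. It is an illustration under assumptions, not code from this repository: the function names, the temperature value of 0.05, and the use of plain NumPy are all hypothetical choices made for the example.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(text_emb, video_emb, temperature=0.05):
    """Sketch of a symmetric contrastive loss (temperature is an assumed value).

    Row i of text_emb and row i of video_emb are treated as a matching pair;
    every other row in the batch acts as a negative.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    # (n, n) similarity matrix: n positives on the diagonal, n*(n-1) negatives.
    sim = t @ v.T / temperature
    idx = np.arange(sim.shape[0])

    # Text-to-video normalises over each row, video-to-text over each column.
    t2v = -log_softmax(sim, axis=1)[idx, idx].mean()
    v2t = -log_softmax(sim, axis=0)[idx, idx].mean()
    return t2v + v2t


def num_negatives(batch_size):
    """Off-diagonal entries of the similarity matrix for one batch."""
    return batch_size * (batch_size - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(16, 32))
    video = text + 0.1 * rng.normal(size=(16, 32))  # noisy matching pairs
    print("loss:", symmetric_contrastive_loss(text, video))
    for n in (8, 16, 128, 512):
        print(n, "pairs ->", num_negatives(n), "negative comparisons per batch")
```

The number of negatives grows quadratically with batch size, which is the intuition behind expecting larger batches to help. It says nothing about how the learning rate should be retuned when the batch size changes, which the comments above suggest matters at least as much on small datasets.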