Alen-T closed this issue 2 years ago
Thank you for your question and for your interest in our work! The exact settings used to obtain our results on each dataset are provided in the README. Generally speaking, the things that can affect performance are video frame sampling during training and the seed. We trained on compressed videos to speed things up (code provided in preprocess/compress_video.py), with the default seed and lr=1e-5 and 3e-5 for LSMDC and MSR-VTT-9K respectively. However, your performance gap might suggest that another hyperparameter differs, for example batch_size (we use batch_size=32 on a single GPU). I hope this helps!
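For anyone comparing runs, the settings quoted in this reply can be pinned in one place before training. The following is only a minimal sketch, not the repository's code: `REPORTED_SETTINGS` and `seed_everything` are hypothetical names, the seed value must be supplied by the caller (the thread only says "default seed"), and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import importlib
import random

# Values quoted in this thread; the dictionary itself is a hypothetical helper.
REPORTED_SETTINGS = {
    "LSMDC": {"lr": 1e-5, "batch_size": 32},
    "MSR-VTT-9K": {"lr": 3e-5, "batch_size": 32},
}


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    seeded = ["random"]
    try:
        np = importlib.import_module("numpy")
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass  # NumPy not available; nothing to seed.
    try:
        torch = importlib.import_module("torch")
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # No effect when CUDA is unavailable.
        seeded.append("torch")
    except ImportError:
        pass  # PyTorch not available; nothing to seed.
    return seeded
```

Calling `seed_everything(s)` with the same `s` at the start of each run removes one source of variation, although frame sampling and GPU kernels can still introduce differences between machines.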
Thank you for your reply. We used the default settings, including the seed and batch size. We also trained on a single GPU and tried an A100 (40G), V100 (32G), A40 (48G), and 3090Ti (24G), but none of them reproduced the results in the paper. Could you provide the specific package versions you used?
The GPU we used is a Titan RTX (24GB). Regarding package versions, as mentioned in our README, we used PyTorch 1.8.1, Transformers 4.6.1 and OpenCV 4.5.3. I also want to make sure you are using the commands listed under https://github.com/layer6ai-labs/xpool#training to train your models. We have tested our paper results across different machines and they are quite similar across runs. Thanks!
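A quick way to compare a local environment against the versions listed in this reply is sketched below. It is an illustration under assumptions: the pip distribution names (especially `opencv-python`) are guesses, since the thread names only the libraries, and `find_mismatches` and `same_release` are hypothetical helpers.

```python
from importlib import metadata

# Versions quoted in the thread. The pip distribution names are assumptions,
# in particular "opencv-python" (OpenCV ships under several package names).
EXPECTED = {
    "torch": "1.8.1",
    "transformers": "4.6.1",
    "opencv-python": "4.5.3",
}


def same_release(installed, wanted):
    """True for an exact match or a suffixed build such as 1.8.1+cu111."""
    return installed == wanted or installed.startswith((wanted + ".", wanted + "+"))


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # Package is not installed at all.
        if installed is None or not same_release(installed, wanted):
            mismatches[name] = (wanted, installed)
    return mismatches
```

Running `find_mismatches(EXPECTED)` returns an empty dictionary when all three packages match, and otherwise lists the expected and installed versions side by side.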
Closing due to inactivity
@Alen-T, sorry to disturb you. When I reproduce the results on LSMDC, the MeanR value is around 200, which has confused me for a long time. I would like to know the source of the 118081 total videos (training, validation and testing). The LSMDC 2016 dataset that I applied for on the official site has 10079, 7408 and 1000 videos for training, validation and testing respectively, so the total does not equal 118081. What is the reason for this difference? Looking forward to your reply. Thanks in advance.
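For context on how MeanR relates to the other metrics, here is a minimal sketch of the usual rank-based definitions. It is not taken from the repository's evaluation code, and it assumes one caption per video with the correct video for text `i` at index `i`; `retrieval_metrics` is a hypothetical name.

```python
from statistics import mean, median


def retrieval_metrics(sim):
    """sim[i][j] is the similarity of text i and video j; video i matches text i."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank 1 means the correct video has the highest similarity.
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)
    n = len(ranks)
    return {
        "R@1": 100.0 * sum(r <= 1 for r in ranks) / n,
        "R@5": 100.0 * sum(r <= 5 for r in ranks) / n,
        "R@10": 100.0 * sum(r <= 10 for r in ranks) / n,
        "MdR": median(ranks),
        "MeanR": mean(ranks),
    }
```

Because MeanR averages all ranks, a small number of queries whose correct video is ranked very low can raise it considerably while MdR stays small.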
Hello, thank you very much for your work and code. I have a few questions that puzzle me, and I hope you can help. I trained the model many times with the parameters you gave, but the best results I obtained on the MSRVTT and LSMDC test sets are as follows:
Text-Video in MSRVTT (9K train): R@1: 45.4, R@5: 72.2, R@10: 81.3, MdR: 2.0
Text-Video in LSMDC: R@1: 24.12, R@5: 42.64, R@10: 51.85, MdR: 9.0
And the results given in the paper are:
Text-Video in MSRVTT (9K train): R@1: 46.9, R@5: 72.8, R@10: 82.2, MdR: 2.0
Text-Video in LSMDC: R@1: 25.2, R@5: 43.7, R@10: 53.5, MdR: 8.0
How can I get the results in your paper? Thank you again, and look forward to your answers.
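To make the size of the gap explicit, the two sets of numbers above can be subtracted metric by metric. This is a small sketch using only the figures quoted in this comment; the helper name `gaps` is hypothetical.

```python
# Numbers copied from the two result listings above.
reproduced = {
    "MSRVTT (9K train)": {"R@1": 45.4, "R@5": 72.2, "R@10": 81.3, "MdR": 2.0},
    "LSMDC": {"R@1": 24.12, "R@5": 42.64, "R@10": 51.85, "MdR": 9.0},
}
paper = {
    "MSRVTT (9K train)": {"R@1": 46.9, "R@5": 72.8, "R@10": 82.2, "MdR": 2.0},
    "LSMDC": {"R@1": 25.2, "R@5": 43.7, "R@10": 53.5, "MdR": 8.0},
}


def gaps(ours, theirs):
    """Return reproduced minus paper for each dataset and metric."""
    return {
        ds: {m: round(ours[ds][m] - theirs[ds][m], 2) for m in ours[ds]}
        for ds in ours
    }


for ds, diff in gaps(reproduced, paper).items():
    print(ds, diff)
```

On these numbers the recall metrics fall short of the paper by between 0.6 and 1.65 points, and MdR differs only on LSMDC.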