linjieli222 / HERO

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
https://arxiv.org/abs/2005.00200
MIT License

How many train steps are needed to get the performance of the paper when finetuning TVR dataset? #14

Closed liveseongho closed 3 years ago

liveseongho commented 3 years ago

Hi, I'm trying to finetune the HERO pretrained model on the TVR dataset, but with 5000 or 10000 training steps I failed to reach the performance reported in the paper.

  1. How many training steps are needed to finetune on the TVR dataset?
  2. Is the number of GPUs critical to performance? I'm running this finetuning with 4 GPUs.

Also, the paper doesn't say anything about hard negative sampling, but it seems to be important.

  1. Have you done an ablation study on hard negatives? Could you share your experience?
linjieli222 commented 3 years ago

Hi,

Thanks for your interest in this project.

  1. We have provided the best training config. The performance reported in the paper is from 5000 steps on 8 GPUs.

  2. The number of GPUs will affect performance, as our hard negative sampling is conducted across all GPUs. With fewer GPUs, you see fewer examples in a single training step (see the sketch after this list).

  3. For hard negatives, we strictly followed the original TVR work on how the model is trained. Please check their repo. Thanks.
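To make point 2 concrete, here is a minimal sketch (not the repo's actual code) of cross-GPU hard negative sampling, written against a generic `torch.distributed` setup for illustration; the function names, the margin value, and the exact loss form are assumptions, not HERO's implementation. The point is that the candidate negative pool is gathered from all ranks, so its size scales with the number of GPUs: with the same per-GPU batch size, 4 GPUs gives half the negatives per step that 8 GPUs does.

```python
import torch
import torch.distributed as dist


def gather_negative_pool(local_video_emb: torch.Tensor) -> torch.Tensor:
    """Gather video embeddings from all GPUs to form the negative pool.

    local_video_emb: (B, D) embeddings computed on this rank.
    Returns a (B * world_size, D) tensor, so the pool shrinks
    proportionally when fewer GPUs are used.
    """
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_video_emb) for _ in range(world_size)]
    dist.all_gather(gathered, local_video_emb)
    # all_gather does not carry gradients for other ranks' tensors;
    # keep the local copy so gradients still flow through this rank.
    gathered[dist.get_rank()] = local_video_emb
    return torch.cat(gathered, dim=0)


def hinge_loss_with_hard_negatives(query_emb: torch.Tensor,
                                   video_pool: torch.Tensor,
                                   pos_idx: torch.Tensor,
                                   margin: float = 0.1) -> torch.Tensor:
    """Ranking-loss sketch in the spirit of the TVR baseline:
    score each query against the whole gathered pool and take the
    highest-scoring wrong video as the hard negative.

    pos_idx: (B,) long tensor giving each query's positive index in the pool
             (e.g. rank * B + arange(B)).
    """
    scores = query_emb @ video_pool.t()                      # (B, B * world_size)
    pos_scores = scores.gather(1, pos_idx.unsqueeze(1))      # (B, 1)
    # Mask out the positives, then pick the hardest remaining candidate.
    neg_scores = scores.scatter(1, pos_idx.unsqueeze(1), float("-inf"))
    hard_neg_scores, _ = neg_scores.max(dim=1, keepdim=True)
    return torch.clamp(margin + hard_neg_scores - pos_scores, min=0).mean()
```

Under this reading, matching the paper's numbers with 4 GPUs may require compensating for the smaller negative pool, for example with a larger per-GPU batch size or more training steps.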

liveseongho commented 3 years ago

Thanks for your quick response! 😃