jackroos / VL-BERT

Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".

Which config file should I use when doing pre-training and fine-tuning on each task to reproduce the paper results? #1

Closed · yangapku closed 4 years ago

yangapku commented 4 years ago

Hi. I have noticed that there are several config files in cfg/pretrain, cfg/vqa and cfg/refcoco (for example, cfg/pretrain contains three base-model configs: base_e2e_16x16G_fp16.yaml, base_prec_4x16G_fp32.yaml and base_prec_withouttextonly_4x16G_fp32.yaml). Could you provide more details about the differences between these configs? If I want to reproduce the paper results, which of them should I use? Thank you!

jackroos commented 4 years ago

  1. Config file names follow the format <MODEL/SETTING>_<NUM_GPUxGPU_MEM>_<fp16/32> (see the parsing sketch after this list).
  2. In the pre-training configs, 'e2e' means the Fast R-CNN detector is fine-tuned during pre-training, while 'prec' means the Fast R-CNN features are fixed and precomputed; the latter corresponds to setting (d) in Table 4 of the paper. In addition, base_prec_withouttextonly_4x16G_fp32.yaml corresponds to setting (c).
  3. For the fine-tuning experiments, you should download the pre-trained models (see PREPARE_PRETRAINED_MODELS.md) and then use the corresponding configs. The pre-trained model path is specified in NETWORK.PARTIAL_PRETRAIN of the config YAML.
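
To illustrate point 1, here is a minimal Python sketch (my own illustration, not code from this repo) that splits a config file name according to that naming convention; the parse_config_name helper and the regex are hypothetical.

```python
import re

# Hypothetical helper (not part of VL-BERT): parse the
# <MODEL/SETTING>_<NUM_GPUxGPU_MEM>_<fp16/32>.yaml naming convention.
CONFIG_NAME_RE = re.compile(
    r"^(?P<setting>.+)_(?P<num_gpus>\d+)x(?P<gpu_mem_gb>\d+)G_(?P<precision>fp16|fp32)\.yaml$"
)

def parse_config_name(filename: str) -> dict:
    """Return the model/setting, GPU count, per-GPU memory and precision encoded in a config name."""
    match = CONFIG_NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Unexpected config name: {filename}")
    info = match.groupdict()
    info["num_gpus"] = int(info["num_gpus"])
    info["gpu_mem_gb"] = int(info["gpu_mem_gb"])
    return info

print(parse_config_name("base_e2e_16x16G_fp16.yaml"))
# {'setting': 'base_e2e', 'num_gpus': 16, 'gpu_mem_gb': 16, 'precision': 'fp16'}

print(parse_config_name("base_prec_withouttextonly_4x16G_fp32.yaml"))
# {'setting': 'base_prec_withouttextonly', 'num_gpus': 4, 'gpu_mem_gb': 16, 'precision': 'fp32'}
```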

Thanks!

coldmanck commented 4 years ago

Hi @jackroos

May I know why TRAIN.GRAD_ACCUMULATE_STEPS is missing in cfgs/vqa/base_4x16G_fp32.yaml? Does it default to 2 (or 4)? Thank you!

jackroos commented 4 years ago

@coldmanck Actually, TRAIN.GRAD_ACCUMULATE_STEPS defaults to 1. In cfgs/vqa/base_4x16G_fp32.yaml, the total batch size is 4*64=256, which is already large enough, so we don't need gradient accumulation in this case. Thanks!

coldmanck commented 4 years ago

Hi @jackroos

Thank you for your response! May I know how the batch size works in your code? For example, I understand that in 4*64=256 the 64 comes from config.TRAIN.BATCH_IMAGES, but why is it multiplied by 4? Also, how does it interact with TRAIN.GRAD_ACCUMULATE_STEPS?

I am guessing the final batch size is calculated as GRAD_ACCUMULATE_STEPS * BATCH_IMAGES * (number of GPUs)? (I may well be wrong.)

jackroos commented 4 years ago

@coldmanck Yes! The 'actual' batch size is exactly what you guessed.
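
For concreteness, a tiny sketch of that calculation (the helper is my own illustration, not a function in the repo):

```python
# Hypothetical helper (not from the VL-BERT codebase): the effective batch size confirmed
# above is images-per-GPU * number-of-GPUs * gradient-accumulation steps.
def effective_batch_size(batch_images: int, num_gpus: int, grad_accumulate_steps: int = 1) -> int:
    return batch_images * num_gpus * grad_accumulate_steps

# cfgs/vqa/base_4x16G_fp32.yaml: BATCH_IMAGES=64, 4 GPUs, GRAD_ACCUMULATE_STEPS left at its default of 1.
print(effective_batch_size(batch_images=64, num_gpus=4))  # 256
```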

coldmanck commented 4 years ago

Thanks a lot!

coldmanck commented 4 years ago

@jackroos But I find that some configs do not give an actual batch size of 256 as mentioned in the paper. For example, in cfgs/vcr/large_q2a_4x16G_fp16.yaml, BATCH_IMAGES is 4 and GRAD_ACCUMULATE_STEPS is 4, so assuming 4 GPUs are used, the 'actual' batch size is 64. Should we modify any of these hyper-parameters to match your batch size of 256 in order to reproduce your results?

jackroos commented 4 years ago

@coldmanck In VCR, the effective batch size is 4x larger than the batch size in the config, since each question comes with 4 answer candidates.
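
Extending the earlier sketch (still a hypothetical helper, not repo code), the VCR numbers work out as follows:

```python
# In VCR (Q->A), each question comes with 4 answer candidates, so every image in the
# batch effectively contributes 4 training samples.
def effective_batch_size_vcr(batch_images: int, num_gpus: int,
                             grad_accumulate_steps: int, answer_candidates: int = 4) -> int:
    return batch_images * answer_candidates * num_gpus * grad_accumulate_steps

# cfgs/vcr/large_q2a_4x16G_fp16.yaml: BATCH_IMAGES=4, GRAD_ACCUMULATE_STEPS=4, assuming 4 GPUs.
print(effective_batch_size_vcr(batch_images=4, num_gpus=4, grad_accumulate_steps=4))  # 4*4*4*4 = 256
```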

coldmanck commented 4 years ago

I see. Thanks again 👍