Hi @jackroos and thanks for the great repo!

I was looking at the cfgs file for VQA and noticed hyperparameters that differ from the ones in the appendix of the paper: for instance, 5 epochs instead of 20, 500 warmup steps instead of 2000, and a smaller learning rate. For this and the other tasks, should we follow the values in the repository or the ones in the paper?

Also, are inputs not truncated to a maximum length during fine-tuning?

Thanks!
You can fine-tune with 20 epochs, but we found that 5 epochs are enough for the pre-trained VL-BERT; the 20-epoch setting is for comparison with the model trained without pre-training. As for the learning rate, it is consistent with the paper: you need to multiply by the batch size, since the LR in the config yaml is normalized by batch size.
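As a rough illustration of that normalization (a sketch with made-up numbers, not the actual values shipped in the cfgs file), the rate reported in the paper corresponds to the yaml value multiplied by the total batch size:

```python
# Illustrative only: hypothetical values, not the ones in cfgs/vqa.
lr_in_yaml = 6.25e-7          # per-sample LR as stored in the config yaml
batch_images_per_gpu = 4      # hypothetical per-GPU batch size
num_gpus = 8
grad_accumulate_steps = 4     # only if gradient accumulation is enabled

# The yaml LR is normalized by batch size, so the paper-scale LR is
# recovered by multiplying by the total (global) batch size.
total_batch_size = batch_images_per_gpu * num_gpus * grad_accumulate_steps
effective_lr = lr_in_yaml * total_batch_size
print(f"effective LR = {effective_lr:.2e}")   # 6.25e-7 * 128 = 8.00e-05
```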
Since VQA inputs are usually not very long, we don't truncate them.
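If you want to double-check that truncation really isn't needed for your data, a quick sanity check is to tokenize the questions and look at the length distribution (a sketch assuming the standard VQA v2 questions json and the HuggingFace BERT tokenizer; VL-BERT's own tokenization pipeline may differ slightly):

```python
import json
from transformers import BertTokenizer

# Count BERT word-piece tokens per question to confirm they stay well
# below any plausible maximum sequence length.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
with open("v2_OpenEnded_mscoco_train2014_questions.json") as f:
    questions = json.load(f)["questions"]

lengths = [len(tokenizer.tokenize(q["question"])) for q in questions]
print("max question length (tokens): ", max(lengths))
print("mean question length (tokens):", sum(lengths) / len(lengths))
```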