Closed. sanyalsunny111 closed this issue 2 years ago.
I am a bit confused: how did you pretrain at 384x384 and then fine-tune at a different resolution? Did you change the positional encoding the way ViT does? If so, please point me to that part of the code. If not, what did you add to the fine-tuning code to support a different resolution?
For VQA fine-tuning, I use 576x576, following the METER paper. However, it gives only a marginal improvement over 480x480. Both work better than 384x384.
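For readers with the same question, here is a minimal sketch of the ViT-style approach the question refers to: resizing a learned grid of position embeddings when the input resolution changes. It is not confirmed to be what this repository does. The patch size of 16 and the helper names `grid_size` and `resize_pos_embed` are assumptions made for illustration, and plain bilinear interpolation is used for simplicity (library implementations may use a different interpolation mode and usually handle the class token embedding separately).

```python
def grid_size(image_size, patch_size):
    """Number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def resize_pos_embed(grid, new_size):
    """Bilinearly resize a square grid of position embeddings.

    grid: old_size x old_size nested list; each cell is one embedding (list of floats).
    Returns a new_size x new_size nested list with the same embedding width.
    Sampling maps the corners of the new grid onto the corners of the old grid.
    """
    old_size = len(grid)
    dim = len(grid[0][0])
    # Step in old-grid coordinates per new-grid index.
    scale = (old_size - 1) / (new_size - 1) if new_size > 1 else 0.0
    out = []
    for i in range(new_size):
        y = i * scale
        y0 = min(int(y), old_size - 1)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0  # vertical blend weight
        row = []
        for j in range(new_size):
            x = j * scale
            x0 = min(int(x), old_size - 1)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0  # horizontal blend weight
            cell = []
            for d in range(dim):
                top = grid[y0][x0][d] * (1 - wx) + grid[y0][x1][d] * wx
                bottom = grid[y1][x0][d] * (1 - wx) + grid[y1][x1][d] * wx
                cell.append(top * (1 - wy) + bottom * wy)
            row.append(cell)
        out.append(row)
    return out


# Assumed patch size of 16 (hypothetical; check the model config).
old = grid_size(384, 16)  # pretraining resolution
new = grid_size(576, 16)  # VQA fine-tuning resolution

# Toy embeddings: each cell stores its own (row, col) coordinate.
toy = [[[float(r), float(c)] for c in range(old)] for r in range(old)]
resized = resize_pos_embed(toy, new)
```

With these assumed values the patch grid grows from 24x24 to 36x36, so the pretrained position embeddings cannot be reused index-for-index; some form of resizing like the above is one common way to bridge the gap.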