guilk / VLC

Research code for "Training Vision-Language Transformers from Captions Alone"

What is the image resolution for VQA finetuning 384 x 384 like the pretraining? #6

Closed sanyalsunny111 closed 2 years ago

guilk commented 2 years ago

For VQA finetuning, I use 576x576 following the METER paper, but it gives only a marginal improvement over 480x480. Both work better than 384x384.

sanyalsunny111 commented 2 years ago

I am a bit confused: how did you pretrain at 384 x 384 and then finetune at a different resolution? Did you change the positional encoding like ViT does? If yes, please point to that part of the code. If not, what did you add to the finetuning code to support a different resolution?

guilk commented 2 years ago

Check https://github.com/guilk/VLC/blob/ab05438e539c6a08454cbf9b84934814e6ed4452/vlc/modules/objectives.py#L645
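For context, the standard way to finetune a ViT-style model at a higher resolution than pretraining (used by ViT, METER, and similar codebases) is to interpolate the pretrained patch positional embeddings to the new grid size. The sketch below is a minimal, hypothetical illustration of that idea, not the exact code at the link above; the function name `resize_pos_embed` and the assumed layout (a leading [CLS] token followed by a square patch grid) are assumptions.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_size, new_size):
    """Interpolate ViT positional embeddings to a new patch-grid size.

    pos_embed: tensor of shape (1, 1 + old_size*old_size, dim),
               where index 0 is the [CLS] token embedding.
    Returns a tensor of shape (1, 1 + new_size*new_size, dim).
    """
    cls_tok, grid = pos_embed[:, :1], pos_embed[:, 1:]
    dim = grid.shape[-1]
    # Reshape the flat patch sequence back into a 2D grid: (1, dim, H, W)
    grid = grid.reshape(1, old_size, old_size, dim).permute(0, 3, 1, 2)
    # Bicubic interpolation to the new grid resolution
    grid = F.interpolate(grid, size=(new_size, new_size),
                         mode="bicubic", align_corners=False)
    # Flatten back to a sequence and re-attach the [CLS] embedding
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_size * new_size, dim)
    return torch.cat([cls_tok, grid], dim=1)

# Example: 384x384 with 16x16 patches gives a 24x24 grid;
# 576x576 with the same patch size gives a 36x36 grid.
pretrained = torch.randn(1, 1 + 24 * 24, 768)
resized = resize_pos_embed(pretrained, old_size=24, new_size=36)
```

Because the [CLS] token has no spatial position, it is carried over unchanged; only the patch embeddings are interpolated.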