This is a PyTorch/GPU implementation of the paper VLC. Our work builds on MAE and the pioneering work ViLT.
```bash
pip install -r requirements.txt
pip install -e .
```
Task | Base set (4M) | Large set (5.6M) |
---|---|---|
Pre-training | vlc_baseset.ckpt | vlc_largeset.ckpt |
VQA | vlc_baseset_vqa_submission | vlc_largeset_vqa_submission |
We follow ViLT and use pyarrow to serialize the datasets; see this link for details. A rough sketch of the serialization pattern is shown below.
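For illustration only, this is the general pattern for writing image-caption pairs to a single `.arrow` file; the column names (`image`, `caption`) and output path here are assumptions, not ViLT's exact schema:

```python
import pandas as pd
import pyarrow as pa

# Illustrative records: raw image bytes paired with captions. Column
# names and the output path are assumptions, not ViLT's exact schema.
records = [{"image": open("example.jpg", "rb").read(),
            "caption": "a dog running on the grass"}]
table = pa.Table.from_pandas(pd.DataFrame(records))

# Write the table as a single .arrow file.
with pa.OSFile("dataset.arrow", "wb") as sink:
    with pa.RecordBatchFileWriter(sink, table.schema) as writer:
        writer.write_table(table)
```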
As there are some corrupted images in Google Conceptual Captions, we remove any image that cannot be loaded by PIL. See check_valid_images.py in the data_process folder; a minimal sketch of the check is shown below.
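A sketch of such a validity check (the directory path and function name here are hypothetical; the actual logic lives in check_valid_images.py):

```python
import os
from PIL import Image

def is_valid_image(path: str) -> bool:
    """Return True if PIL can fully decode the image at `path`."""
    try:
        with Image.open(path) as img:
            img.convert("RGB")  # force a full decode, not just a header read
        return True
    except Exception:
        return False

# Hypothetical usage: keep only the CC images that PIL can load.
image_dir = "<CC_IMAGE_DIR>"  # placeholder path
valid_files = [f for f in os.listdir(image_dir)
               if is_valid_image(os.path.join(image_dir, f))]
```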
To pre-train VLC, run:

```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm_mae per_gpu_batchsize=<BS_FITS_YOUR_GPU> whole_word_masking=True step25k image_size=384 pretrain_path=<PRETRAIN_PATH> log_dir=<LOG_FOLDER> mae_weight=1.0
```
Following ALBEF and UNITER, we also use VG-VQA data during VQAv2 fine-tuning. We keep a VG-VQA question-answer pair only if (1) the corresponding image is in the VQAv2 training or validation split and (2) the answer appears in the VQAv2 answer set.
See map_vg_mscoco.py and write_valid_vgqa.py in the data_process folder; a sketch of the filtering rule is shown below.
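The filtering rule amounts to something like the following sketch (the function name, variable names, and record layout are illustrative; the actual implementation is in the two scripts above):

```python
def filter_vg_qa(vg_qa_pairs, vqa_image_ids, vqa_answer_set):
    """Keep a VG-VQA pair only if its image appears in the VQAv2
    train/val splits and its answer is in the VQAv2 answer set."""
    kept = []
    for pair in vg_qa_pairs:  # e.g. {"image_id": ..., "question": ..., "answer": ...}
        if pair["image_id"] in vqa_image_ids and pair["answer"] in vqa_answer_set:
            kept.append(pair)
    return kept
```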
To fine-tune on VQAv2, run:

```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_vqa_mae_randaug per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> log_dir=<LOG_FOLDER> image_size=576 learning_rate=5e-4
```
To fine-tune on NLVR2, run:

```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_nlvr2_mae_randaug per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> log_dir=<LOG_FOLDER> image_size=384 learning_rate=5e-4
```
To fine-tune for image-text retrieval on COCO, run:

```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_irtr_coco_mae_randaug per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> log_dir=<LOG_FOLDER> image_size=384 learning_rate=5e-4
```
To fine-tune for image-text retrieval on Flickr30K, run:

```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_irtr_f30k_mae_randaug per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> log_dir=<LOG_FOLDER> image_size=384 learning_rate=5e-4
```
To fine-tune the vision encoder on ImageNet-1K classification (following the MAE fine-tuning recipe), run:

```bash
python -m torch.distributed.launch --nnodes=2 --nproc_per_node=16 --master_port 44875 main_finetune.py \
    --batch_size 32 \
    --model vit_base_patch16 \
    --finetune <PRETRAINED_MODEL> \
    --epochs 100 \
    --input_size 384 \
    --blr 5e-4 \
    --layer_decay 0.65 \
    --weight_decay 0.05 \
    --drop_path 0.1 \
    --reprob 0.25 \
    --mixup 0.8 \
    --cutmix 1.0 \
    --dist_eval \
    --data_path <ImageNet-1K ROOT> \
    --output_dir <DIR to SAVE CHECKPOINTS>
```
The code is based on ViLT, which is licensed under Apache 2.0, and MAE, which is licensed under CC BY-NC 4.0.