guilk / VLC

Research code for "Training Vision-Language Transformers from Captions Alone"

imagenet finetuned #4

Closed tankche1 closed 2 years ago

tankche1 commented 2 years ago

Hi,

Thanks for the great work!

The code for ImageNet fine-tuning does not seem to work, as it cannot load a model from VLCTransformer.

I think that is the initial version for MAE.

I wonder if you have the code for the VLC model.

Also, many fine-tuning commands are missing the test mode (running run.py again with test_only=True).

guilk commented 2 years ago

Hi,

Can you check the README file in the imagenet_classification subfolder?

For evaluation on ImageNet1K, you may need to convert the model into the MAE style and then finetune it following the commands. It should give you the numbers reported in our paper.
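For readers who want a concrete starting point, here is a minimal sketch of such a conversion. It is an illustration only: the "state_dict" key follows the PyTorch Lightning checkpoint format, the "transformer." prefix is an assumption about how the vision backbone is named, and the repo's own conversion instructions should take precedence.

import torch

# Hypothetical input/output paths; replace with your own checkpoint locations.
vlc_ckpt = torch.load("vlc_pretrained.ckpt", map_location="cpu")
state_dict = vlc_ckpt.get("state_dict", vlc_ckpt)

# Keep only the vision-transformer weights, strip the (assumed) "transformer." prefix,
# and save them under the "model" key that the MAE fine-tuning script expects.
mae_style = {k[len("transformer."):]: v
             for k, v in state_dict.items() if k.startswith("transformer.")}
torch.save({"model": mae_style}, "vlc_converted_for_mae.pth")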

tankche1 commented 2 years ago

Thanks!

Another minor question: I see that the paper states that 'a key difference is initialized with MAE pre-trained on ImageNet-1K'. How important is this? Do you use the weights from the original MAE git repo?

guilk commented 2 years ago

In our preliminary experiments, masked image modeling does not work well if the model is initialized from supervised ImageNet1K.

Yes, we just use the weights provided by the MAE git repo. This initialization is different from previous methods, which rely on supervised ImageNet1K or pre-trained BERT weights.
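As a rough illustration of what such an initialization looks like (a sketch, not the repo's code; the timm ViT here is only a stand-in for the VLC vision backbone, and strict=False is used so non-matching heads are simply skipped):

import torch
import timm

# Stand-in backbone for illustration; VLC's actual vision tower is defined in the repo.
model = timm.create_model("vit_base_patch16_224", pretrained=False)

# mae_pretrain_vit_base.pth is the encoder-only checkpoint released by the MAE repo.
ckpt = torch.load("mae_pretrain_vit_base.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)

# strict=False skips the classification head and any keys that do not match.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")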

tankche1 commented 2 years ago

Hi, I tried to reproduce the results but found them lower on VQAv2 and NLVR2. I wonder if you can share your exact script for pre-training. I get 69.5% on VQAv2 and 73.3% on NLVR2.

Here is my pre-training command:

python run.py with num_gpus=${P_PER_NODE} num_nodes=$SLURM_JOB_NUM_NODES task_mlm_itm_mae per_gpu_batchsize=16 whole_word_masking=True step200k image_size=384 mae_weight=1.0 

and in the config I have:

learning_rate = 1e-4
weight_decay = 0.01
mask_ratio = 0.6
loss_names = _loss_names({"itm": 1, "mlm": 1, "mae": 1})
batch_size = 4096
mae_weight = 1.0

Any help is greatly appreciated!

Also some minor questions:

  1. Why are the numbers in Table 3 lower than those in Table 2 (in the paper)?
  2. Do I need to set a different learning rate for a different resolution (224 vs. 384)? How much gain is there from using a larger resolution (384 vs. 224) for training?

guilk commented 2 years ago

  1. The numbers in Table 3 are trained with 25K steps, while those in Table 2 are trained with 200K steps.
  2. During pre-training, I use 384x384 instead of 224x224. In our preliminary experiments, image resolution mattered a lot. I did not try different learning rates.

Could you reproduce our VQAv2 result using our provided checkpoint? The 4M images version.

guilk commented 2 years ago

Hi, did you use MAE pretrained weights during pretraining?

tankche1 commented 2 years ago

I can reproduce using the provided checkpoint. I also use MAE pretrained weights during pretraining.

guilk commented 2 years ago

python run.py with data_root=/hcrr01-weka/liangkegui/datasets/downstreams/arrows num_gpus=16 num_nodes=8 task_mlm_itm_mae whole_word_masking=True step200k per_gpu_batchsize=16 pretrain_path="/hcrr01-weka/liangkegui/datasets/pretraining/pretrained_models/mae_pretrain_vit_base_full.pth" image_size=384 log_dir="/mnt/root/vilt_checkpoints/task_mlm_itm_mae_step200k_img384_vinvl_decoder8" mae_weight=1.0

This is the command that I use to train our models. It seems similar to yours.

tankche1 commented 2 years ago

Thanks. Let me re-train a model and get back to you.

guilk commented 2 years ago

He also mentions a one-point gap on downstream tasks in this issue: https://github.com/guilk/VLC/issues/2#issue-1276351160

But it should not be less than 70 on VQAv2.

Could you check whether all the data is prepared and loaded during pretraining? Also check whether the MLM loss is decreasing smoothly. You can run it with 25K steps, which should give you a number near 70 at least.
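For reference, a shorter sanity-check run could look like the command below. This assumes a step25k named config exists alongside step200k (following the ViLT-style config convention); if it does not, set max_steps directly in config.py instead.

python run.py with num_gpus=${P_PER_NODE} num_nodes=$SLURM_JOB_NUM_NODES task_mlm_itm_mae per_gpu_batchsize=16 whole_word_masking=True step25k image_size=384 mae_weight=1.0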

tankche1 commented 2 years ago

The dataset is OK. Can you share your TensorBoard log, like this one from ViLT (https://tensorboard.dev/experiment/mNHxDM08R6eHKeU0JHn5vg/#scalars)?

guilk commented 2 years ago

I uploaded our training logs in the releases. I don't know why events.out.tfevents.1647495391.GCRHYPC034.1162.0 only has part of the logs. 00_2c2f1f0dc1e00d6c5fb1cc32af90d35b-ps-0_stdout.txt is a complete one but has less information.
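For anyone who wants to compare their own curves against these released logs, one option (not part of the repo) is to read the events file with TensorBoard's EventAccumulator; the tag selection below is just a placeholder:

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Path to the released events file from the GitHub release page.
acc = EventAccumulator("events.out.tfevents.1647495391.GCRHYPC034.1162.0")
acc.Reload()

scalar_tags = acc.Tags()["scalars"]
print(scalar_tags)  # inspect the available tags (e.g. the MLM/ITM/MAE losses)

# Dump the first tag as (step, value) pairs; swap in the tag you care about.
for event in acc.Scalars(scalar_tags[0]):
    print(event.step, event.value)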

tankche1 commented 2 years ago

Thank you. That is very helpful! My curve now seems to work!

tankche1 commented 2 years ago

I still get 71.34% on VQAv2. The MAE loss has the same curve, but there is a small gap in MLM (small in train but quite large in val). Any advice on that?

Another thing: I find that my running time is 0.3x slower than yours even though I am also using 126 V100s. My training speed is similar to ViLT. Can you explain why your training is faster than ViLT? (I did not change anything in the code.)

[Screenshot attached: training curves, 2022-08-01]

guilk commented 2 years ago

Interesting. What did you change to improve the VQA result from 69.5 to 71.34? The image size is 384x384 during pre-training and 480x480 or 576x576 during finetuning. My result is 71.69 using the model trained with 100k steps. Can you check if you are using all the data? How about the ITM loss?

There are a lot of factors that may affect training speed. You need to store your data on an SSD, change num_workers in config.py to a suitable number, and probably resize the short edge to 384 during data preprocessing.
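As an example of that last step, a minimal short-edge resize could look like the sketch below (illustration only; the file names are placeholders and the repo's arrow-writing scripts may handle this differently):

from PIL import Image

def resize_short_edge(img: Image.Image, target: int = 384) -> Image.Image:
    # Scale so the shorter side equals `target`, keeping the aspect ratio.
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)

img = resize_short_edge(Image.open("example.jpg").convert("RGB"))
img.save("example_resized.jpg", quality=95)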

tankche1 commented 2 years ago

I made some mistakes when preprocessing SBU, and I only loaded the encoder of MAE before (the official repo weights only have the encoder).
I am using all the data. I know that for CC3M I only have 29 arrows whereas you have 31 arrows. I will try to improve the speed. Here is the ITM loss:

[Screenshot attached: ITM loss curve, 2022-08-01]

guilk commented 2 years ago

Based on the log, you can compute the total amount of training data seen as batch_size x num_steps.
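For example, with the default batch_size of 4096 and step200k, that is 4096 x 200,000, roughly 819M samples, or about 200 passes over the 4M-image pre-training corpus.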

I use pretrained encoder and decoder weights from the official MAE repo. Check this issue.

tankche1 commented 2 years ago

Thank you for your patience!

Yeah, I tried that. I control the batch size to be as close to 2048 as possible (mine is 2016 due to the GPU count). The training data amount is on the same scale, as shown in the curve. I will try to use the same weights. My current weights give the same results on ImageNet as the official report.

[Screenshot attached: 2022-08-01]

tankche1 commented 2 years ago

I also noticed that you set norm_pixel = False in the code. Is that intended?

guilk commented 2 years ago

You don't need to change the batch size, as pytorch_lightning will automatically do the gradient accumulation. Just use 4096 as default unless other batch sizes work better.

I think you have to use both the pretrained encoder and decoder weights as shown in that issue. This should give you a better result. So you are only using the pretrained encoder weights?

I think norm_pixel should be True. That looks like a mistake. Thanks for pointing it out.

tankche1 commented 2 years ago

My batch size is 4032. The code uses (4096 // (per_gpu_batchsize * gpu_nums)) * (per_gpu_batchsize * gpu_nums) as the effective batch size.
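For concreteness, a small sketch of that computation with the numbers from this thread (illustration only; the variable names follow the launch flags rather than the repo's exact code):

# per_gpu_batchsize=16 on 126 GPUs, targeting the default batch_size of 4096.
batch_size = 4096
per_gpu_batchsize = 16
gpu_nums = 126

per_step_batch = per_gpu_batchsize * gpu_nums            # 2016 samples per forward/backward
grad_accum_steps = max(batch_size // per_step_batch, 1)  # 2 accumulation steps
effective_batch = grad_accum_steps * per_step_batch      # 4032, matching the number above
print(grad_accum_steps, effective_batch)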

Yes, I use both the encoder and decoder.

I will try setting it to True.

guilk commented 2 years ago

Sorry, it actually does not matter for the norm_pixel flag. I compute the MAE loss in objectives.py, which only uses normalized pixels.
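For readers unfamiliar with the normalized-pixel target, here is a sketch in the style of the official MAE loss (the tensor shapes and function name are illustrative; see objectives.py for the actual implementation):

import torch

def mae_loss_norm_pix(pred, target_patches, mask):
    # pred, target_patches: (B, num_patches, patch_dim); mask: (B, num_patches), 1 = masked.
    # Normalize each patch by its own mean and variance, as in MAE's norm_pix_loss.
    mean = target_patches.mean(dim=-1, keepdim=True)
    var = target_patches.var(dim=-1, keepdim=True)
    target = (target_patches - mean) / (var + 1e-6) ** 0.5

    loss = ((pred - target) ** 2).mean(dim=-1)   # per-patch MSE
    return (loss * mask).sum() / mask.sum()      # average over masked patches only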

To get the result of 71.34%, how did you initialize your model? Are we using the same provided checkpoint?

tankche1 commented 2 years ago

I am using an MAE that I trained myself for 800 epochs using the MAE code. It achieves the same numbers as in the MAE paper. I load both the encoder and decoder.