Closed: tankche1 closed this issue 2 years ago
Hi,
Can you check the README file in the imagenet_classification subfolder?
For evaluation on ImageNet-1K, you may need to convert the model to MAE style and then finetune it following the commands there. That should give you the numbers reported in our paper.
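The "convert to MAE style" step could look roughly like the sketch below: keep only the vision-encoder weights, strip the multimodal wrapper's key prefix, and nest the result under the "model" key that MAE's finetuning script reads. The "transformer." prefix and key layout are assumptions, not the repo's actual names; in practice the dicts come from torch.load(...) and go to torch.save(...).

```python
def convert_to_mae_style(state_dict, prefix="transformer."):
    # Keep only keys under the (hypothetical) vision-encoder prefix,
    # drop the prefix, and wrap as {"model": ...} as MAE's finetune
    # script expects.
    encoder = {
        key[len(prefix):]: tensor
        for key, tensor in state_dict.items()
        if key.startswith(prefix)
    }
    return {"model": encoder}
```

Check the actual checkpoint keys in the repo before relying on any particular prefix.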
Thanks!
Another minor question: I see that the paper states that 'a key difference is initialized with MAE pre-trained on ImageNet-1K'. How important is this? Do you use the weights from the original MAE git repo?
In our preliminary experiments, masked image modeling does not work well if the model is initialized from supervised ImageNet1K.
Yes, we just use the weights provided by MAE git repo. This initialization is different from previous methods that rely on supervised ImageNet1k or pre-trained BERT weights.
Hi, I tried to reproduce the results but found them lower on VQAv2 and NLVR2: I get 69.5% on VQAv2 and 73.3% on NLVR2. I wonder if you can share your exact pre-training script.
Here is my pre-training command:
python run.py with num_gpus=${P_PER_NODE} num_nodes=$SLURM_JOB_NUM_NODES task_mlm_itm_mae per_gpu_batchsize=16 whole_word_masking=True step200k image_size=384 mae_weight=1.0
and in config I have:
learning_rate = 1e-4
weight_decay = 0.01
mask_ratio = 0.6
loss_names = _loss_names({"itm": 1, "mlm": 1, "mae": 1})
batch_size = 4096
mae_weight = 1.0
Any help is much appreciated!
Also some minor questions:
Could you reproduce our VQAv2 result using our provided checkpoint (the 4M-images version)?
Hi, did you use MAE pretrained weights during pretraining?
I can reproduce using the provided checkpoint. I also use MAE pretrained weights during pretraining.
python run.py with data_root=/hcrr01-weka/liangkegui/datasets/downstreams/arrows num_gpus=16 num_nodes=8 task_mlm_itm_mae whole_word_masking=True step200k per_gpu_batchsize=16 pretrain_path="/hcrr01-weka/liangkegui/datasets/pretraining/pretrained_models/mae_pretrain_vit_base_full.pth" image_size=384 log_dir="/mnt/root/vilt_checkpoints/task_mlm_itm_mae_step200k_img384_vinvl_decoder8" mae_weight=1.0
This is the command that I use to train our models. It seems similar to yours.
Thanks. Let me re-train a model and get back to you.
He also mentions a one-point gap in downstream tasks in this issue: https://github.com/guilk/VLC/issues/2#issue-1276351160
But it should not be less than 70 on VQAv2.
Could you check whether all the data is prepared and loaded during pretraining? Also check that the MLM loss decreases smoothly. You can run it for 25K steps; that should already give you a number near 70.
The dataset is OK. Can you share your TensorBoard log? Like this one from ViLT: https://tensorboard.dev/experiment/mNHxDM08R6eHKeU0JHn5vg/#scalars
I uploaded our training logs in the releases. I don't know why events.out.tfevents.1647495391.GCRHYPC034.1162.0 only contains part of the logs. 00_2c2f1f0dc1e00d6c5fb1cc32af90d35b-ps-0_stdout.txt is complete but carries less information.
Thank you. That is very helpful! My curve now seems to work!
I still get 71.34% on VQAv2. The MAE loss has the same curve, but there is a gap in MLM (small in train but quite large in val). Any advice on that?
Another thing: I find that my runs are 0.3x slower than yours even though I am also using 126 V100s. My training speed is similar to ViLT's. Can you explain why your training is faster than ViLT? (I did not change anything in the code.)
Interesting. What did you change to improve the VQA result from 69.5 to 71.34? The image size is 384x384 during pre-training and 480x480 or 576x576 during finetuning. My result is 71.69 using the model trained for 100k steps. Can you check whether you are using all the data? How about the ITM loss?
There are a lot of factors that may affect training speed. You need to store your data on an SSD, change num_workers in config.py to a suitable number, and probably resize the short edge to 384 during data preprocessing.
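The short-edge resize above preserves the aspect ratio; only the target size of the shorter side is fixed. A minimal sketch of the size computation (with torchvision, transforms.Resize(384) with an int argument does exactly this):

```python
def short_edge_size(w, h, target=384):
    # Scale so the shorter edge becomes `target`, keeping aspect ratio.
    scale = target / min(w, h)
    return round(w * scale), round(h * scale)
```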
I made a mistake when preprocessing SBU. And I only loaded the encoder of MAE before (the official repo weights only include the encoder).
I am using all the data. I know that for CC300M I only have 29 arrows whereas you have 31 arrows.
I will try to improve the speed.
Here is the ITM loss:
Based on the log, you can compute the total amount of training data as batch_size x num_steps.
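With the numbers discussed in this thread (batch size 4096, step200k, roughly 4M pre-training images), the sanity check is just:

```python
# Total samples seen during pre-training = batch_size * num_steps.
batch_size = 4096
num_steps = 200_000
total_samples = batch_size * num_steps

# Rough number of passes over a ~4M-image corpus.
epochs_over_4m = total_samples / 4_000_000
```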
I use the pretrained encoder and decoder weights from the official MAE repo; check this issue.
Thank you for your patience!
Yeah, I tried that. I keep the batch size as close to 2048 as possible (mine is 2016 due to the GPU count). The amount of training data is on the same scale as shown in the curve. I will try to use the same weights. My current weights give the same results on ImageNet as the official report.
I also noticed that you set norm_pixel = False in the code?
You don't need to change the batch size; PyTorch Lightning will automatically do gradient accumulation. Just use the default of 4096 unless other batch sizes work better.
I think you have to use both the pretrained encoder and decoder weights as shown in that issue. This should give you a better result. So you are only using the pretrained encoder weights?
I think norm_pixel should be True. It should be a mistake. Thanks for pointing it out.
My batch size is 4032. The code uses 4096 // (per_gpu_batchsize * gpu_nums) * (per_gpu_batchsize * gpu_nums) as the batch size.
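That rounding picks the largest multiple of the per-step world batch (per_gpu_batchsize * gpu_nums) that fits under the 4096 target; with 16 per GPU on 126 GPUs this gives 4032, matching the number above. A small sketch:

```python
def effective_batch_size(target, per_gpu_batchsize, num_gpus):
    # Largest multiple of the per-step world batch not exceeding `target`.
    world_batch = per_gpu_batchsize * num_gpus
    return target // world_batch * world_batch
```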
Yes, I use both encoder and decoder.
I will try setting it to True.
Sorry, the norm_pixel flag actually does not matter. I compute the MAE loss in objectives.py, which only uses normalized pixels.
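For context, "normalized pixels" here follows the norm_pix_loss option in the official MAE repo: each target patch is standardized by its own mean and variance before the reconstruction loss is computed. A NumPy sketch of that normalization (the function name is mine, not the repo's):

```python
import numpy as np

def normalize_patch_targets(patches, eps=1e-6):
    # patches: (num_patches, patch_dim) flattened pixel values.
    # Standardize each patch by its own mean and variance, as in
    # MAE's norm_pix_loss target computation.
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)
```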
To get the result of 71.34%, how did you initialize your model? Are we using the same provided checkpoint?
I am using an MAE that I trained for 800 epochs using the MAE code. It achieves the same numbers as the MAE paper. I load both encoder and decoder.
Hi,
Thanks for the great work!
The code for ImageNet finetuning does not seem to work, as it cannot load a model from VLCTransformer.
I think that is the initial version for MAE.
I wonder if you have the code for VLC model.
Also, many finetuning commands are missing the test mode (running run.py again with test_only=True).