vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by Pytorch Lightning

MoCo-v3 pretraining for 1000 epochs leads to collapsed learning with ResNet-50 #292

Closed trungpx closed 1 year ago

trungpx commented 1 year ago

Hello guys,

I am currently working on a project using your library and found an issue with MoCo-v3. I would like to share it and ask whether you have faced the same problem.

Statement: when training MoCo-v3 on ImageNet-100 with a 1000-epoch config, the learning suddenly collapses.

More details can be seen in the wandb figure below. [figure: ResNet-50 training curves showing the collapse]

I have not faced the issue when running ResNet-18 with the same config (1000 epochs), as shown below. [figure: ResNet-18 training curves, no collapse]
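One cheap way to confirm this numerically, independent of the wandb curves, is to look at the spread of the L2-normalized features; the helper below is just a generic sketch, not something solo-learn provides.

```python
import torch
import torch.nn.functional as F

def feature_std(z: torch.Tensor) -> float:
    """Mean per-dimension std of L2-normalized features.

    Values close to 0 mean all samples map to (almost) the same point,
    i.e. the representation has collapsed.
    """
    z = F.normalize(z, dim=-1)          # project features onto the unit sphere
    return z.std(dim=0).mean().item()   # average std across feature dimensions

# illustrative usage with random 256-d features for a batch of 128
z = torch.randn(128, 256)
print(f"feature std: {feature_std(z):.4f}")  # healthy features stay well above 0
```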

Below is the bash command I used for running MoCo-v3 with ResNet-50 (taken from this repo, which I cloned a few days ago). I am now re-running with the latest code from this GitHub repo (08/13/2022) and it seems to show the same behavior. The issue started from epoch >250.

python3 ../../../main_pretrain.py \
    --dataset imagenet100 \
    --backbone resnet50 \
    --train_data_path ~/workspace/datasets/imagenet-100/train \
    --val_data_path ~/workspace/datasets/imagenet-100/val \
    --max_epochs 1000 \
    --devices 0,1 \
    --accelerator gpu \
    --strategy ddp \
    --sync_batchnorm \
    --precision 16 \
    --optimizer lars \
    --eta_lars 0.02 \
    --exclude_bias_n_norm_lars \
    --scheduler warmup_cosine \
    --lr 0.3 \
    --classifier_lr 0.3 \
    --weight_decay 1e-6 \
    --batch_size 128 \
    --num_workers 4 \
    --data_format dali \
    --brightness 0.4 \
    --contrast 0.4 \
    --saturation 0.2 \
    --hue 0.1 \
    --gaussian_prob 1.0 0.1 \
    --solarization_prob 0.0 0.2 \
    --min_scale 0.2 \
    --num_crops_per_aug 1 1 \
    --name mocov3_res50 \
    --project AAAI2023_ImageNet100_1000ep \
    --entity myentity \
    --save_checkpoint \
    --wandb \
    --method mocov3 \
    --proj_hidden_dim 4096 \
    --pred_hidden_dim 4096 \
    --temperature 0.2 \
    --base_tau_momentum 0.99 \
    --final_tau_momentum 1.0
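For reference, I understand --base_tau_momentum 0.99 / --final_tau_momentum 1.0 to mean a cosine-increasing EMA momentum, as in BYOL/MoCo-v3; the sketch below shows that standard formula (assumed, I have not checked solo-learn's exact implementation).

```python
import math

def momentum_tau(step: int, max_steps: int,
                 base_tau: float = 0.99, final_tau: float = 1.0) -> float:
    """Cosine schedule from base_tau at step 0 to final_tau at max_steps
    (standard BYOL/MoCo-v3 formulation; assumed, not copied from solo-learn)."""
    cos_term = (math.cos(math.pi * step / max_steps) + 1) / 2
    return final_tau - (final_tau - base_tau) * cos_term

# e.g. with 1000 epochs of 514 steps each (as in the progress bar below)
total_steps = 1000 * 514
for s in (0, total_steps // 2, total_steps):
    print(f"step {s:>6}: tau = {momentum_tau(s, total_steps):.4f}")
```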

One more thing I noticed: when training with ResNet-50, solo-learn shows the warning below, which does not appear when training with ResNet-18 (the same warning is printed twice):

Epoch 0: 0%|| 0/514 [00:00<?, ?it/s] /root/anaconda3/envs/new-solo/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
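For context, the warning refers to the call order PyTorch expects in a plain training loop; a minimal standalone sketch (no solo-learn / Lightning code involved) looks like this:

```python
import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for _ in range(100):
    loss = model(torch.randn(4, 8)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # optimizer first ...
    scheduler.step()   # ... then the scheduler, otherwise PyTorch emits this warning
```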

Note that the code was cloned from the latest version of the repo.

Some info from my environment:
pytorch-lightning==1.6.4
timm==0.6.7
torch==1.10.0+cu111
torchaudio==0.10.0
torchmetrics==0.6.0
torchvision==0.11.1

It would be great if you could take a look at it.

Thanks so much!

vturrisi commented 1 year ago

Hey,

The warning is normal; we added it some months ago. About the training, it's very likely that these parameters are not good (the lr, most likely). We tuned the parameters quite a bit to get the best results with resnet18, and you should probably play around with them as well. I think @DonkeyShot21 was trying some parameters with resnet50, but I'm not sure.
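In case it helps as a starting point for that tuning: a common heuristic with LARS is the linear scaling rule, lr = base_lr * total_batch_size / 256, and then trying a small grid around that value. The helper below is purely illustrative, not a solo-learn utility.

```python
def candidate_lrs(base_lr: float, batch_size: int, num_gpus: int):
    """Linear-scaling heuristic plus a small grid around it (illustrative only)."""
    scaled = base_lr * batch_size * num_gpus / 256
    return [round(scaled * f, 4) for f in (0.25, 0.5, 1.0, 2.0)]

# with the flags from the command above: --lr 0.3, --batch_size 128, 2 GPUs
print(candidate_lrs(base_lr=0.3, batch_size=128, num_gpus=2))
# -> [0.075, 0.15, 0.3, 0.6]
```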

trungpx commented 1 year ago

Thanks for confirming that the mentioned warning is not an issue. I assumed the config for ResNet-18 would also work for ResNet-50; indeed, it failed for ResNet-50. As you suggested, I will need to sweep the parameters for ResNet-50.
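Something like the sketch below is what I have in mind for the sweep; it just prints command variants, reusing a few flags from my ResNet-50 command above, and is purely illustrative.

```python
from itertools import product

# Purely illustrative: print commands for a small lr / weight-decay sweep,
# reusing flags from the ResNet-50 command above (remaining flags omitted here).
lrs = [0.15, 0.3, 0.6]
weight_decays = [1e-6, 1e-5]

for lr, wd in product(lrs, weight_decays):
    print(
        "python3 ../../../main_pretrain.py --method mocov3 --backbone resnet50 "
        f"--lr {lr} --classifier_lr {lr} --weight_decay {wd} "
        f"--name mocov3_res50_lr{lr}_wd{wd}"
        "  # ... plus the remaining flags"
    )
```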