nanhuayu closed this issue 2 years ago.
There seem to be some problems with the lr scheduler. Should the multi-GPU version of the warmup_cosine scheduler be changed? @vturrisi @DonkeyShot21
if self.scheduler == "warmup_cosine"
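For context, here is a minimal, self-contained sketch of what a per-step warmup + cosine schedule computes. This is illustrative only, not solo-learn's actual implementation; the point is that the decay depends on the total number of training steps, which under DDP depends on the per-device dataloader length.

```python
import math

# Illustrative warmup + cosine schedule (not solo-learn's code).
def warmup_cosine_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Under DDP each process sees dataset_size / (batch_size_per_gpu * num_gpus)
# batches per epoch, so total_steps has to account for the number of devices;
# if it does not, the multi-GPU schedule diverges from the single-GPU one.
```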
Here are the multi-GPU results of the lr scheduler:
Here are the single-GPU results of the lr scheduler:
Yes, that doesn't look correct, I'll get back to you soon. Make sure you are using the latest commit.
@nanhuayu I just ran simsiam for 20 epochs with 1 GPU and batch size 256 and 2 GPUs and batch size 128 (per GPU). My results are pretty much the same (the scheduler is exactly the same, and the losses are a tiny bit different because of random initialization or stuff like this). Can you share your scripts and check that you are running the latest version?
python3 main_pretrain.py \
--dataset $1 \
--backbone resnet18 \
--data_dir ./datasets \
--max_epochs 20 \
--devices 0 \
--accelerator gpu \
--precision 16 \
--optimizer sgd \
--scheduler warmup_cosine \
--lr 0.5 \
--classifier_lr 0.1 \
--weight_decay 1e-5 \
--batch_size 256 \
--num_workers 4 \
--crop_size 32 \
--brightness 0.4 \
--contrast 0.4 \
--saturation 0.4 \
--hue 0.1 \
--gaussian_prob 0.0 0.0 \
--crop_size 32 \
--num_crops_per_aug 1 1 \
--zero_init_residual \
--name simsiam-$1 \
--project solo-learn \
--entity unitn-mhug \
--wandb \
--save_checkpoint \
--method simsiam \
--proj_hidden_dim 2048 \
--pred_hidden_dim 512 \
--proj_output_dim 2048
python3 main_pretrain.py \
--dataset $1 \
--backbone resnet18 \
--data_dir ./datasets \
--max_epochs 20 \
--devices 0,1 \
--accelerator gpu \
--strategy ddp \
--precision 16 \
--optimizer sgd \
--scheduler warmup_cosine \
--lr 0.5 \
--classifier_lr 0.1 \
--weight_decay 1e-5 \
--batch_size 128 \
--num_workers 4 \
--crop_size 32 \
--brightness 0.4 \
--contrast 0.4 \
--saturation 0.4 \
--hue 0.1 \
--gaussian_prob 0.0 0.0 \
--crop_size 32 \
--num_crops_per_aug 1 1 \
--zero_init_residual \
--name simsiam-$1 \
--project solo-learn \
--entity unitn-mhug \
--wandb \
--save_checkpoint \
--method simsiam \
--proj_hidden_dim 2048 \
--pred_hidden_dim 512 \
--proj_output_dim 2048
I will check the version soon; there may be a small problem in my script.
I ran the same script under both version 1.0.4 and the solo-learn main branch, but the results were different. I haven't installed the solo module via pip yet. Is it possible that the version of the pytorch-lightning or torch library has an impact? @vturrisi
python3 main_pretrain.py \
--dataset $1 \
--backbone resnet18 \
--data_dir ./datasets \
--max_epochs 20 \
--devices 0,1 \
--accelerator gpu \
--strategy ddp \
--precision 16 \
--optimizer sgd \
--scheduler warmup_cosine \
--lr 0.5 \
--classifier_lr 0.1 \
--weight_decay 1e-5 \
--batch_size 128 \
--num_workers 4 \
--crop_size 32 \
--brightness 0.4 \
--contrast 0.4 \
--saturation 0.4 \
--hue 0.1 \
--gaussian_prob 0.0 0.0 \
--crop_size 32 \
--num_crops_per_aug 1 1 \
--zero_init_residual \
--name simsiam-$1-ddp2-2 \
--project solo \
--entity nanhuayu \
--wandb \
--save_checkpoint \
--method simsiam \
--proj_hidden_dim 2048 \
--pred_hidden_dim 512 \
--proj_output_dim 2048
I'm not really sure what I'm supposed to look at. Also, main and 1.0.4 are the same version. Can you reclone the repo and try from scratch?
I've tried several times, on different machines, with different pytorch versions, and with both the 1.0.4 and main versions, and it makes no difference. Is there any way to track down the problem? @vturrisi
I've confirmed the bug in version 1.0.4 by printing num_training_steps. @vturrisi @DonkeyShot21
Which pytorch lightning version are you using? @nanhuayu
pytorch-lightning 1.6.4, lightning-bolts 0.5.0 @vturrisi
@nanhuayu Pretty strange that you are getting this behaviour since I couldn't reproduce it. For sure it's related to how pytorch-lightning is parsing stuff, so it might have changed and made our code incompatible. I'm going to fix this as soon as I have time, but for now, you can manually scale the learning rate.
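If it helps, here is a rough sketch of the manual scaling mentioned above (the usual linear scaling rule; the variable names are illustrative and are not main_pretrain.py flags):

```python
# Linear scaling rule sketch: scale the base lr by the ratio of the effective
# batch size to the batch size the base lr was tuned for. Illustrative only.
num_gpus = 2
batch_size_per_gpu = 128
base_batch_size = 256      # reference batch size the base lr was tuned for
base_lr = 0.5              # matches --lr in the scripts above

effective_batch_size = num_gpus * batch_size_per_gpu   # 256 in the 2-GPU run above
scaled_lr = base_lr * effective_batch_size / base_batch_size
print(scaled_lr)           # 0.5 here, since the effective batch size is unchanged
```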
Thanks for your reply. @vturrisi
I changed the code related to num_devices in solo/methods/base.py, which now runs well on the CIFAR and ImageNet-100 datasets.
However, another problem appeared when I used DALI with ImageNet-100.
There should be 130000 / (4 GPUs × 128 batch size) × 400 epochs ≈ 101562 steps, and num_training_steps showed 253 steps, but only about half of the expected steps (50600) appear in the final summary. Is it related to num_crops_per_aug?
python3 main_pretrain.py \
--dataset imagenet100 \
--backbone resnet18 \
--data_dir datasets \
--train_dir imagenet-100/train \
--val_dir imagenet-100/val \
--max_epochs 400 \
--devices 0,1,2,3 \
--accelerator gpu \
--strategy ddp \
--sync_batchnorm \
--precision 16 \
--optimizer lars \
--grad_clip_lars \
--eta_lars 0.02 \
--exclude_bias_n_norm \
--scheduler warmup_cosine \
--lr 0.3 \
--weight_decay 1e-4 \
--batch_size 256 \
--num_workers 4 \
--brightness 0.8 \
--contrast 0.8 \
--saturation 0.8 \
--hue 0.2 \
--num_crops_per_aug 2 \
--name simclr-400ep-imagenet100-resnet18-256 \
--project solo \
--entity nanhuayu \
--dali \
--wandb \
--save_checkpoint \
--method simclr \
--temperature 0.2 \
--proj_hidden_dim 2048
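As a back-of-the-envelope check of the step counts quoted above (this uses the usual steps-per-epoch formula and may not match solo-learn's internals exactly):

```python
# Expected scheduler steps for the DALI ImageNet-100 run described above.
dataset_size = 130_000     # approx. number of ImageNet-100 training images
per_gpu_batch = 128        # per-GPU batch size used in the calculation above
num_gpus = 4
max_epochs = 400

steps_per_epoch = dataset_size / (per_gpu_batch * num_gpus)   # ~253, as printed
total_steps = steps_per_epoch * max_epochs                    # ~101,562 expected
# The final summary instead reported ~50,600 steps, i.e. roughly half of this.
```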
Any news on this? Did you manage to find the problem?
The bug related to num_devices in solo/methods/base.py needs to be fixed, or the pytorch-lightning version needs to be rolled back to 1.5.10. @DonkeyShot21
@nanhuayu I'll check this soon and migrate to the new parameters that pytorch lightning uses.
The issue has been fixed in #269. Nvidia-dali will properly fix this in 1.16 and we will remove the temporary workaround.
Hi, thank you for providing this SSL library and for the quick responses. Now I have a new question, and I don't know if it can be solved.
I've compared the results of SSL methods (SimCLR, BYOL) on CIFAR-10 with a single GPU and with multiple GPUs. Both multi-GPU results are lower than the single-GPU ones after 1000 epochs. The multi-GPU results were obtained with --sync_batchnorm and --strategy ddp. Are there any suggestions for solving this problem? BYOL results are shown below (the batch-size-64 results are still being computed).
SimCLR results are shown below.
I also computed the multi-GPU results with the SGD optimizer.