vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by PyTorch Lightning
MIT License

Decreased Top1 accuracy under Multi GPUs in CIFAR10 #262

Closed: nanhuayu closed this 2 years ago

nanhuayu commented 2 years ago

Hi, thank you for providing this SSL library and for the quick responses. I'm now confused by a new issue, and I don't know whether it can be solved.

I've compared the results of SSL methods (SimCLR, BYOL) on CIFAR-10 with a single GPU and with multiple GPUs. Both multi-GPU results are lower than the single-GPU ones after 1000 epochs. The multi-GPU results were computed with --sync_batchnorm and --strategy ddp. Are there any suggestions for solving this problem?

BYOL results are shown below (the batch size 64 results are still being computed). [image]

SimCLR results are shown below. [image]

I also computed the multi-GPU results with the SGD optimizer. [image]

nanhuayu commented 2 years ago

There seem to be some problems with the learning-rate scheduler. Should the multi-GPU version of the warmup_cosine scheduler be changed? @vturrisi @DonkeyShot21

if self.scheduler == "warmup_cosine"

Here are the multi-GPU learning-rate results:

[image]

Here are the single-GPU learning-rate results:

[image]
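
For context, here is a rough sketch of how a step-based warmup-cosine schedule is usually sized, and why an unaccounted-for device count distorts it. This is plain illustrative Python, not solo-learn's actual computation, and the variable names and numbers are only examples:

# Illustrative only: how the total step budget of a warmup-cosine schedule
# typically depends on dataset size, per-device batch size, and device count.
dataset_size = 50_000        # CIFAR-10 training set
per_device_batch = 128
max_epochs = 1000
warmup_epochs = 10

for num_devices in (1, 2):
    steps_per_epoch = dataset_size // (per_device_batch * num_devices)
    total_steps = steps_per_epoch * max_epochs
    warmup_steps = steps_per_epoch * warmup_epochs
    print(f"{num_devices} GPU(s): {steps_per_epoch} steps/epoch, "
          f"{total_steps} total steps, {warmup_steps} warmup steps")

# If the scheduler is sized assuming 1 device but training actually runs on 2,
# it plans for twice as many steps as will ever execute, so the warmup lasts
# twice as long in real steps and the cosine decay never completes.
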
DonkeyShot21 commented 2 years ago

Yes, that doesn't look correct; I'll get back to you soon. Make sure you are using the latest commit.

vturrisi commented 2 years ago

@nanhuayu I just ran SimSiam for 20 epochs with 1 GPU and batch size 256, and with 2 GPUs and batch size 128 per GPU. My results are pretty much the same (the scheduler is exactly the same, and the losses differ a tiny bit because of random initialization and the like). Can you share your scripts and check that you are running the latest version?

[images]

python3 main_pretrain.py \
    --dataset $1 \
    --backbone resnet18 \
    --data_dir ./datasets \
    --max_epochs 20 \
    --devices 0 \
    --accelerator gpu \
    --precision 16 \
    --optimizer sgd \
    --scheduler warmup_cosine \
    --lr 0.5 \
    --classifier_lr 0.1 \
    --weight_decay 1e-5 \
    --batch_size 256 \
    --num_workers 4 \
    --crop_size 32 \
    --brightness 0.4 \
    --contrast 0.4 \
    --saturation 0.4 \
    --hue 0.1 \
    --gaussian_prob 0.0 0.0 \
    --crop_size 32 \
    --num_crops_per_aug 1 1 \
    --zero_init_residual \
    --name simsiam-$1 \
    --project solo-learn \
    --entity unitn-mhug \
    --wandb \
    --save_checkpoint \
    --method simsiam \
    --proj_hidden_dim 2048 \
    --pred_hidden_dim 512 \
    --proj_output_dim 2048

python3 main_pretrain.py \
    --dataset $1 \
    --backbone resnet18 \
    --data_dir ./datasets \
    --max_epochs 20 \
    --devices 0,1 \
    --accelerator gpu \
    --strategy ddp \
    --precision 16 \
    --optimizer sgd \
    --scheduler warmup_cosine \
    --lr 0.5 \
    --classifier_lr 0.1 \
    --weight_decay 1e-5 \
    --batch_size 128 \
    --num_workers 4 \
    --crop_size 32 \
    --brightness 0.4 \
    --contrast 0.4 \
    --saturation 0.4 \
    --hue 0.1 \
    --gaussian_prob 0.0 0.0 \
    --crop_size 32 \
    --num_crops_per_aug 1 1 \
    --zero_init_residual \
    --name simsiam-$1 \
    --project solo-learn \
    --entity unitn-mhug \
    --wandb \
    --save_checkpoint \
    --method simsiam \
    --proj_hidden_dim 2048 \
    --pred_hidden_dim 512 \
    --proj_output_dim 2048
nanhuayu commented 2 years ago

I will check the version soon; there may be a small problem in the script.

nanhuayu commented 2 years ago

I ran the same script under both the 1.0.4 version and the solo-learn main version, but the results were different. I haven't installed the solo module via pip yet. Is it possible that the version of the pytorch-lightning or torch library has an impact? @vturrisi

[image]

python3 main_pretrain.py \
    --dataset $1 \
    --backbone resnet18 \
    --data_dir ./datasets \
    --max_epochs 20 \
    --devices 0,1 \
    --accelerator gpu \
    --strategy ddp \
    --precision 16 \
    --optimizer sgd \
    --scheduler warmup_cosine \
    --lr 0.5 \
    --classifier_lr 0.1 \
    --weight_decay 1e-5 \
    --batch_size 128 \
    --num_workers 4 \
    --crop_size 32 \
    --brightness 0.4 \
    --contrast 0.4 \
    --saturation 0.4 \
    --hue 0.1 \
    --gaussian_prob 0.0 0.0 \
    --crop_size 32 \
    --num_crops_per_aug 1 1 \
    --zero_init_residual \
    --name simsiam-$1-ddp2-2 \
    --project solo \
    --entity nanhuayu \
    --wandb \
    --save_checkpoint \
    --method simsiam \
    --proj_hidden_dim 2048 \
    --pred_hidden_dim 512 \
    --proj_output_dim 2048

vturrisi commented 2 years ago

I'm not really sure what I'm supposed to look at. Also, the main branch and 1.0.4 are the same version. Can you re-clone the repo and try from scratch?

nanhuayu commented 2 years ago

I've tried several times, on different machines, with different pytorch versions, and with both the 1.0.4 and main versions; it makes no difference. Is there any way to find the problem? @vturrisi

nanhuayu commented 2 years ago

I've confirmed the bug in version 1.0.4 by printing num_training_steps. @vturrisi @DonkeyShot21

[images]
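
As a side note, one way to cross-check a printed step count like this is against the trainer's own estimate, Trainer.estimated_stepping_batches, which is available in pytorch-lightning 1.6. The toy model below is only a sketch of that kind of check, not solo-learn's code:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        # Compare whatever step count the scheduler is built with against
        # Lightning's own estimate of how many optimizer steps will run.
        print("estimated_stepping_batches:", self.trainer.estimated_stepping_batches)
        return torch.optim.SGD(self.parameters(), lr=0.5)


dataset = TensorDataset(torch.randn(512, 32), torch.randint(0, 10, (512,)))
trainer = pl.Trainer(max_epochs=20, accelerator="cpu", devices=1)
trainer.fit(ToyModel(), DataLoader(dataset, batch_size=128))
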
vturrisi commented 2 years ago

Which pytorch lightning version are you using? @nanhuayu

nanhuayu commented 2 years ago

pytorch-lightning 1.6.4, lightning-bolts 0.5.0 @vturrisi

vturrisi commented 2 years ago

@nanhuayu Pretty strange that you are getting this behaviour since I couldn't reproduce it. For sure it's related to how pytorch-lightning is parsing stuff, so it might have changed and made our code incompatible. I'm going to fix this as soon as I have time, but for now, you can manually scale the learning rate.
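
For reference, manual scaling usually follows the common linear scaling rule (learning rate proportional to the global batch size). Whether this matches what solo-learn does internally is not guaranteed, so the numbers below are only illustrative:

# Linear scaling rule: scale the learning rate with the effective (global) batch size.
base_lr = 0.5            # learning rate tuned for the reference batch size
reference_batch = 256
per_device_batch = 128
num_devices = 2

effective_batch = per_device_batch * num_devices
scaled_lr = base_lr * effective_batch / reference_batch
print(f"effective batch {effective_batch} -> lr {scaled_lr}")
# With 2 GPUs x 128 the effective batch is still 256, so the lr stays at 0.5;
# the scaling only kicks in when the global batch size actually changes.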

nanhuayu commented 2 years ago

Thanks for your reply. @vturrisi I changed the code related to num_devices in solo/methods/base.py, and it runs well on the CIFAR and ImageNet-100 datasets.

However, another problem happened when I used DALI with the ImageNet-100 dataset. There should be 130000 / (4 × 128 bs) × 400 ep ≈ 101,562 steps, and num_training_steps showed 253 steps, but there were only about half of the steps (50,600) in the final summary.

Is it related to num_crops_per_aug?

[image]

main_pretrain.py \
    --dataset imagenet100 \
    --backbone resnet18 \
    --data_dir datasets \
    --train_dir imagenet-100/train \
    --val_dir imagenet-100/val \
    --max_epochs 400 \
    --devices 0,1,2,3 \
    --accelerator gpu \
    --strategy ddp \
    --sync_batchnorm \
    --precision 16 \
    --optimizer lars \
    --grad_clip_lars \
    --eta_lars 0.02 \
    --exclude_bias_n_norm \
    --scheduler warmup_cosine \
    --lr 0.3 \
    --weight_decay 1e-4 \
    --batch_size 256 \
    --num_workers 4 \
    --brightness 0.8 \
    --contrast 0.8 \
    --saturation 0.8 \
    --hue 0.2 \
    --num_crops_per_aug 2 \
    --name simclr-400ep-imagenet100-resnet18-256 \
    --project solo \
    --entity nanhuayu \
    --dali \
    --wandb \
    --save_checkpoint \
    --method simclr \
    --temperature 0.2 \
    --proj_hidden_dim 2048
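
Working through the step counts quoted above (these are the comment's own figures; ~130k is the approximate ImageNet-100 training-set size used there):

dataset_size = 130_000
num_devices = 4
per_device_batch = 128
max_epochs = 400

steps_per_epoch = dataset_size / (num_devices * per_device_batch)  # ~253.9
expected_total = steps_per_epoch * max_epochs                      # ~101,562
print(steps_per_epoch, expected_total)
# The final summary reported roughly 50,600 steps, about half of this expected
# total, which is the discrepancy described above.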

DonkeyShot21 commented 2 years ago

Any news on this? Did you manage to find the problem?

nanhuayu commented 2 years ago

The bug related to num_devices in solo/methods/base.py needs to be fixed, or the pytorch-lightning version needs to be rolled back to 1.5.10. @DonkeyShot21

vturrisi commented 2 years ago

@nanhuayu I'll check this soon and migrate to the new parameters that pytorch lightning uses.

vturrisi commented 2 years ago

The issue has been fixed in #269. NVIDIA DALI will properly fix this in 1.16, and we will remove the temporary workaround.