vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by Pytorch Lightning
MIT License

Some errors with multi GPUs in a single node. #253

Closed nanhuayu closed 2 years ago

nanhuayu commented 2 years ago

Hello, I am confused about the errors I get when pretraining models with multiple GPUs on a single node. The pretraining script works well on a single GPU with --devices=0; however, when I change to --devices=0,1, the run crashes. The traceback is shown below:

Traceback (most recent call last):
  File "main_pretrain.py", line 205, in <module>
    main()
  File "main_pretrain.py", line 201, in main
    trainer.fit(model, train_loader, val_loader, ckpt_path=ckpt_path)
  File "/usr/local/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/usr/local/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/spawn.py", line 78, in launch
    mp.spawn(
  File "/usr/local/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/usr/local/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/local/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/local/conda/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TorchGraph.create_forward_hook.<locals>.after_forward_hook'

Then I changed the pretraining script by removing --wandb, but the error persists. The traceback is shown below:

Traceback (most recent call last):
  File "main_pretrain.py", line 205, in <module>
    main()
  File "main_pretrain.py", line 201, in main
    trainer.fit(model, train_loader, val_loader, ckpt_path=ckpt_path)
  File "/usr/local/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/usr/local/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/spawn.py", line 78, in launch
    mp.spawn(
  File "/usr/local/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/usr/local/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/local/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/local/conda/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'dataset_with_index.<locals>.DatasetWithIndex'
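
For context, both tracebacks come down to the spawn start method having to pickle locally defined functions and classes. A minimal standalone repro of the same failure (not solo-learn code; the function and class names here are made up and just mirror the error message):

import pickle

def make_local_dataset():
    # Classes (and functions) defined inside another function cannot be pickled,
    # which is exactly what the spawn start method needs to do when it ships the
    # trainer state to the worker processes.
    class DatasetWithIndex:
        pass
    return DatasetWithIndex()

try:
    pickle.dumps(make_local_dataset())
except (AttributeError, pickle.PicklingError) as err:
    # Prints something like:
    # Can't pickle local object 'make_local_dataset.<locals>.DatasetWithIndex'
    print(err)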

vturrisi commented 2 years ago

Hi @nanhuayu can you share your training script?

DonkeyShot21 commented 2 years ago

You are probably missing --strategy ddp --accelerator gpu. Can you check if that is the problem?
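
For reference, without an explicit strategy Lightning falls back to a spawn-based launcher here (note the mp.spawn frames in the tracebacks above), which has to pickle everything it sends to the workers. A rough sketch of what the suggested flags correspond to on the Trainer side (assuming a PL 1.6-style API, matching the paths in the tracebacks):

import pytorch_lightning as pl

# Rough equivalent of --devices 0,1 --accelerator gpu --strategy ddp on the
# Trainer side (values mirror the failing run above).
trainer = pl.Trainer(
    accelerator="gpu",   # run on GPUs
    devices=[0, 1],      # GPUs 0 and 1 on this node
    strategy="ddp",      # one process per GPU via DistributedDataParallel
    precision=16,
    max_epochs=1000,
)
# trainer.fit(model, train_loader, val_loader)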

nanhuayu commented 2 years ago

@DonkeyShot21 @vturrisi The problem has been solved. Thanks for your reply!

nanhuayu commented 2 years ago
[screenshot: accuracy curves for the 1-GPU vs 4-GPU runs]

Now I have another small question: I've compared the SimCLR results of one GPU and 4 GPUs. Is it normal that the top-1 accuracy dropped from 88.9 to 81.5? @vturrisi

main_pretrain.py --dataset cifar10 --backbone resnet18 --data_dir ./datasets --max_epochs 1000 --devices 0,1,2,3 --accelerator gpu --precision 16 --optimizer sgd --lars --grad_clip_lars --eta_lars 0.02 --exclude_bias_n_norm --scheduler warmup_cosine --lr 0.4 --classifier_lr 0.1 --weight_decay 1e-5 --batch_size 256 --num_workers 4 --crop_size 32 --brightness 0.8 --contrast 0.8 --saturation 0.8 --hue 0.2 --gaussian_prob 0.0 0.0 --crop_size 32 --num_crops_per_aug 1 1 --name simclr-cifar10-ddp4 --project solo --entity nanhuayu --wandb --save_checkpoint --auto_resume --method simclr --temperature 0.2 --proj_hidden_dim 2048 --proj_output_dim 256 --strategy ddp

vturrisi commented 2 years ago

The batch size is per GPU, so when you run with 4 GPUs you quadruple the effective batch size and reduce the number of update steps by a factor of 4. To compensate for this, you should increase your learning rate. That's also why your plots show far fewer steps.
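
As a rough back-of-the-envelope sketch of what changes (numbers taken from the command above; see also the note further down about the lr already being scaled by the number of devices):

# Back-of-the-envelope numbers for the run above (CIFAR-10 has 50k train images).
per_gpu_batch_size = 256
num_gpus = 4
base_lr = 0.4

effective_batch_size = per_gpu_batch_size * num_gpus      # 1024 samples per update
steps_per_epoch_1gpu = 50_000 // per_gpu_batch_size       # ~195 updates per epoch
steps_per_epoch_4gpu = 50_000 // effective_batch_size     # ~48, i.e. 4x fewer updates

# Linear scaling rule: grow the lr with the effective batch size.
scaled_lr = base_lr * num_gpus                            # 1.6
print(effective_batch_size, steps_per_epoch_1gpu, steps_per_epoch_4gpu, scaled_lr)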

I'm leaving it open, close the issue when everything is clear :)

nanhuayu commented 2 years ago

Thanks for your reply! I've changed "--lr 0.4 --classifier_lr 0.1" to "--lr 1.6 --classifier_lr 0.4" and am waiting for the results.

vturrisi commented 2 years ago

Ah, sorry that I didn't mention it. We usually don't scale the lr for the classifier. Also, be careful: if you increase the lr too much, it can also break training.

nanhuayu commented 2 years ago

I'll give it a try.

DonkeyShot21 commented 2 years ago

Actually, we already multiply the lr by the number of devices, so there is no need to increase it in the script: https://github.com/vturrisi/solo-learn/blob/3915fe6294ddd4f3e6a284bd7d2aa9648b2612ae/solo/args/utils.py#L224
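
The gist of that scaling, as a simplified sketch (not the exact code at the link):

# Simplified version of the lr scaling linked above: the lr passed on the command
# line is multiplied by the number of devices, so --lr 0.4 with 4 GPUs already
# trains with an effective lr of 1.6.
def scale_lr(base_lr: float, num_devices: int) -> float:
    return base_lr * num_devices

print(scale_lr(0.4, num_devices=4))  # 1.6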

Did you run hyperparameter tuning with the larger batch size? SimCLR is particularly sensitive to the batch size; you might need to tune the temperature, for instance, among other parameters. Also, I first recommend running the same experiment with the same batch size but 4 GPUs, instead of increasing the batch size and the number of GPUs simultaneously, so you are sure that there is no other difference.

nanhuayu commented 2 years ago

@DonkeyShot21 Maybe I need to look into the reason more carefully; thanks for the direction. I didn't change any hyperparameters except adding "--strategy ddp" and "--devices 0,1,2,3". Maybe some hyperparameter settings are incorrect; I will try to tune the temperature.

main_pretrain.py --dataset cifar10 --backbone resnet18 --data_dir ./datasets --max_epochs 1000 --devices 0 --accelerator gpu --precision 16 --optimizer sgd --lars --grad_clip_lars --eta_lars 0.02 --exclude_bias_n_norm --scheduler warmup_cosine --lr 0.4 --classifier_lr 0.1 --weight_decay 1e-5 --batch_size 256 --num_workers 4 --crop_size 32 --brightness 0.8 --contrast 0.8 --saturation 0.8 --hue 0.2 --gaussian_prob 0.0 0.0 --crop_size 32 --num_crops_per_aug 1 1 --name simclr-cifar10 --project solo --entity nanhuayu --wandb --save_checkpoint --auto_resume --method simclr --temperature 0.2 --proj_hidden_dim 2048 --proj_output_dim 256

main_pretrain.py --dataset cifar10 --backbone resnet18 --data_dir ./datasets --max_epochs 1000 --devices 0,1,2,3 --accelerator gpu --precision 16 --optimizer sgd --lars --grad_clip_lars --eta_lars 0.02 --exclude_bias_n_norm --scheduler warmup_cosine --lr 0.4 --classifier_lr 0.1 --weight_decay 1e-5 --batch_size 256 --num_workers 4 --crop_size 32 --brightness 0.8 --contrast 0.8 --saturation 0.8 --hue 0.2 --gaussian_prob 0.0 0.0 --crop_size 32 --num_crops_per_aug 1 1 --name simclr-cifar10-ddp4 --project solo --entity nanhuayu --wandb --save_checkpoint --auto_resume --method simclr --temperature 0.2 --proj_hidden_dim 2048 --proj_output_dim 256 --strategy ddp

DonkeyShot21 commented 2 years ago

Yeah, so what I meant was something like --devices 0,1,2,3 --batch_size 64, leaving everything else the same. This way you isolate the difference in performance introduced by using multiple GPUs. Hopefully this results in comparable accuracy, as it should.

nanhuayu commented 2 years ago

The train_acc1 and val_acc1 are much lower in the first 200 epochs with 4 GPUs.

DonkeyShot21 commented 2 years ago

@vturrisi I thought we fixed all gathering issues some time ago, can you check?

vturrisi commented 2 years ago

Yeah @DonkeyShot21 @nanhuayu I'll try to check it later today.

DonkeyShot21 commented 2 years ago

Wait a moment, I know what is wrong: you need to add --sync_batchnorm, because otherwise you are computing the batchnorm stats on only 64 samples per GPU.
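
For reference, a minimal sketch of what that flag ends up doing (assuming Lightning's standard SyncBatchNorm conversion):

import pytorch_lightning as pl

# With sync_batchnorm=True, Lightning converts BatchNorm layers to
# torch.nn.SyncBatchNorm, so BN statistics are computed over the global batch
# (4 x 64 = 256 samples) instead of each GPU's local 64 samples.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=[0, 1, 2, 3],
    strategy="ddp",
    sync_batchnorm=True,
)

# Plain-PyTorch equivalent of the conversion:
# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)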

nanhuayu commented 2 years ago

I've tried the --sync_batchnorm argument; the train_acc1 results show insignificant changes compared to the results without the argument in the first 100 epochs, and are obviously lower than the single-GPU results. I've checked debug.log and confirmed that sync_batchnorm is on. Should I add any other argument or hyperparameter?

{'dataset': 'cifar10', 'data_dir': './datasets', 'train_dir': 'None', 'val_dir': 'None', 'data_fraction': -1.0, 'dali': False, 'num_crops_per_aug': [1, 1], 'brightness': [0.8, 0.8], 'contrast': [0.8, 0.8], 'saturation': [0.8, 0.8], 'hue': [0.2, 0.2], 'color_jitter_prob': [0.8, 0.8], 'gray_scale_prob': [0.2, 0.2], 'horizontal_flip_prob': [0.5, 0.5], 'gaussian_prob': [0.0, 0.0], 'solarization_prob': [0.0, 0.0], 'crop_size': [32, 32], 'min_scale': [0.08, 0.08], 'max_scale': [1.0, 1.0], 'debug_augmentations': False, 'no_labels': False, 'mean': [0.485, 0.456, 0.406], 'std': [0.228, 0.224, 0.225], 'logger': True, 'checkpoint_callback': 'None', 'enable_checkpointing': True, 'default_root_dir': 'None', 'gradient_clip_val': 'None', 'gradient_clip_algorithm': 'None', 'process_position': 0, 'num_nodes': 1, 'num_processes': 'None', 'devices': [0, 1, 2, 3], 'gpus': 'None', 'auto_select_gpus': False, 'tpu_cores': 'None', 'ipus': 'None', 'log_gpu_memory': 'None', 'progress_bar_refresh_rate': 'None', 'enable_progress_bar': True, 'overfit_batches': 0.0, 'track_grad_norm': -1, 'check_val_every_n_epoch': 1, 'fast_dev_run': False, 'accumulate_grad_batches': 'None', 'max_epochs': 1000, 'min_epochs': 'None', 'max_steps': -1, 'min_steps': 'None', 'max_time': 'None', 'limit_train_batches': 'None', 'limit_val_batches': 'None', 'limit_test_batches': 'None', 'limit_predict_batches': 'None', 'val_check_interval': 'None', 'flush_logs_every_n_steps': 'None', 'log_every_n_steps': 50, 'accelerator': 'gpu', 'strategy': 'ddp', 'sync_batchnorm': True, 'precision': 16, 'enable_model_summary': True, 'weights_summary': 'top', 'weights_save_path': 'None', 'num_sanity_val_steps': 2, 'resume_from_checkpoint': 'None', 'profiler': 'None', 'benchmark': 'None', 'deterministic': 'None', 'reload_dataloaders_every_n_epochs': 0, 'auto_lr_find': False, 'replace_sampler_ddp': True, 'detect_anomaly': False, 'auto_scale_batch_size': False, 'prepare_data_per_node': 'None', 'plugins': 'None', 'amp_backend': 'native', 'amp_level': 'None', 'move_metrics_to_cpu': False, 'multiple_trainloader_mode': 'max_size_cycle', 'stochastic_weight_avg': False, 'terminate_on_nan': 'None', 'method': 'simclr', 'backbone': 'resnet18', 'batch_size': 64, 'lr': 0.4, 'classifier_lr': 0.1, 'weight_decay': 1e-05, 'num_workers': 4, 'name': 'simclr-cifar10-ddp4-lr4-5', 'project': 'solo', 'entity': 'nanhuayu', 'wandb': True, 'offline': False, 'optimizer': 'sgd', 'lars': True, 'grad_clip_lars': True, 'eta_lars': 0.02, 'exclude_bias_n_norm': True, 'scheduler': 'warmup_cosine', 'lr_decay_steps': 'None', 'min_lr': 0.0, 'warmup_start_lr': 3e-05, 'warmup_epochs': 10, 'scheduler_interval': 'step', 'knn_eval': False, 'knn_k': 20, 'no_channel_last': False, 'proj_output_dim': 256, 'proj_hidden_dim': 2048, 'temperature': 0.2, 'save_checkpoint': True, 'auto_umap': False, 'auto_resume': True, 'checkpoint_dir': 'trained_models', 'checkpoint_frequency': 1, 'auto_resumer_max_hours': 36, 'num_classes': 10, 'unique_augs': 2, 'transform_kwargs': [{'brightness': 0.8, 'contrast': 0.8, 'saturation': 0.8, 'hue': 0.2, 'color_jitter_prob': 0.8, 'gray_scale_prob': 0.2, 'horizontal_flip_prob': 0.5, 'gaussian_prob': 0.0, 'solarization_prob': 0.0, 'crop_size': 32, 'min_scale': 0.08, 'max_scale': 1.0}, {'brightness': 0.8, 'contrast': 0.8, 'saturation': 0.8, 'hue': 0.2, 'color_jitter_prob': 0.8, 'gray_scale_prob': 0.2, 'horizontal_flip_prob': 0.5, 'gaussian_prob': 0.0, 'solarization_prob': 0.0, 'crop_size': 32, 'min_scale': 0.08, 'max_scale': 1.0}], 'num_large_crops': 2, 'num_small_crops': 0, 
'backbone_args/cifar': True, 'backbone_args/zero_init_residual': False, 'extra_optimizer_args/momentum': 0.9}

nanhuayu commented 2 years ago

Do I need to set parameters such as gpus, num_nodes, auto_select_gpus, etc. in the command line? @DonkeyShot21 @vturrisi

DonkeyShot21 commented 2 years ago

No, it should be fine as is. Can you post some screenshots of the loss and accuracy with and without multi-GPU here? I mean the 1-GPU run vs the 4-GPU run (with bs 64 and sync_batchnorm).

nanhuayu commented 2 years ago

The DDP4-5 version shows the results with bs 64 and sync_batchnorm. @DonkeyShot21

[screenshot: accuracy curves for the DDP4-5 run (bs 64, sync_batchnorm)]

main_pretrain.py --dataset cifar10 --backbone resnet18 --data_dir ./datasets --max_epochs 1000 --devices 0,1,2,3 --accelerator gpu --precision 16 --optimizer sgd --lars --grad_clip_lars --eta_lars 0.02 --exclude_bias_n_norm --scheduler warmup_cosine --lr 0.4 --classifier_lr 0.1 --weight_decay 1e-5 --batch_size 64 --num_workers 4 --crop_size 32 --brightness 0.8 --contrast 0.8 --saturation 0.8 --hue 0.2 --gaussian_prob 0.0 0.0 --crop_size 32 --num_crops_per_aug 1 1 --name simclr-cifar10-ddp4-lr4-5 --project solo --entity nanhuayu --wandb --save_checkpoint --auto_resume --method simclr --temperature 0.2 --proj_hidden_dim 2048 --proj_output_dim 256 --strategy ddp --sync_batchnorm

DonkeyShot21 commented 2 years ago

What about the SSL loss? Maybe it's just a problem with the linear layer!

nanhuayu commented 2 years ago

Here are the screenshots of the loss. @DonkeyShot21 There is no significant difference between the results with and without sync_batchnorm. I'm wondering whether sync_batchnorm isn't working properly: when I debugged the Lightning module, I found that there is only one node. Is there any way to confirm that the problem is in the linear layer?

[screenshot: loss curves]

DonkeyShot21 commented 2 years ago

OK, so the NCE loss is much lower; it is not gathering correctly. @vturrisi

DonkeyShot21 commented 2 years ago

If train_nce_loss_epoch were the same, then it would just be the linear layer, but in your screenshots it's clear that the NCE loss is different.

vturrisi commented 2 years ago

I'll put it in my pipeline @DonkeyShot21 @nanhuayu and I'll eventually get to it. Note that this will only happen for SimCLR (maybe NNCLR, but the queue should combat this), so in the meantime, all the other methods will work just fine.

vturrisi commented 2 years ago

Hi @nanhuayu can you check if #259 fixes your issues? The problem was that we were gathering z outside of the loss function.
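
For anyone landing here later, the gist of the fix (a hedged sketch, not solo-learn's actual code in #259): the projected features have to be gathered with autograd support inside the contrastive loss, so every rank contrasts its local samples against the full global batch.

import torch
import torch.distributed as dist
import torch.nn.functional as F


class GatherLayer(torch.autograd.Function):
    # all_gather that also backpropagates gradients to the local tensor.

    @staticmethod
    def forward(ctx, x):
        out = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(out, x)
        return tuple(out)

    @staticmethod
    def backward(ctx, *grads):
        grads = torch.stack(grads)
        dist.all_reduce(grads)          # sum the contributions from every rank
        return grads[dist.get_rank()]   # keep only the slice for the local input


def gather(x):
    # Concatenate x from all ranks; a no-op outside of a distributed run.
    if dist.is_available() and dist.is_initialized():
        return torch.cat(GatherLayer.apply(x), dim=0)
    return x


def info_nce(z1, z2, temperature=0.2):
    # Toy NT-Xent/SimCLR loss: the important part is that gather() is called
    # *inside* the loss, so negatives come from the global batch on every rank
    # and gradients flow back through the gather.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    z = torch.cat([gather(z1), gather(z2)], dim=0)      # (2 * global_batch, dim)
    n = z.shape[0] // 2

    logits = z @ z.T / temperature
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(mask, float("-inf"))    # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(logits, targets)

In solo-learn itself this is handled by the changes in #259; the snippet above only illustrates the pattern.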

nanhuayu commented 2 years ago

Thanks for your advice, I will try it now.

vturrisi commented 2 years ago

I tried with 2 GPUs and the NCE losses are exactly the same. Also, accuracy is very similar. I think the PR is enough to fix your issues, and I'm going to merge it into main since it also fixes a couple of bugs that would affect users running other methods on the latest version of solo. If your issue is still not fixed, let me know and I'll look into it again.

nanhuayu commented 2 years ago

The issue has been fixed! Thanks a lot!

vturrisi commented 2 years ago

Just for reference, we were gathering outside the loss as well, so we were effectively scaling down our gradients by the number of GPUs. Also, there was an issue with scaling the learning rate twice.

nanhuayu commented 2 years ago

I checked the final results and found that the results of DDP on 4 GPUs with batch size 64 were slightly different from those on a single GPU with batch size 256. @vturrisi @DonkeyShot21

[screenshot: final results, 4-GPU DDP (bs 64) vs single GPU (bs 256)]

DonkeyShot21 commented 2 years ago

This is with sync batchnorm?

nanhuayu commented 2 years ago

The DDP version is with sync_batchnorm.

main_pretrain.py --dataset cifar10 --backbone resnet18 --data_dir ./datasets --max_epochs 1000 --devices 0,1,2,3 --accelerator gpu --precision 16 --optimizer lars --grad_clip_lars --eta_lars 0.02 --exclude_bias_n_norm --scheduler warmup_cosine --lr 0.4 --classifier_lr 0.1 --weight_decay 1e-5 --batch_size 64 --num_workers 4 --crop_size 32 --brightness 0.8 --contrast 0.8 --saturation 0.8 --hue 0.2 --gaussian_prob 0.0 0.0 --crop_size 32 --num_crops_per_aug 1 1 --name simclr-cifar10-ddp4-lr4-6 --project solo --entity nanhuayu --wandb --save_checkpoint --auto_resume --method simclr --temperature 0.2 --proj_hidden_dim 2048 --proj_output_dim 256 --strategy ddp --sync_batchnorm

nanhuayu commented 2 years ago

I also trained SimCLR with batch size 256; the final results were almost the same as those with batch size 64. It does not seem to be directly related to batch size. @vturrisi @DonkeyShot21

[screenshot: final results with batch size 256]