vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by Pytorch Lightning
MIT License

nnclr occurs error on single machine multi-gpu case #267

Closed HuangChiEn closed 2 years ago

HuangChiEn commented 2 years ago

Hello, thank you for providing this SSL library and for responding so quickly to so many issues.

As the title describes, I encountered an error when running nnclr.sh in single-machine multi-GPU mode (more than one GPU). The error is raised inside the trainer.fit function, and the full message is shown below:

Traceback (most recent call last):
  File "../../../main_pretrain.py", line 207, in <module>
    main()
  File "../../../main_pretrain.py", line 203, in main
    trainer.fit(model, train_loader, val_loader, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/spawn.py", line 82, in launch
    start_method=self._start_method,
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TorchGraph.create_forward_hook.<locals>.after_forward_hook'

I have already run pip install -r requirement.txt.

image

Other information about the package versions:

image

However, it runs smoothly with a single GPU. The bash file is configured as follows:

python3 ../../../main_pretrain.py \
    --dataset cifar100 \
    --backbone resnet18 \
    --data_dir /data \
    --max_epochs 1000 \
    --devices 0,1 \
    --accelerator gpu \
    --precision 16 \
    --optimizer lars \
    --grad_clip_lars \
    --eta_lars 0.02 \
    --exclude_bias_n_norm \
    --scheduler warmup_cosine \
    --lr 0.4 \
    --classifier_lr 0.1 \
    --weight_decay 1e-5 \
    --batch_size 256 \
    --num_workers 4 \
    --brightness 0.4 \
    --contrast 0.4 \
    --saturation 0.2 \
    --hue 0.1 \
    --gaussian_prob 0.0 0.0 \
    --solarization_prob 0.0 0.2 \
    --crop_size 32 \
    --num_crops_per_aug 1 1 \
    --name nnclr-cifar \
    --project lw-ssl \
    --entity josef \
    --wandb \
    --save_checkpoint \
    --auto_resume \
    --method nnclr \
    --temperature 0.2 \
    --proj_hidden_dim 2048 \
    --pred_hidden_dim 4096 \
    --proj_output_dim 256 \
    --queue_size 65536

Thanks for having a look, any suggestion will be appreciated!!

vturrisi commented 2 years ago

Hey, you are just missing --strategy ddp in your bash file.
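For context on why the flag matters: the traceback passes through the spawn launcher, which pickles the process object (the `reduction.dump` and `ForkingPickler` frames), and pickle cannot serialize a function defined inside another function. The standard-library sketch below reproduces only that failure mode. The hook functions are hypothetical stand-ins, not the actual wandb or Lightning code, and the assumption that `--strategy ddp` avoids the problem by not pickling the model is inferred from the traceback and the fix above rather than stated in the thread.

```python
import pickle


def create_forward_hook():
    # Hypothetical stand-in for a hook factory. The inner function is a
    # "local object": its qualified name contains "<locals>", so pickle
    # cannot find it again by name at load time.
    def after_forward_hook(module, inputs, output):
        return output

    return after_forward_hook


def module_level_hook(module, inputs, output):
    # A module-level function is pickled by reference to its name.
    return output


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Python 3.7 raises AttributeError ("Can't pickle local object ...");
        # other versions may raise PicklingError instead.
        return False


local_ok = can_pickle(create_forward_hook())
module_ok = can_pickle(module_level_hook)
print("local hook picklable:", local_ok)
print("module-level hook picklable:", module_ok)
```

Run as a script, the locally defined hook should fail to pickle while the module-level one should succeed, which matches the `Can't pickle local object` message in the report.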

HuangChiEn commented 2 years ago

> Hey, you are just missing --strategy ddp in your bash file.

The issue is solved, thank you for your help ~ ~