tapantabhanja opened 1 year ago
@tapantabhanja, have you resolved the problem?
@ahenkes1 Unfortunately no. After this did not work on my laptop, I made a naive attempt to get around the CUDA out-of-memory error by running the code on an HPC cluster. The GPUs I used there were much more powerful and had more memory available. Unfortunately, I hit the same error there too, which made me realise that the problem stems from somewhere else. I still have not been able to figure it out.
I would be very grateful for some help.
Description
The idea was to convert a VGG-16 network to its spiking equivalent and train it on the cats-and-dogs image dataset.
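For context, here is a minimal sketch of what I mean by that conversion, using snntorch Leaky neurons with a surrogate gradient. The layer sizes, beta, batch size and number of time steps here are placeholders, not my actual network:

```python
# Minimal sketch, not my full VGG-16: one VGG-style conv block converted to a
# spiking block with snntorch, run over several time steps.
import torch
import torch.nn as nn
import snntorch as snn
from snntorch import surrogate

device = "cuda" if torch.cuda.is_available() else "cpu"
spike_grad = surrogate.atan()  # ATan surrogate, as seen in the stack trace below

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    snn.Leaky(beta=0.9, spike_grad=spike_grad, init_hidden=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    snn.Leaky(beta=0.9, spike_grad=spike_grad, init_hidden=True),
    nn.MaxPool2d(2),
).to(device)

images = torch.rand(32, 3, 224, 224, device=device)  # placeholder cat/dog batch
num_steps = 25
spk_record = []
for _ in range(num_steps):
    # each time step runs the input through the block and keeps the resulting
    # activations in the autograd graph until backward is called
    spk_record.append(block(images))
```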
The initial problem was that the model did not load on my GPU: it threw a CUDA out-of-memory error, even though the model itself is much smaller than the GPU memory. I could not detect where in the code the problem lies. To work around it, I thought multi-GPU training would be a good idea: the DeepSpeed strategy might partition my model across multiple GPUs and thus let me train it. I tried the PyTorch Lightning library, which provides an easier API for multi-GPU training, but this was also unsuccessful due to a timed-out initialisation of the process groups. My code and stack traces follow.
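Roughly, the multi-GPU attempt looks like the following. This is a simplified sketch with a placeholder model, optimiser and strategy string; my real script builds the spiking VGG and the data loaders (devices=8 matches the world_size=8 in the error below):

```python
# Sketch of the Lightning Fabric + DeepSpeed setup I am trying (placeholder model).
import torch
import torch.nn as nn
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=8, strategy="deepspeed")
fabric.launch()  # this is where the store-based barrier times out for me

model = nn.Linear(10, 2)  # stands in for the spiking VGG-16
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model, optimizer = fabric.setup(model, optimizer)

inputs = torch.rand(32, 10, device=fabric.device)
loss = model(inputs).sum()
fabric.backward(loss)  # Fabric routes the backward through DeepSpeed
optimizer.step()
```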
What I Did
My code:
My stack trace for the CUDA out-of-memory error:
File "/work/bhanja/example_training/training_neuro_cats_dogs.py", line 386, in <module> spk_record, mem_pot_record = n_model(train_images) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/work/bhanja/example_training/training_neuro_cats_dogs.py", line 183, in forward out = self.layer2(out) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward input = module(input) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/snntorch/_neurons/leaky.py", line 193, in forward self.reset = self.mem_reset(self.mem) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/snntorch/_neurons/neurons.py", line 107, in mem_reset reset = self.spike_grad(mem_shift).clone().detach() File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/snntorch/surrogate.py", line 210, in inner return ATan.apply(x, alpha) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply return super().apply(*args, **kwargs) # type: ignore[misc] File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/snntorch/surrogate.py", line 189, in forward out = (input_ > 0).float() torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 31.74 GiB total capacity; 30.76 GiB already allocated; 3.12 MiB free; 31.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
My stack trace for the timed-out initialisation of process groups:
File "/work/bhanja/example_training/training_neuro_cats_dogs.py", line 336, in <module> fabric.launch() File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 664, in launch return self._strategy.launcher.launch(function, *args, **kwargs) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/lightning/fabric/strategies/launchers/subprocess_script.py", line 90, in launch return function(*args, **kwargs) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 749, in _run_with_setup self._strategy.setup_environment() File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/lightning/fabric/strategies/ddp.py", line 113, in setup_environment self._setup_distributed() File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/lightning/fabric/strategies/deepspeed.py", line 576, in _setup_distributed self._init_deepspeed_distributed() File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/lightning/fabric/strategies/deepspeed.py", line 594, in _init_deepspeed_distributed deepspeed.init_distributed(self._process_group_backend, distributed_port=self.cluster_environment.main_port) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 116, in __init__ self.init_process_group(backend, timeout, init_method, rank, world_size) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 142, in init_process_group torch.distributed.init_process_group(backend, File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 932, in init_process_group _store_based_barrier(rank, store, timeout) File "/home/bhanja/.conda/envs/spikingjelly/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 469, in _store_based_barrier raise RuntimeError( RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=1, timeout=0:30:00)
What am I doing wrong? I could not find a solution for this.