Hi,
I notice you use this line for distributed training: nn.DataParallel(model).to(device)
I recommend trying this version:
device_ids = [0, 1] #your GPU index
model = torch.nn.DataParallel(model, device_ids=device_ids)
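For reference, here is a minimal self-contained sketch of that pattern (the linear layer and the random batch are just placeholders, not your network): both the DataParallel-wrapped model and the input batch need to be moved to the CUDA device.

import torch
import torch.nn as nn

device = torch.device("cuda")
device_ids = [0, 1]  # your GPU indices

model = nn.Linear(10, 2)  # placeholder for your network
model = nn.DataParallel(model, device_ids=device_ids).to(device)

data = torch.randn(32, 10, device=device)  # the input batch must also be on CUDA
out = model(data)  # DataParallel scatters the batch across device_ids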
Hi, thank you for your help.
I tried your
device_ids = [0, 1] #your GPU index
model = torch.nn.DataParallel(model, device_ids=device_ids)
code and the same error appeared.
Traceback (most recent call last):
File "/home/hubo1024/PycharmProjects/snntorch/15epoch_50k.py", line 149, in <module>
spk_rec = model(data)
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hubo1024/PycharmProjects/snntorch/15epoch_50k.py", line 98, in forward
out = self.layer1(input_torch)
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/snntorch/_neurons/leaky.py", line 162, in forward
self.mem = self.state_fn(input_)
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/snntorch/_neurons/leaky.py", line 201, in _build_state_function_hidden
self._base_state_function_hidden(input_) - self.reset * self.threshold
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/snntorch/_neurons/leaky.py", line 195, in _base_state_function_hidden
base_fn = self.beta.clamp(0, 1) * self.mem + input_
File "/home/hubo1024/anaconda3/envs/snn_torch/lib/python3.9/site-packages/torch/_tensor.py", line 1121, in __torch_function__
ret = func(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Process finished with exit code 1
Also, I updated the environment info above just in case.
Thank you again.
Hi,
I debugged the Leaky class. You are right: the value of the self.mem variable of the Leaky class gets altered unexpectedly when DataParallel is used, which causes this error. However, I have not found the root cause yet. As a temporary workaround, I recommend changing the source code of the Leaky class slightly. You can find this at line 161 of snntorch/_neurons/leaky.py:
...
if self.init_hidden:
    self._leaky_forward_cases(mem)
    self.reset = self.mem_reset(self.mem)
    self.mem = self.state_fn(input_)
...
You can replace it with this version:
...
if self.init_hidden:
    self._leaky_forward_cases(mem)
    self.reset = self.mem_reset(self.mem)
    self.mem = self._build_state_function_hidden(input_)
...
This works correctly on my dual-GPU workstation.
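If it helps, here is a small sanity-check sketch (the layer sizes, batch shape, and device_ids are placeholders, not the script from this issue) for confirming whether the patched Leaky runs under DataParallel without the device-mismatch error:

import torch
import torch.nn as nn
import snntorch as snn
from snntorch import utils

device = torch.device("cuda")
net = nn.Sequential(
    nn.Conv2d(1, 8, 3),
    snn.Leaky(beta=0.9, init_hidden=True),
    nn.Flatten(),
    nn.Linear(8 * 26 * 26, 10),
    snn.Leaky(beta=0.9, init_hidden=True, output=True),
)
model = nn.DataParallel(net, device_ids=[0, 1]).to(device)

utils.reset(net)  # clear the hidden states of the Leaky layers
data = torch.randn(16, 1, 28, 28, device=device)
spk, mem = model(data)  # should run on both GPUs without the RuntimeError
print(spk.shape, mem.shape)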
Hi, I've pushed a PR to fix this issue. Once it's merged into the master branch, you can clone it and the problem will be solved! #156
Made the same fix for other neurons too. #161
Fri Dec 2 11:16:53 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:1B:00.0 Off |                  Off |
| 30%   30C    P8    29W / 300W |      1MiB / 48682MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:1C:00.0 Off |                  Off |
| 30%   27C    P8    22W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    On   | 00000000:1D:00.0 Off |                  Off |
| 30%   32C    P8    23W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    On   | 00000000:1E:00.0 Off |                  Off |
| 30%   31C    P8    23W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000    On   | 00000000:3D:00.0 Off |                  Off |
| 30%   27C    P8    22W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000    On   | 00000000:3F:00.0 Off |                  Off |
| 30%   29C    P8    23W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A6000    On   | 00000000:40:00.0 Off |                  Off |
| 30%   27C    P8    22W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A6000    On   | 00000000:41:00.0 Off |                  Off |
| 30%   30C    P8    22W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
And here's the error:
I reran this code after removing the snn.Leaky layer from the CNN and it worked fine (of course the cost doesn't converge and the accuracy was 0%, but it still runs). So I assume that the cause of this error is the snn.Leaky layer. I think changing