Closed mbecker12 closed 3 years ago
Check the `device` property of every tensor in the actor process, including state tensors, actions, q_values, etc.
Also, running on a T4 instead of a V100 did not work.
Besides, one should confirm that a CPU-only actor process still works.
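A minimal sketch of how such a device check could look. The helper name `device_mismatches` and the dictionary of named tensors are hypothetical and not part of the repository; the only assumption about the inputs is that each object exposes a `.device` attribute, as `torch.Tensor` does.

```python
from typing import Any, Dict


def device_mismatches(named_tensors: Dict[str, Any], expected: Any) -> Dict[str, str]:
    """Return {name: actual_device} for every tensor not on the expected device.

    Hypothetical helper: works with any object that has a `.device` attribute,
    e.g. torch.Tensor. `expected` may be a string or a torch.device.
    """
    expected = str(expected)
    mismatches = {}
    for name, tensor in named_tensors.items():
        actual = str(tensor.device)
        # "cuda" matches "cuda:0", "cuda:1", ...; "cuda:0" matches only itself.
        same = actual == expected or (
            ":" not in expected and actual.split(":")[0] == expected
        )
        if not same:
            mismatches[name] = actual
    return mismatches


# Possible use inside the actor loop (names taken from the list above):
#   bad = device_mismatches(
#       {"state": state, "actions": actions, "q_values": q_values}, device
#   )
#   assert not bad, f"tensors on unexpected device: {bad}"
```

Running the same check with `expected="cpu"` would be one way to confirm that the CPU actor path keeps all of its tensors off the GPU.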
The following error is raised when trying to use a T4:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/surface-rl-decoder/src/distributed/actor.py", line 153, in actor
    actions, q_values = select_actions(
  File "/surface-rl-decoder/src/distributed/util.py", line 38, in select_actions
    policy_net_output = model(state)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/surface-rl-decoder/src/distributed/dummy_agent.py", line 41, in forward
    x = F.relu(self.lin1(x))
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
A similar error is raised in the learner process (when the actor processes are forced onto the CPU):
ERROR:main:An error occurred!
ERROR:main:Traceback (most recent call last):
  File "/surface-rl-decoder/src/distributed/start_distributed_mp.py", line 218, in start_mp
    learner(learner_args)
  File "/surface-rl-decoder/src/distributed/learner.py", line 173, in learner
    indices, priorities = perform_q_learning_step(
  File "/surface-rl-decoder/src/distributed/learner_util.py", line 135, in perform_q_learning_step
    policy_output = policy_net(batch_state)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/surface-rl-decoder/src/distributed/dummy_agent.py", line 41, in forward
    x = F.relu(self.lin1(x))
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
INFO:main:Saving Metadata
INFO:main:Training Done!
CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
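Both tracebacks fail at the first `F.linear` call on the GPU, so it may help to check whether that call works on the T4 outside of the actor/learner processes. The following is only a rough sketch of such a smoke test; the function name `cublas_smoke_test` and the tensor shapes are made up for illustration, and a passing result would not rule out a problem that only shows up with multiple processes.

```python
def cublas_smoke_test(device_index: int = 0) -> str:
    """Run one tiny linear layer on the GPU and report what happened."""
    try:
        import torch
        import torch.nn.functional as F
    except ImportError:
        return "torch not installed"

    if not torch.cuda.is_available():
        return "cuda unavailable"

    device = torch.device("cuda", device_index)
    name = torch.cuda.get_device_name(device)
    try:
        # Arbitrary small shapes; only the call path matters here.
        x = torch.randn(2, 8, device=device)
        weight = torch.randn(4, 8, device=device)
        bias = torch.zeros(4, device=device)
        out = F.linear(x, weight, bias)  # same call that fails in the tracebacks
        torch.cuda.synchronize(device)  # surface asynchronous CUDA errors here
    except RuntimeError as err:
        return f"failed on {name}: {err}"
    return f"ok on {name}: output shape {tuple(out.shape)}"


if __name__ == "__main__":
    print(cublas_smoke_test())
```

If this standalone call succeeds on the T4 while the distributed run still fails, the cause is more likely in how the processes share the GPU than in the device itself.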
Current progress in issue #47
Sent a support request to the Alvis support staff.