Array and tensor device incompatibility

mbecker12 commented 3 years ago

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/surface-rl-decoder/src/distributed/actor.py", line 153, in actor
    actions, q_values = select_actions(
  File "/surface-rl-decoder/src/distributed/util.py", line 55, in select_actions
    torch.softmax(policy_net_output[non_greedy_indices], dim=1, dtype=torch.float32)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
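
The message suggests that a CUDA tensor reaches a NumPy conversion somewhere around the softmax call in `select_actions`. A minimal sketch of the copy-to-host step the error asks for, not the actual code from `util.py`; the helper name `to_numpy` and the tensor shape are made up for illustration:

```python
import numpy as np
import torch


def to_numpy(tensor: torch.Tensor) -> np.ndarray:
    """Copy a tensor to host memory and return it as a NumPy array."""
    # detach() drops the autograd history; cpu() returns the tensor itself
    # if it already lives on the host, so this is safe for cpu actors too.
    return tensor.detach().cpu().numpy()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for policy_net_output[non_greedy_indices]; the shape is hypothetical.
logits = torch.randn(4, 3, device=device)

probabilities = to_numpy(torch.softmax(logits, dim=1, dtype=torch.float32))
```

Because the helper behaves the same for cpu and cuda tensors, it could be used in both kinds of actor process without branching on the device.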

mbecker12 commented 3 years ago

Check the device property of all tensors in the actor process, including state tensors, actions, q_values, etc.

Also, using a T4 instead of a V100 didn't seem to work.

Besides, one should confirm that a cpu actor process still works.
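
One possible way to do the device check is a small helper that reports where each object lives. This is only a sketch: `report_devices` is a hypothetical name, and the example tensors stand in for the real state, actions and q_values of the actor:

```python
import torch


def report_devices(**named_values):
    """Map each name to the device of its tensor, or flag non-tensors."""
    report = {}
    for name, value in named_values.items():
        if isinstance(value, torch.Tensor):
            report[name] = str(value.device)
        else:
            # NumPy arrays, lists etc. always live in host memory.
            report[name] = "not a tensor ({})".format(type(value).__name__)
    return report


# Hypothetical stand-ins for the objects handled in the actor process.
state = torch.zeros(2, 5)
q_values = torch.zeros(2, 3)
actions = [0, 1]

report = report_devices(state=state, q_values=q_values, actions=actions)
for name, device in report.items():
    print(name, "->", device)
```

Running the same check in a cpu-only actor should list `cpu` for every tensor, which also helps confirm that the cpu path still works.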

mbecker12 commented 3 years ago

The following error is raised when trying to use a T4:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/surface-rl-decoder/src/distributed/actor.py", line 153, in actor
    actions, q_values = select_actions(
  File "/surface-rl-decoder/src/distributed/util.py", line 38, in select_actions
    policy_net_output = model(state)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/surface-rl-decoder/src/distributed/dummy_agent.py", line 41, in forward
    x = F.relu(self.lin1(x))
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
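
To narrow this down, it may help to test whether a matrix multiplication fails in a fresh process, outside the actor/learner setup. The following is a rough diagnostic sketch, not a fix. `cublas_smoke_test` is a hypothetical name, and the sketch assumes that a small matmul on the GPU is enough to make PyTorch set up cuBLAS:

```python
import torch


def cublas_smoke_test():
    """Collect basic CUDA info and try one small matmul on the GPU."""
    info = {
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for cpu-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if not info["cuda_available"]:
        return info

    info["device_name"] = torch.cuda.get_device_name(0)
    info["compute_capability"] = torch.cuda.get_device_capability(0)
    try:
        a = torch.randn(8, 8, device="cuda:0")
        # .item() forces the kernel to finish, so errors surface here.
        (a @ a).sum().item()
        info["matmul_ok"] = True
    except RuntimeError as err:
        info["matmul_ok"] = False
        info["error"] = str(err)
    return info


info = cublas_smoke_test()
for key, value in info.items():
    print(key, "=", value)
```

If the matmul succeeds on the T4 in isolation, the problem is more likely related to how the processes share the GPU than to the card or the driver itself.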

mbecker12 commented 3 years ago

Similar error for the learner process (when actor processes are forced onto cpu):

ERROR:main:An error occurred!
ERROR:main:Traceback (most recent call last):
  File "/surface-rl-decoder/src/distributed/start_distributed_mp.py", line 218, in start_mp
    learner(learner_args)
  File "/surface-rl-decoder/src/distributed/learner.py", line 173, in learner
    indices, priorities = perform_q_learning_step(
  File "/surface-rl-decoder/src/distributed/learner_util.py", line 135, in perform_q_learning_step
    policy_output = policy_net(batch_state)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/surface-rl-decoder/src/distributed/dummy_agent.py", line 41, in forward
    x = F.relu(self.lin1(x))
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/surface-rl-decoder/virtualenv/qec/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

INFO:main:Saving Metadata
INFO:main:Training Done!
CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

mbecker12 commented 3 years ago

Current progress in issue #47

mbecker12 commented 3 years ago

Sent support request to Alvis support staff

mbecker12 / surface-rl-decoder

Array and tensor device incompatibility #57