Error when run ./train_mpe_spread.sh

ChuangZhang1999 commented 10 months ago

When I tried to run ./train_mpe_spread.sh, I got the following error:

obs_space:  [Box(18,), Box(18,), Box(18,)]
share_obs_space:  [Box(54,), Box(54,), Box(54,)]
act_space:  [Discrete(5), Discrete(5), Discrete(5)]
Traceback (most recent call last):
  File "../train/train_mpe.py", line 174, in <module>
    main(sys.argv[1:])
  File "../train/train_mpe.py", line 159, in main
    runner.run()
  File "/mnt/nvme1n1/zhangchuang_23/MARL/on-policy-main/onpolicy/runner/shared/mpe_runner.py", line 28, in run
    values, actions, action_log_probs, rnn_states, rnn_states_critic, actions_env = self.collect(step)
  File "/home/zhangchuang_23/envs/MARL/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/nvme1n1/zhangchuang_23/MARL/on-policy-main/onpolicy/runner/shared/mpe_runner.py", line 103, in collect
    np.concatenate(self.buffer.masks[step]))
  File "/mnt/nvme1n1/zhangchuang_23/MARL/on-policy-main/onpolicy/algorithms/r_mappo/algorithm/rMAPPOPolicy.py", line 71, in get_actions
    deterministic)
  File "/home/zhangchuang_23/envs/MARL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nvme1n1/zhangchuang_23/MARL/on-policy-main/onpolicy/algorithms/r_mappo/algorithm/r_actor_critic.py", line 64, in forward
    actor_features = self.base(obs)
  File "/home/zhangchuang_23/envs/MARL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nvme1n1/zhangchuang_23/MARL/on-policy-main/onpolicy/algorithms/utils/mlp.py", line 56, in forward
    x = self.mlp(x)
  File "/home/zhangchuang_23/envs/MARL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/nvme1n1/zhangchuang_23/MARL/on-policy-main/onpolicy/algorithms/utils/mlp.py", line 27, in forward
    x = self.fc1(x)
  File "/home/zhangchuang_23/envs/MARL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhangchuang_23/envs/MARL/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/zhangchuang_23/envs/MARL/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhangchuang_23/envs/MARL/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/zhangchuang_23/envs/MARL/lib/python3.6/site-packages/torch/nn/functional.py", line 1610, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

satpreetsingh commented 7 months ago

Try running this minimal check and see if you still get the error:

import torch
print("Is CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())

a = torch.randn(1024, 1024, device="cuda:0")
b = torch.randn(1024, 1024, device="cuda:0")
c = torch.matmul(a, b)  # Matrix multiplication
print("Matrix multiplication result shape:", c.shape)

If it still fails, the problem is in your PyTorch/CUDA installation rather than in the training code. Try reinstalling PyTorch:

conda install pytorch -c pytorch
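
If you want to rerun the same check before and after reinstalling, it can be wrapped in a small function that reports a status instead of crashing. This is only a sketch: the name `cuda_matmul_status` and the status strings are made up for illustration and are not part of this repo or of PyTorch.

```python
def cuda_matmul_status(size=1024):
    """Run one matmul on the GPU and return a short status string.

    Hypothetical helper for illustration only.
    """
    try:
        import torch
    except ImportError:
        return "torch-missing"

    if not torch.cuda.is_available():
        return "no-cuda"

    try:
        # Report which device and compute capability PyTorch sees.
        print("Device:", torch.cuda.get_device_name(0))
        print("Compute capability:", torch.cuda.get_device_capability(0))

        a = torch.randn(size, size, device="cuda:0")
        b = torch.randn(size, size, device="cuda:0")
        c = torch.matmul(a, b)
        # CUDA kernels run asynchronously; synchronize so that any
        # execution error is raised here and not at a later call.
        torch.cuda.synchronize()
        finite = bool(torch.isfinite(c).all().item())
    except RuntimeError as err:
        # Keep only the first line of the CUDA error message.
        return "failed: " + str(err).split("\n")[0]

    return "ok" if finite else "non-finite"


if __name__ == "__main__":
    print("Status:", cuda_matmul_status())
```

A status of "failed: ..." here, outside the training code, would point at the installation (driver, CUDA toolkit, or a PyTorch build that does not match the GPU) rather than at train_mpe.py.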

zoeyuchao commented 4 months ago

Fixed! Try the new code!

marlbenchmark / on-policy

Error when run ./train_mpe_spread.sh #98