salesforce / warp-drive

Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning Framework on a GPU (JMLR 2022)
BSD 3-Clause "New" or "Revised" License
465 stars, 78 forks

Error: Invalid Resource handle. #31

Closed: Ma-Weijian closed this issue 2 years ago

Ma-Weijian commented 2 years ago

Hello WarpDrive Team,

WarpDrive is a good MARL library indeed. I tried it on an old machine and it worked fine.

However, after moving to a new machine, I ran into the following error.

(warp_drive) ***@***-lab-gpu:~/warp-drive-master/warp_drive$ python training/example_training_script.py --env tag_continuous --num_gpus 1 --results_dir ..
We have successfully found 1 GPUs!
Training with 1 GPU(s).
Traceback (most recent call last):
  File "training/example_training_script.py", line 224, in <module>
    setup_trainer_and_train(run_config, results_directory=results_dir)
  File "training/example_training_script.py", line 126, in setup_trainer_and_train
    trainer.train()
  File "/home/mwj/warp-drive-master/warp_drive/training/trainer.py", line 402, in train
    metrics = self._update_model_params(iteration)
  File "/home/mwj/warp-drive-master/warp_drive/training/trainer.py", line 741, in _update_model_params
    loss.backward()
  File "/home/mwj/anaconda3/envs/warp_drive/lib/python3.7/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/mwj/anaconda3/envs/warp_drive/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: Event device type CUDA does not match blocking stream's device type CPU.
Exception ignored in: <function CUDASampler.__del__ at 0x7f86b065e9e0>
Traceback (most recent call last):
  File "/home/mwj/warp-drive-master/warp_drive/managers/function_manager.py", line 637, in __del__
    free(block=self._block, grid=self._grid)
  File "/home/mwj/anaconda3/envs/warp_drive/lib/python3.7/site-packages/pycuda/driver.py", line 480, in function_call
    func._set_block_shape(*block)
pycuda._driver.LogicError: cuFuncSetBlockShape failed: invalid resource handle

The output of nvidia-smi on this machine is as follows.

Tue Apr  5 23:10:52 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 30%   24C    P8    34W / 350W |    326MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1268      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      1771      G   /usr/lib/xorg/Xorg                144MiB |
|    0   N/A  N/A      1884      G   /usr/bin/gnome-shell               55MiB |
|    0   N/A  N/A      3043      G   gnome-control-center               12MiB |
|    0   N/A  N/A      6784      G   ...792671094337050779,131072       46MiB |
|    0   N/A  N/A     12488      G   ...RendererForSitePerProcess       15MiB |
+-----------------------------------------------------------------------------+

The result of running utils/run_unittests.py is as follows.

(warp_drive) mwj@mwj-lab-gpu:~/warp-drive-master/warp_drive$ python utils/run_unittests.py
Running Unit tests ... 
/home/mwj/warp-drive-master/warp_drive/cuda_includes/../../example_envs/tag_gridworld/tag_gridworld_step.cu(151): warning #2361-D: invalid narrowing conversion from "unsigned int" to "int"

====================================================================================== test session starts =======================================================================================
platform linux -- Python 3.7.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/mwj/warp-drive-master
collected 13 items                                                                                                                                                                               

../tests/example_envs/test_tag_continuous.py .                                                                                                                                             [  7%]
../tests/example_envs/test_tag_gridworld.py .                                                                                                                                              [ 15%]
../tests/example_envs/test_tag_gridworld_step_cuda.py .                                                                                                                                    [ 23%]
../tests/example_envs/test_tag_gridworld_step_python.py ..                                                                                                                                 [ 38%]
../tests/warp_drive/test_action_sampler.py ...                                                                                                                                             [ 61%]
../tests/warp_drive/test_data_manager.py ...                                                                                                                                               [ 84%]
../tests/warp_drive/test_env_reset.py .                                                                                                                                                    [ 92%]
../tests/warp_drive/test_function_manager.py .                                                                                                                                             [100%]

======================================================================================== warnings summary ========================================================================================
../../anaconda3/envs/warp_drive/lib/python3.7/site-packages/gym/envs/registration.py:250
  /home/mwj/anaconda3/envs/warp_drive/lib/python3.7/site-packages/gym/envs/registration.py:250: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
    for plugin in metadata.entry_points().get(entry_point, []):

../../anaconda3/envs/warp_drive/lib/python3.7/site-packages/pycuda/compyte/dtypes.py:120
  /home/mwj/anaconda3/envs/warp_drive/lib/python3.7/site-packages/pycuda/compyte/dtypes.py:120: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    reg.get_or_register_dtype("bool", np.bool)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
================================================================================= 13 passed, 2 warnings in 5.38s =================================================================================

Since all the unit tests pass, I think a CUDA version mismatch is probably not the cause.

Also, there are many other environments on this machine, so I would prefer a solution that changes my setup as little as possible.

What can I do to fix this issue? Any ideas would help.

Many thanks!

Emerald01 commented 2 years ago

Hello @Ma-Weijian

Thank you for trying WarpDrive; we are glad to hear you like it. This error is most likely caused by a bug in the latest version of PyTorch (e.g., version 1.11.0). We can reproduce it as well:

File "/warp_drive/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

Our installation requirements specify torch >= 1.9, so you have probably installed the latest release, 1.11, which seems to have this issue in the backward() function. I suggest downgrading torch to version 1.9.0, which should solve your problem. We will also update the requirements soon.
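A quick way to confirm which torch version an environment actually has is sketched below. This is only an illustration: the helper names `parse_major_minor` and `is_known_good_torch` are hypothetical, and the version bounds are taken solely from what is reported in this thread (1.9.0 suggested, 1.11.0 failing), not from an official WarpDrive compatibility list.

```python
import re

# Assumption: bounds come only from this thread, not from WarpDrive's docs.
KNOWN_GOOD_MIN = (1, 9)    # torch >= 1.9 is the stated requirement
KNOWN_BAD_FROM = (1, 11)   # torch 1.11.0 is reported to fail in backward()


def parse_major_minor(version):
    """Extract (major, minor) from a string such as '1.10.2+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_known_good_torch(version):
    """Return True if the version falls in the range reported to work here."""
    return KNOWN_GOOD_MIN <= parse_major_minor(version) < KNOWN_BAD_FROM


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        status = "ok" if is_known_good_torch(torch.__version__) else "suspect"
        print(f"torch {torch.__version__}: {status}")
```

Running the check inside the `warp_drive` conda environment before training shows whether a downgrade is needed there, without touching other environments on the machine.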

Please let us know whether this resolves your problem. Thank you.

Regards,

Ma-Weijian commented 2 years ago

Hi @Emerald01, thanks for the solution. I switched PyTorch to version 1.10 and it works.

The official PyTorch installation instructions for version 1.9 did not work for me: the resulting build does not seem to support my RTX 3090.

Anyway, since PyTorch 1.11 is now the default release, it would be worth investigating the cause of this error when you have time.
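One way to check whether a given torch build can target a GPU is to compare the device's compute capability with the architectures the build was compiled for. The sketch below is a simplified illustration, not a diagnosis of the case above. The helper `arch_supported` is hypothetical; it only considers binary `sm_XY` entries and ignores PTX (`compute_XY`) entries, which can sometimes be JIT-compiled for newer GPUs. An RTX 3090 has compute capability 8.6.

```python
import re


def arch_supported(capability, arch_list):
    """Return True if any 'sm_XY' entry can run on the given device.

    Simplified rule: a binary built for compute capability X.y runs on a
    device with capability X.z when z >= y. Entries that are not plain
    'sm_<digits>' (for example 'compute_80') are ignored.
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)(\d)", arch)
        if match is None:
            continue
        major, minor = int(match.group(1)), int(match.group(2))
        if major == dev_major and minor <= dev_minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            capability = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print(f"device capability: {capability}, build archs: {archs}")
            print("supported:", arch_supported(capability, archs))
        else:
            print("CUDA is not available to this torch build")
```

If the check reports no match, installing a torch wheel built against a newer CUDA toolkit is the usual remedy; which wheel that is depends on the torch version and is not stated in this thread.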

Thanks and best regards.