Getting "RuntimeError: Cannot find cuda device suitable for rendering cuda:0" when training an RL agent

Pigbrainlyh commented 1 year ago

System:

OS version: Ubuntu 22.04 LTS
Python version (if applicable): Python 3.8.10

Describe the bug: I use SAPIEN to render camera images as observations in a gym environment. However, the following error occurred during training.

Traceback (most recent call last):
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../utils/gym_env_utils.py", line 44, in _safe_run_worker
    observation = env.reset()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/gym/wrappers/order_enforcing.py", line 16, in reset
    return self.env.reset(**kwargs)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../envs/continuous_insertion.py", line 1292, in reset
    obs = super().reset(specify_offset)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../envs/continuous_insertion.py", line 603, in reset
    reset_ok, sim_offset = self.__initialize__(sim_offset)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../envs/continuous_insertion.py", line 382, in __initialize__
    renderer = sapien.VulkanRenderer(device="cuda:0", offscreen_only=True)
RuntimeError: Cannot find cuda device suitable for rendering cuda:0

The agent had already interacted with the gym environment before the above error occurred, and the corresponding GPU has enough free memory. The agent model is on the cuda:2 device. I would like to know why this error occurs and how to fix it.

Expected behavior: The renderer is created successfully and set on the SAPIEN engine.

Additional context: Initially I ran the code without device="cuda:0", and a different error occurred:

File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../encoded_env/custom_td3.py", line 341, in learn
    return super().learn(
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 356, in learn
    rollout = self.collect_rollouts(
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 597, in collect_rollouts
    if callback.on_step() is False:
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 100, in on_step
    return self._on_step()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 204, in _on_step
    continue_training = callback.on_step() and continue_training
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 100, in on_step
    return self._on_step()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 447, in _on_step
    episode_rewards, episode_lengths = evaluate_policy(
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/evaluation.py", line 86, in evaluate_policy
    actions, states = model.predict(observations, state=states, episode_start=episode_starts, deterministic=deterministic)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 632, in predict
    return self.policy.predict(observation, state, episode_start, deterministic)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 333, in predict
    observation, vectorized_env = self.obs_to_tensor(observation)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 252, in obs_to_tensor
    observation = obs_as_tensor(observation, self.device)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 466, in obs_as_tensor
    return {key: th.as_tensor(_obs).to(device) for (key, _obs) in obs.items()}
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 466, in <dictcomp>
    return {key: th.as_tensor(_obs).to(device) for (key, _obs) in obs.items()}
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I suspect this happened because the renderer and the agent were using the same GPU, which is why I then chose to specify the renderer's device explicitly.
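The PyTorch message above suggests CUDA_LAUNCH_BLOCKING=1, because asynchronous kernel errors can be reported at an unrelated call such as the `.to(device)` in this traceback. A minimal sketch of enabling it from inside the training script follows; it assumes the variable is set before the first CUDA call (safest is before `import torch`), and the helper name `cuda_debug_enabled` is made up for illustration.

```python
import os

# Request synchronous kernel launches so the traceback points at the call
# that actually failed. This must run before CUDA is initialized, so place
# it at the very top of the entry script, before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_debug_enabled() -> bool:
    """Hypothetical helper: report whether synchronous launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


print("synchronous CUDA launches requested:", cuda_debug_enabled())

# import torch  # import CUDA-using libraries only after this point
```

Worker processes started afterwards inherit the environment, so the setting should also reach environments that run in subprocesses. It slows training down and is only meant for locating the failing call.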

fbxiang commented 12 months ago

Did you set cuda:2 in your torch code, or did you set CUDA_VISIBLE_DEVICES=2? Setting CUDA_VISIBLE_DEVICES=2 makes the GPU with index 2 appear as cuda:0.
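To illustrate the remapping described here, the sketch below simulates it in plain Python without touching a GPU. The function names are hypothetical, only integer indices are handled (CUDA also accepts device UUIDs), and the rule that parsing stops at the first invalid entry reflects my understanding of the CUDA documentation rather than anything stated in this thread.

```python
from typing import List, Optional


def visible_devices(cuda_visible_devices: Optional[str], physical_count: int) -> List[int]:
    """Return physical GPU indices in the order a process would see them."""
    if cuda_visible_devices is None:
        # Variable unset: every GPU is visible under its own index.
        return list(range(physical_count))
    visible: List[int] = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            break  # assumption: entries after the first invalid one are ignored
        visible.append(int(token))
    return visible


def logical_name(physical_index: int, cuda_visible_devices: Optional[str],
                 physical_count: int) -> Optional[str]:
    """Map a physical GPU index to the "cuda:N" name seen inside the process."""
    devices = visible_devices(cuda_visible_devices, physical_count)
    if physical_index not in devices:
        return None  # this GPU is hidden from the process
    return f"cuda:{devices.index(physical_index)}"


print(logical_name(2, "2", physical_count=4))   # cuda:0
print(logical_name(2, None, physical_count=4))  # cuda:2
print(logical_name(0, "2", physical_count=4))   # None
```

With CUDA_VISIBLE_DEVICES=2, a request for cuda:2 therefore refers to a device the process cannot see, while cuda:0 refers to physical GPU 2.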

Pigbrainlyh commented 12 months ago

> Did you set cuda:2 in your torch code, or did you set CUDA_VISIBLE_DEVICES=2? Setting CUDA_VISIBLE_DEVICES=2 makes the GPU with index 2 appear as cuda:0.

I set cuda:2 in my torch code.

fbxiang commented 11 months ago

Given the current information, it is very hard to understand what is causing the error. We are currently developing SAPIEN 3, and it should come with more error detection and workarounds in the renderer. It will have a few API changes and I am not sure if it will greatly affect your code. Our latest development version will always be posted to the release page (Nightly Release). You can try it out and see if the error still exists.

minghaoguo20 commented 1 month ago

Hi, I am hitting the same error. Is there a known solution or a related issue? Thanks.

guangnianyuji commented 1 month ago

I am hitting the same error. Any solution would be highly appreciated... 🧎🧎🧎

StoneT2000 commented 1 month ago

What version of SAPIEN are you using? And are you testing this via ManiSkill?

eyrs42 commented 1 month ago

Hi, I ran into this error too. I am using SAPIEN 2.2.2 and tried it on both an NVIDIA A40 and an A5000.

I was testing it via ManiSkill, but the error occurs even if I run:

>>> import sapien.core as sapien
>>> sapien.SapienRenderer(offscreen_only=True, device="cuda:0")

Any help would be highly appreciated.

StoneT2000 commented 1 month ago

@srye2 can you install sapien==3.0.0b1 and run

sapien info --all

If it doesn't work, please follow https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#troubleshooting and make sure the relevant packages and files are installed.
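Before upgrading, it may help to confirm which SAPIEN version the active environment actually has. The sketch below is one way to do that with the standard library only; it assumes the package is installed under the distribution name `sapien`, and the helper names are hypothetical.

```python
import re
from importlib import metadata
from typing import Optional


def installed_version(package: str = "sapien") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version: str) -> int:
    """Extract the leading major number, e.g. 3 from "3.0.0b1"."""
    match = re.match(r"\d+", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group())


version = installed_version()
if version is None:
    print("sapien is not installed in this environment")
elif major_version(version) < 3:
    print(f"sapien {version} found; the 3.x beta suggested above is not installed")
else:
    print(f"sapien {version} found; `sapien info --all` should be available")
```

This only reports the version; it does not check the GPU driver or Vulkan setup, which the troubleshooting page linked above covers.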

eyrs42 commented 1 month ago

Upgrading worked, thank you so much!

haosulab / SAPIEN

Getting "RuntimeError: Cannot find cuda device suitable for rendering cuda:0" when training an RL agent #142