haosulab / SAPIEN

SAPIEN Embodied AI Platform
https://sapien.ucsd.edu/
MIT License

Camera take_picture gets stuck sometimes #171

Open XYZ-99 opened 1 month ago

XYZ-99 commented 1 month ago

System:

Describe the bug

  1. Although I encountered this bug while running SimplerEnv, I think it is more related to how SAPIEN takes pictures for cameras. Feel free to redirect me to SimplerEnv if you think this issue belongs there instead.
  2. When I create a new environment, using this, the process sometimes (but not always) freezes forever at the step that takes a picture. When this happens, I cannot even Ctrl-C the process, and it (almost) never throws an error.
  3. Occasionally, when I launch multiprocessing jobs, I have seen a traceback in my own log files that might be related to this problem:
Traceback (most recent call last):
  File "[...]/anaconda3/envs/legion_eval/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "[...]/anaconda3/envs/legion_eval/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "[...]/anaconda3/envs/legion_eval/lib/python3.10/site-packages/tianshou/env/worker/subproc.py", line 90, in _worker
    env_return = env.step(data)
  File "[...]/workspace/code/TriBench/tribench/eval/simpler/simplerenv_wrapper.py", line 202, in step
    obs, reward, done, truncated, info = self.env.step(
  File "[...]/anaconda3/envs/legion_eval/lib/python3.10/site-packages/gymnasium/wrappers/time_limit.py", line 57, in step
    observation, reward, terminated, truncated, info = self.env.step(action)
  File "[...]/anaconda3/envs/legion_eval/lib/python3.10/site-packages/gymnasium/wrappers/order_enforcing.py", line 56, in step
    return self.env.step(action)
  File "[...]/anaconda3/envs/legion_eval/lib/python3.10/site-packages/gymnasium/core.py", line 522, in step
    observation, reward, terminated, truncated, info = self.env.step(action)
  File "[...]/workspace/code/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/custom_scenes/place_in_closed_drawer_in_scene.py", line 246, in step
    return super().step(action)
  File "[...]/workspace/code/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/sapien_env.py", line 553, in step
    obs = self.get_obs()
  File "[...]/workspace/code/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/custom_scenes/base_env.py", line 350, in get_obs
    obs = super().get_obs()
  File "[...]/workspace/code/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/sapien_env.py", line 265, in get_obs
    return self._get_obs_images()
  File "[...]/workspace/code/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/sapien_env.py", line 313, in _get_obs_images
    self.take_picture()
  File "[...]/workspace/code/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/sapien_env.py", line 289, in take_picture
    cam.take_picture()
  File "[...]/workspace/code/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/sensors/camera.py", line 187, in take_picture
    self.camera.take_picture()
RuntimeError: vk::Device::createFenceUnique: ErrorOutOfHostMemory

I'm not sure whether they are related. P.S. 1: I am 80% sure my GPUs did not run out of memory, even though the message says ErrorOutOfHostMemory. P.S. 2: This error has only appeared a few times when I launch multiprocessing jobs; in most cases multiprocessing also just gets stuck without throwing an error. P.S. 3: The bug can also occur when I run a single process (never with this traceback; the process just freezes forever), and in that case running out of memory is very unlikely.

To Reproduce

  1. As far as I can tell, this is a probabilistic bug and might have something to do with the GPU model or system-level software. I'm not sure which combination reproduces it, but I'm willing to provide any further information to narrow it down.
  2. Since I am running SimplerEnv to evaluate my own agent model, the probability of the bug varies across models. My guess is that it has something to do with the GPU or the driver: maybe take_picture ends up in a deadlock when it tries to acquire some kind of resource from the GPU?
  3. When I don't load the model and only instantiate the environment, it seems fine and extremely unlikely to get stuck (I ran an infinite loop that repeatedly instantiates and closes the environment for an hour without it getting stuck; a sketch of that loop is below).
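Roughly, a sketch of that loop (the task ID is a placeholder, and the import that registers the tasks is an assumption about my setup):

    import gymnasium as gym
    import mani_skill2_real2sim.envs  # assumed to register the SimplerEnv tasks

    while True:
        env = gym.make("SomeSimplerTask-v0")  # placeholder task ID
        obs, info = env.reset()               # construct and reset the env
        env.close()                           # tear it down and repeat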

Expected behavior

The gym.make call should not get stuck; or at least it should throw an error.

Screenshots

No.

Additional context

I tried on H100 and L40. This bug can occur on both types.

fbxiang commented 1 month ago

The out-of-memory error is actually due to exhaustion of fences; the Vulkan message is a bit misleading. The driver is simply saying it cannot create more fences. For each camera, SAPIEN creates a fence to synchronize its work, and when many cameras are created across many processes, this seems to hit a driver-defined, system-wide limit on the number of fences.

While we do not have a workaround for this issue, SAPIEN 3 and ManiSkill3 introduce GPU-batched cameras that allow many cameras to run at once while sharing the same fence. Simulating and rendering everything in the same process (even on the same GPU) is also much more efficient than ManiSkill2 (often 10-100x faster), so that is now the recommended way to run batched environments. If you would like to switch to ManiSkill3, you can find the documentation here: https://maniskill.readthedocs.io/en/latest/user_guide/

For the freezing issue, I cannot really tell what is causing it. I may be able to take a look if you can reproduce it with pure SAPIEN code. SimplerEnv sits on top of ManiSkill2, which is two layers of encapsulation above SAPIEN, and many things could go wrong in between.
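A pure-SAPIEN loop along these lines might be a starting point (a rough sketch against the SAPIEN 2.x Python API; exact names may differ across versions, and the ray-tracing config line is an assumption):

    import numpy as np
    import sapien.core as sapien

    # sapien.render_config.camera_shader_dir = "rt"  # only if the hang is specific to
    #                                                # the ray-tracing renderer (name assumed)
    engine = sapien.Engine()
    renderer = sapien.SapienRenderer()
    engine.set_renderer(renderer)

    scene = engine.create_scene()
    scene.set_timestep(1 / 100.0)
    scene.add_ground(0)
    scene.set_ambient_light([0.5, 0.5, 0.5])

    camera = scene.add_camera(
        name="camera", width=640, height=480, fovy=np.deg2rad(35), near=0.1, far=100.0
    )

    # Render in a tight loop; if the hang is in SAPIEN itself, take_picture()
    # should eventually block here as well.
    for _ in range(100_000):
        scene.step()
        scene.update_render()
        camera.take_picture()
        rgba = camera.get_float_texture("Color")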

StoneT2000 commented 1 month ago

@XYZ-99 since you are using SIMPLER: I am still migrating it over to ManiSkill 3, so you can't quite use the new system yet. Hopefully I will soon have some example SIMPLER environments ported to the GPU sim + rendering system for faster batched evaluation; currently only part of the CPU side is done.

XYZ-99 commented 1 month ago

@fbxiang Thank you for your timely response!

  1. "The out-of-memory is actually due to exhaustion of fences." Then it is strange enough that when I run the same script multiple times with a fixed number of environments, this error sometimes occurs but sometimes doesn't.
  2. Also, if the fence is used to "synchronize", it seems suspicious to me if my "stuck" problem arises from a deadlock when the processes try to synchronize. I will do more investigation and share any information that I think might be helpful to you to identify the problem.
  3. Thank you for the reminder of the release of ManiSkill3! However, since my use case is mostly in SIMPLER, it might be non-trivial to myself how to migrate SIMPLER to ManiSkill3.
StoneT2000 commented 1 month ago

I actually partially migrated SIMPLER just yesterday. See this page for how to access the parallelized SIMPLER environments (only the bridge dataset at the moment) and how to run inference on them quickly: https://maniskill.readthedocs.io/en/latest/tasks/digital_twins/index.html#bridgedata-v2-evaluation

It runs about 60-100x faster than real-world evaluation speed and about 10x faster than the CPU sim (more if you have a good GPU).
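For reference, getting a batched environment up might look roughly like this (the task ID and keyword arguments are assumptions; check the linked BridgeData V2 evaluation page for the exact names):

    import gymnasium as gym
    import mani_skill.envs  # registers the ManiSkill3 environments

    # The task ID below is an assumed example of a BridgeData V2 digital-twin task.
    env = gym.make(
        "PutCarrotOnPlateInScene-v1",
        num_envs=16,     # GPU-parallel environments in a single process
        obs_mode="rgb",  # batched image observations
    )
    obs, info = env.reset(seed=0)
    for _ in range(100):
        action = env.action_space.sample()  # replace with your policy's actions
        obs, reward, terminated, truncated, info = env.step(action)
    env.close()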

XYZ-99 commented 1 month ago

@StoneT2000 Thank you for your reply! I have actually just seen your tweets on GPU-parallelized real2sim. May I ask about your timeline for this line of work on the Google robot (variant aggregation and visual matching)? No rush; I'm just trying to assess how hard and how necessary it is for me to port SIMPLER to ManiSkill3 myself, probably by imitating what you did for bridge, if all I need is your GPU-parallelized environment to accelerate evaluation.

StoneT2000 commented 1 month ago

The Google robot is complicated because it originally uses Ruckig for the controller. I don't know of a GPU-parallelized version of the Google robot controller, so someone would need to write CUDA / PyTorch code to imitate that controller's behavior. This is the hardest part and why I have avoided adding Google robot tasks for now.

It's possible but not trivial, so the timeline is uncertain. However, it may also be possible to approximate the controller with existing ManiSkill controllers and heavy tuning; I haven't investigated this deeply.

It is now easier to add new environments for robots that use existing controllers (like pd_joint_delta_pos, or IK-based controllers that control the end effector, as the robot in the bridge dataset does).

XYZ-99 commented 1 month ago

Got it! It sounds non-trivial to take advantage of the GPU parallelization for the Google robot, then. I still really appreciate your work on bridge and look forward to any forthcoming updates on ManiSkill!

XYZ-99 commented 3 weeks ago

@fbxiang It turns out the camera is actually stuck at this line: https://github.com/haosulab/sapien-vulkan-2/blob/4914f8747a8cf9f0138c8dbc93e972df19a307d0/src/renderer/rt_renderer.cpp#L377 I don't know whether this helps your diagnosis; just FYI.