Open · XYZ-99 opened this issue 1 month ago
The out-of-memory error is actually due to fence exhaustion; the Vulkan message is a bit misleading. The driver is simply saying it cannot create more fences. SAPIEN creates a fence for each camera to synchronize its work, and when you create many cameras across many processes, you appear to hit the driver-defined, system-wide limit on how many fences can be created. We do not have a workaround for this, but in SAPIEN 3 and ManiSkill3 we have introduced GPU-batched cameras, which allow many cameras to run at once while sharing the same fence. Simulating and rendering everything in the same process (even on the same GPU) is also much more efficient than ManiSkill2 (often 10-100x faster), so that is now the recommended way to use batched environments, as sketched below. If you would like to switch to ManiSkill3, the documentation is here: https://maniskill.readthedocs.io/en/latest/user_guide/
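A minimal sketch of what that looks like in ManiSkill3 (the task id and keyword arguments here are illustrative; the linked docs are authoritative):

```python
# Illustrative sketch: one process, one GPU-batched simulation + rendering
# context, so many cameras no longer each need their own Vulkan fence.
import gymnasium as gym
import mani_skill.envs  # registers the ManiSkill3 environments

env = gym.make(
    "PickCube-v1",      # any registered ManiSkill3 task id (example only)
    num_envs=128,       # number of parallel scenes simulated in this process
    obs_mode="rgb",     # request camera observations
    sim_backend="gpu",  # GPU-batched physics and rendering
)
obs, _ = env.reset(seed=0)
for _ in range(10):
    # Observations and rewards come back batched (torch tensors on the GPU).
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```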
For the freezing issue, I cannot really tell what is causing it. I may be able to take a look if you can reproduce it with pure SAPIEN code (something along the lines of the sketch below). SimplerEnv appears to be built on top of ManiSkill2, which puts it two layers of encapsulation away from SAPIEN, so many things could go wrong.
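For concreteness, a pure-SAPIEN reproduction could look roughly like this (SAPIEN 2.x API; the camera parameters, loop counts, and process count are placeholders, not the reporter's actual settings):

```python
# Placeholder reproduction harness: each process builds its own renderer,
# scene, and camera, then calls take_picture() in a loop.
import multiprocessing as mp
import sapien.core as sapien

def worker(idx: int) -> None:
    # Optionally enable the ray-tracing camera shader; the exact config
    # attribute may differ across SAPIEN 2.x versions.
    sapien.render_config.camera_shader_dir = "rt"
    engine = sapien.Engine()
    renderer = sapien.SapienRenderer()
    engine.set_renderer(renderer)
    scene = engine.create_scene()
    scene.add_ground(0)
    camera = scene.add_camera("cam", width=128, height=128, fovy=1.0, near=0.01, far=10)
    for _ in range(100):
        scene.step()
        scene.update_render()
        camera.take_picture()  # the call that reportedly deadlocks / exhausts fences
    print(f"worker {idx} finished")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker, args=(i,)) for i in range(16)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```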
@XYZ-99 since you are using SIMPLER: I am still migrating it over to ManiSkill 3, so you can't quite use the new system yet. Hopefully I will soon have some SIMPLER example environments ported over to the GPU sim + rendering system for faster batched evaluation; currently only some of the CPU side of things is done.
@fbxiang Thank you for your timely response!
I actually partially migrated SIMPLER just yesterday. See this for how to access the parallelized SIMPLER environments (only the Bridge dataset at the moment) and how to run inference on them quickly: https://maniskill.readthedocs.io/en/latest/tasks/digital_twins/index.html#bridgedata-v2-evaluation
It runs about 60-100x faster than real-world evaluation speed and about 10x faster than CPU sim (and even faster if you have a good GPU). A rough usage sketch is below.
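The environment id below is my assumption; the docs page linked above has the actual task list and full inference examples:

```python
# Sketch: GPU-batched evaluation loop for a BridgeData v2 digital-twin task.
# Replace the random actions with your policy's batched actions.
import gymnasium as gym
import mani_skill.envs  # registers the ManiSkill3 / digital-twin environments

env = gym.make(
    "PutCarrotOnPlateInScene-v1",  # assumed BridgeData v2 task id, see the docs
    num_envs=64,                   # episodes evaluated in parallel on the GPU
    obs_mode="rgb",
)
obs, _ = env.reset(seed=0)
for _ in range(80):  # roughly one episode horizon
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```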
@StoneT2000 Thank you for your reply! I actually just saw your tweets on GPU-parallelized real2sim. May I ask about your timeline for this line of work on the Google robot (variant aggregation and visual matching)? No rush; I'm just trying to assess how hard, and how necessary, it would be to port SIMPLER to ManiSkill3 myself, probably by imitating what you did for Bridge, if all I need is your GPU-parallelized environment to accelerate evaluation.
The Google robot is complicated because it originally uses Ruckig for the controller. I don't know whether a GPU-parallelized version of the Google robot controller exists, so someone would need to write CUDA / PyTorch code to imitate that controller's behavior. This is the hardest part, and it is why I have avoided adding the Google robot tasks for now.
It's possible, but not trivial, so the timeline is uncertain. There is also a chance that one could approximate the controller using existing ManiSkill controllers and tuning them heavily; I haven't investigated this deeply though.
It is currently much easier to add new environments for robots that use existing controllers (like pd joint delta pos, or IK-based ones that control the end effector, like the robot in the Bridge dataset). The kind of GPU-friendly approximation mentioned above is sketched below.
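To make the idea concrete, here is an illustrative PyTorch sketch of a batched pd joint delta pos style controller; this is not the Ruckig-based Google robot controller, just the kind of approximation that would need heavy tuning against it:

```python
# Illustrative only: a batched PD "joint delta position" controller in PyTorch.
# Gains, clipping limits, and shapes are placeholders; Ruckig's jerk-limited
# trajectory shaping is exactly what this simple approximation leaves out.
import torch

class BatchedPDJointDeltaPos:
    def __init__(self, num_envs: int, num_joints: int,
                 kp: float = 100.0, kd: float = 5.0,
                 max_delta: float = 0.1, device: str = "cuda"):
        self.kp, self.kd, self.max_delta = kp, kd, max_delta
        self.target = torch.zeros(num_envs, num_joints, device=device)

    def set_action(self, qpos: torch.Tensor, delta: torch.Tensor) -> None:
        # The action is a per-joint delta, clipped and added to the current qpos.
        self.target = qpos + delta.clamp(-self.max_delta, self.max_delta)

    def compute_torque(self, qpos: torch.Tensor, qvel: torch.Tensor) -> torch.Tensor:
        # Simple PD law driving the joints toward the commanded target positions.
        return self.kp * (self.target - qpos) - self.kd * qvel
```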
Got it! It sounds non-trivial to take advantage of GPU parallelization for the Google robot then. I still greatly appreciate your work on Bridge, and I look forward to any forthcoming updates on ManiSkill!
@fbxiang Digging deeper, the camera is actually stuck at this line: https://github.com/haosulab/sapien-vulkan-2/blob/4914f8747a8cf9f0138c8dbc93e972df19a307d0/src/renderer/rt_renderer.cpp#L377 I don't know whether this helps your diagnosis, but just FYI.
System:
Describe the bug
When I launch `multiprocessing` jobs, I have seen in my own log files a traceback which might be related to this problem:
I'm not sure if they are related.
P.S. 1. I am 80% sure that my GPUs didn't go out of memory, even though it says `ErrorOutOfHostMemory`.
P.S. 2. This error has only been seen a few times when I launch multiprocessing jobs; in most cases, multiprocessing also just gets stuck without throwing an error.
P.S. 3. When I run with a single process, the bug can also occur (but never with this traceback; it's just frozen forever), though in that case it's very unlikely to go out of memory.
To Reproduce
`take_picture` ends up in a deadlock when it tries to acquire some kind of resource from the GPU?
Expected behavior
The `gym.make` doesn't get stuck. Or at least it should throw an error.
Screenshots
No.
Additional context
I tried on H100 and L40. This bug can occur on both types.
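The original reproduction script is not shown above; the following is only a hypothetical sketch of the setup described in this report (several multiprocessing workers, each creating a ManiSkill2 environment via `gym.make` and rendering camera observations; the environment id and observation mode are my assumptions):

```python
# Hypothetical sketch only, not the reporter's original code. Each worker
# creates its own ManiSkill2 environment (env id and obs_mode are assumptions)
# and steps it with camera rendering; the reported symptom is that creation or
# rendering occasionally freezes, or raises ErrorOutOfHostMemory, when many
# such workers run at once.
import multiprocessing as mp
import gymnasium as gym
import mani_skill2.envs  # registers the ManiSkill2 environments

def worker(idx: int) -> None:
    env = gym.make("PickCube-v0", obs_mode="rgbd")  # camera obs force rendering
    obs, _ = env.reset(seed=idx)
    for _ in range(50):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    env.close()
    print(f"worker {idx} done")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker, args=(i,)) for i in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```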