GPU 0 is always used in a multi-GPU setup

nikolai-franke commented 1 year ago

System:

OS version: Red Hat Enterprise Linux (RHEL) 8.x
Python version: Python 3.10 and Python 3.9
SAPIEN version: sapien==2.2.2
Environment: Server with xvfb

Describe the bug

SAPIEN always uses GPU 0 in a multi-GPU setup, in addition to the GPU specified by CUDA_VISIBLE_DEVICES.

To Reproduce

1. Run the modified examples/robotics/basic_robot.py script (the only difference is that there is no Viewer), https://pastebin.com/abuJeuVG, with CUDA_VISIBLE_DEVICES=0
2. Run the same modified script with CUDA_VISIBLE_DEVICES=1
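The two runs above can be scripted so that only CUDA_VISIBLE_DEVICES changes between them. This is a minimal sketch, not part of the original report: the helper names `env_for_gpu` and `run_on_gpu` are made up for illustration, and the script path is whatever local file holds the pastebin code.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    base_env defaults to the current process environment; it is copied so
    the caller's environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(script_path, gpu_index):
    """Run script_path in a child Python process restricted to one GPU index."""
    return subprocess.run(
        [sys.executable, script_path],
        env=env_for_gpu(gpu_index),
        check=True,  # raise if the script exits with a non-zero status
    )
```

Calling `run_on_gpu(path, 0)` and then `run_on_gpu(path, 1)` reproduces the two steps, where `path` points at the saved script.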

Expected behavior

When checking GPU usage, only the selected GPU should be in use. With CUDA_VISIBLE_DEVICES=0, that is the case. With CUDA_VISIBLE_DEVICES=1, both GPU 0 and GPU 1 are used.
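One way to check GPU usage programmatically is to compare per-GPU memory before and while the script runs. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu=index,memory.used` CSV output; the function names are illustrative, and the report does not say which tool was actually used.

```python
import subprocess

# Ask nvidia-smi for one "index, used MiB" row per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(csv_text):
    """Parse rows like '0, 345' into {gpu_index: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def memory_delta(before, after):
    """Return the per-GPU change in used memory (MiB) between two snapshots."""
    return {gpu: after[gpu] - before.get(gpu, 0) for gpu in after}


def read_memory_used():
    """Take a snapshot of used memory on every GPU visible to nvidia-smi."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)
```

Taking one snapshot before launching the simulation and one while it runs, then passing both to `memory_delta`, shows which GPUs gained memory. A non-zero delta on GPU 0 with CUDA_VISIBLE_DEVICES=1 would match the behavior described here.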

Screenshots

[Image: cuda_0, taken with CUDA_VISIBLE_DEVICES=0]
[Image: cuda_1, taken with CUDA_VISIBLE_DEVICES=1]

Additional context

Even though GPU 0 is only used a little when CUDA_VISIBLE_DEVICES=1, this usage quickly adds up when running many parallel simulations. I am using ManiSkill2 for reinforcement learning on an HPC node with 4 Nvidia A100 GPUs, and this bug severely limits the number of parallel environments I can run. Additionally, running many parallel environments becomes slow, since GPU 0 is used by every single simulation environment instead of just a quarter of them.

fbxiang commented 1 year ago

You may try passing offscreen_only=True to the SapienRenderer constructor. This behavior will be changed in the future (to make the CUDA device take higher priority than on-screen rendering).
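A minimal sketch of this suggestion, assuming the `sapien.core` import path used by the SAPIEN 2.x examples. The helper names are made up for illustration, and the SAPIEN import is deferred so the file can be loaded on a machine without SAPIEN installed.

```python
def offscreen_renderer_kwargs():
    """Keyword arguments for SapienRenderer, as suggested in the comment above."""
    return {"offscreen_only": True}


def make_offscreen_renderer():
    """Create an engine with an offscreen-only renderer attached."""
    # Deferred import: SAPIEN is only required when a renderer is actually built.
    import sapien.core as sapien

    engine = sapien.Engine()
    renderer = sapien.SapienRenderer(**offscreen_renderer_kwargs())
    engine.set_renderer(renderer)
    return engine, renderer
```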

nikolai-franke commented 1 year ago

Passing offscreen_only=True doesn't make a difference.

fbxiang commented 1 year ago

I cannot figure out what is causing the issue. I think you should set the PCI id of the device you want to use directly. This method requires a bit of setup but should never fail. First, before creating anything with SAPIEN, run sapien.SapienRenderer.set_log_level("info"). Next, run your code. You will see a table listing the devices visible to Vulkan, including all your GPUs, each with a PciBus field. The PciBus value is unique to each of your physical GPUs. Then, when you create the SapienRenderer, pass device="pci:x", where x is the PciBus id shown in the log. This should bypass all other checks.
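The steps above can be sketched as follows, assuming the `sapien.core` import path from the SAPIEN 2.x examples. The helper names are illustrative, and the PciBus value is expected to be copied exactly as the log prints it, since its format is not shown in this thread.

```python
def pci_device_string(pci_bus):
    """Format a PciBus id from the Vulkan device table as a `device` argument."""
    return f"pci:{pci_bus}"


def make_renderer_on_pci_bus(pci_bus):
    """Create an engine whose renderer is pinned to one GPU by PciBus id."""
    # Deferred import: SAPIEN is only required when a renderer is actually built.
    import sapien.core as sapien

    # Must be called before the renderer is created so that the device table
    # and the selected GPU are printed to the console.
    sapien.SapienRenderer.set_log_level("info")

    engine = sapien.Engine()
    renderer = sapien.SapienRenderer(device=pci_device_string(pci_bus))
    engine.set_renderer(renderer)
    return engine, renderer
```

On a first run the PciBus id is not yet known, so the renderer can be created without `device` once, just to read the table from the log.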

nikolai-franke commented 1 year ago

Thank you very much for your answer! Sadly the result is still exactly the same. GPU 0 always gets used, even when selecting another GPU via PCI address.

fbxiang commented 1 year ago

Are you using sapien==2.2.2? I have verified that the GPU selection feature is working. You can try calling sapien.SapienRenderer.set_log_level("info") before creating the renderer. It will list all available GPUs in the console and tell you which GPU is selected for rendering. Since an incorrect PCI id would result in an error, my guess is that some other program is running on your GPU 0 and it is not the SAPIEN renderer.
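To test the guess that another program occupies GPU 0, the processes on each GPU can be listed by PCI bus id. This sketch assumes `nvidia-smi` is on the PATH and uses its `--query-compute-apps` CSV output; the function names are illustrative. As far as I know this query reports CUDA compute contexts only, so a process that touches a GPU purely through Vulkan may appear only in the plain `nvidia-smi` process table.

```python
import subprocess

# One row per compute process: GPU bus id, process id, process name, memory.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=gpu_bus_id,pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Group rows like '00000000:3B:00.0, 1234, python, 305' by GPU bus id.

    Returns {gpu_bus_id: [(pid, process_name), ...]}.
    """
    by_gpu = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        bus_id, pid = fields[0], int(fields[1])
        # The name sits between the pid and the trailing memory column.
        name = ",".join(fields[2:-1])
        by_gpu.setdefault(bus_id, []).append((pid, name))
    return by_gpu


def read_compute_apps():
    """Query nvidia-smi and return the processes currently on each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

If the same PID shows up under two bus ids while a single simulation is running, the extra usage comes from that simulation process and not from an unrelated program.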

balazsgyenes commented 11 months ago

I'm actually having the same issue.

haosulab / SAPIEN

GPU 0 is always used in a multi-GPU setup #139