haosulab / SAPIEN

SAPIEN Embodied AI Platform
https://sapien.ucsd.edu/
MIT License

Setting device for SapienRenderer doesn't work #170

Open XYZ-99 opened 3 months ago

XYZ-99 commented 3 months ago

System:

Describe the bug

#139 mentions setting the device for SapienRenderer like this:

sapien_renderer = SapienRenderer(..., device="pci:0")

However, it seems the device couldn't be found:

[2024-08-05 16:43:29.704] [svulkan2] [error] GLFW error: X11: The DISPLAY environment variable is missing
[2024-08-05 16:43:29.704] [svulkan2] [warning] Continue without GLFW.
[2024-08-05 16:43:29.944] [svulkan2] [info] Vulkan instance initialized
[2024-08-05 16:43:29.944] [svulkan2] [info] Devices visible to Vulkan
 Id                                    name   Present Supported    PciBus    CudaId     RayTracing
  0                   NVIDIA H100 80GB HBM3         0         1         0         0              1
  1                   NVIDIA H100 80GB HBM3         0         1         0         0              1

[2024-08-05 16:43:29.944] [svulkan2] [info] Devices visible to Cuda
    CudaId    PciBus             PciBusString
         0         0             000A:00:00.0
         1         0             000B:00:00.0

[2024-08-05 16:43:29.944] [svulkan2] [info] Vulkan finished
0it [00:22, ?it/s]
Traceback:
...
File "[...]/ManiSkill2_real2sim/mani_skill2_real2sim/envs/sapien_env.py", line 107, in __init__
    self._renderer = sapien.SapienRenderer(**renderer_kwargs)
RuntimeError: Cannot find cuda device suitable for rendering cuda:1

P.S. The "[error] GLFW error: X11: The DISPLAY environment variable is missing" line isn't a real error, since I can still run the code if I don't specify a device for SapienRenderer.

I tried both "cuda:1" and "pci:1" but neither worked.

However, my issue shouldn't be the same as #115 because I can run the code without specifying the device.

Could you tell me what the device format should be?
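
The two device tables in the log above can be compared mechanically. The following is a minimal sketch, not part of SAPIEN: it assumes the svulkan2 log keeps the column layout shown in this issue, and the helper names parse_device_tables and cuda_ids_without_vulkan_row are hypothetical. On this log it reports CUDA id 1 as having no matching Vulkan row, which is at least consistent with the "cuda:1" error, though whether that is the actual cause is a guess.

```python
# Excerpt of the log from this issue (timestamps kept as-is).
LOG = """\
[2024-08-05 16:43:29.944] [svulkan2] [info] Devices visible to Vulkan
 Id                                    name   Present Supported    PciBus    CudaId     RayTracing
  0                   NVIDIA H100 80GB HBM3         0         1         0         0              1
  1                   NVIDIA H100 80GB HBM3         0         1         0         0              1

[2024-08-05 16:43:29.944] [svulkan2] [info] Devices visible to Cuda
    CudaId    PciBus             PciBusString
         0         0             000A:00:00.0
         1         0             000B:00:00.0
"""


def parse_device_tables(log):
    """Split the log into (vulkan_rows, cuda_rows), each a list of dicts.

    Assumption: column order matches the tables shown in this issue.
    """
    vulkan, cuda = [], []
    section = None
    for line in log.splitlines():
        if "Devices visible to Vulkan" in line:
            section = "vulkan"
            continue
        if "Devices visible to Cuda" in line:
            section = "cuda"
            continue
        fields = line.split()
        # Skip blank lines, header rows, and unrelated log lines.
        if not fields or not fields[0].isdigit():
            continue
        if section == "vulkan" and len(fields) >= 7:
            # The device name may contain spaces; the last five columns are fixed.
            *name, present, supported, pci_bus, cuda_id, ray_tracing = fields[1:]
            vulkan.append({
                "id": int(fields[0]),
                "name": " ".join(name),
                "present": int(present),
                "supported": int(supported),
                "pci_bus": int(pci_bus),
                "cuda_id": int(cuda_id),
                "ray_tracing": int(ray_tracing),
            })
        elif section == "cuda" and len(fields) == 3:
            cuda.append({
                "cuda_id": int(fields[0]),
                "pci_bus": int(fields[1]),
                "pci_bus_string": fields[2],
            })
    return vulkan, cuda


def cuda_ids_without_vulkan_row(vulkan, cuda):
    """CUDA ids listed in the Cuda table that no Vulkan row reports."""
    seen = {row["cuda_id"] for row in vulkan}
    return sorted(row["cuda_id"] for row in cuda if row["cuda_id"] not in seen)


vulkan_rows, cuda_rows = parse_device_tables(LOG)
print(cuda_ids_without_vulkan_row(vulkan_rows, cuda_rows))
```

Both Vulkan rows report PciBus 0 and CudaId 0, while the Cuda table lists two devices whose PciBusString values differ only in the leading domain field (000A and 000B).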

fbxiang commented 2 months ago

I have recently been getting many different kinds of issues related to the H100, probably because this GPU does not even include rendering cores to run graphics workloads. Your issue seems to be a new one. First, you can try SAPIEN 3.0.0b1; SAPIEN 2 is a bit too old. Next, you can try setting the environment variable SAPIEN_DISABLE_RAY_TRACING=1. Somehow, simply enabling ray tracing can break the H100 completely, even if the driver reports that it supports ray tracing. For now, my own workaround is to avoid the H100 altogether, as it is not a good choice for rendering anyway.
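
The environment-variable suggestion can be tried without changing the shell configuration. This is a minimal sketch under two assumptions: that the variable needs to be set before sapien is imported (the comment above does not say when it is read), and that setting it in the current process is enough. The helper name renderer_env is hypothetical.

```python
import os


def renderer_env(base=None):
    """Return a copy of an environment mapping with SAPIEN ray tracing disabled.

    `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["SAPIEN_DISABLE_RAY_TRACING"] = "1"
    return env


# Option 1: set the variable in this process, before importing sapien.
# (Assumption: the variable is read no earlier than import time.)
os.environ["SAPIEN_DISABLE_RAY_TRACING"] = "1"
# import sapien  # uncomment on a machine where SAPIEN is installed

# Option 2: hand the modified environment to a child process, e.g.
# subprocess.run([sys.executable, "your_script.py"], env=renderer_env(), check=True)
```

Setting the variable in the shell before launching Python (SAPIEN_DISABLE_RAY_TRACING=1 python your_script.py) should be equivalent to Option 2.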

fbxiang commented 2 months ago

Regarding the X11 error, your observation is correct. SAPIEN logs an "error" when it can successfully work around the problem (in this case, SAPIEN simply disables on-screen display); otherwise it throws an exception.