haosulab / ManiSkill

SAPIEN Manipulation Skill Framework, a GPU parallelized robotics simulator and benchmark
https://maniskill.ai/
Apache License 2.0

Run using Singularity on Slurm cluster with headless GPU #265

Open JankowskiChristopher opened 5 months ago

JankowskiChristopher commented 5 months ago

Hi, I have a question similar to #102. I am trying to run ManiSkill2 on a Slurm cluster with headless GPUs. Because I do not have sudo, I cannot copy the recommended JSON files from the troubleshooting section, so I am relying on the official Docker image via Singularity, which is available on the cluster. I am trying to run a minimal example like this:

srun --partition=common --gres=gpu:1 singularity exec --nv  docker://haosulab/mani-skill2  python -c "import gym; import mani_skill2.envs; env = gym.make('PickCube-v0', obs_mode='state', control_mode='pd_ee_delta_pos', render_camera_cfgs=dict(width=384, height=384))"

Unfortunately I get an error:

[2024-04-07 01:40:03.621] [svulkan2] [error] GLFW error: X11: The DISPLAY environment variable is missing
[2024-04-07 01:40:03.621] [svulkan2] [warning] Continue without GLFW.
[2024-04-07 01:40:03.636] [svulkan2] [error] Some required Vulkan extension is not present. You may not use the renderer to render, however, CPU resources will be still available.
srun: error: steven: task 0: Segmentation fault (core dumped)

Running Singularity without the --nv flag gives the same result.

Do you know how I can fix this, or how I can test why it is not working? Maybe I should run it with Singularity differently? I tested this command on Titan V and RTX 2080 Ti GPUs. Inside the Singularity container, all 3 JSON files you mention in the troubleshooting section are present.

I would also like to add that I am working on a Reinforcement Learning project with state observations, so I guess I do not need rendering. Maybe that is a workaround for Vulkan? I asked our admin to install Vulkan on our cluster, so it is available there, but unfortunately the JSON files you mention in troubleshooting are not available due to conflicts with other cluster functionality. That is why I wanted to rely on Singularity.
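One way to test why it is not working is to check, from inside the container, which Vulkan-related files and environment variables the Python process can actually see. The following is a minimal sketch under assumptions: the three paths are a guess at the files the troubleshooting section refers to, and `report_vulkan_setup` is a hypothetical helper, not part of ManiSkill.

```python
import os
from pathlib import Path

# ASSUMPTION: typical locations of the three NVIDIA JSON files. Adjust them to
# match the troubleshooting section of the docs and your container layout.
ASSUMED_JSON_FILES = (
    "/usr/share/vulkan/icd.d/nvidia_icd.json",
    "/usr/share/glvnd/egl_vendor.d/10_nvidia.json",
    "/etc/vulkan/implicit_layer.d/nvidia_layers.json",
)


def report_vulkan_setup(paths=ASSUMED_JSON_FILES, environ=None):
    """Return which of the given files exist and which env vars are set.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    return {
        # True if the path exists and is a regular file.
        "files": {str(p): Path(p).is_file() for p in paths},
        # DISPLAY is what GLFW complains about; VK_ICD_FILENAMES can
        # override which ICD manifests the Vulkan loader reads.
        "env": {name: environ.get(name) for name in ("DISPLAY", "VK_ICD_FILENAMES")},
    }


if __name__ == "__main__":
    report = report_vulkan_setup()
    for path, present in report["files"].items():
        print(f"{'found  ' if present else 'MISSING'} {path}")
    for name, value in report["env"].items():
        print(f"{name}={value!r}")
```

Running the script through the same srun and singularity exec invocation as the failing command shows what the job sees on the compute node, which may differ from what an interactive shell on the login node sees.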

JankowskiChristopher commented 5 months ago

Stack trace:

Current thread 0x00007f0a2f60d4c0 (most recent call first):
  File "/opt/conda/lib/python3.8/site-packages/mani_skill2/sensors/camera.py", line 143 in __init__
  File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/sapien_env.py", line 407 in _setup_cameras
  File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/sapien_env.py", line 360 in reconfigure
  File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/sapien_env.py", line 473 in reset
  File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/sapien_env.py", line 178 in __init__
  File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/pick_and_place/base_env.py", line 27 in __init__
  File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/pick_and_place/pick_cube.py", line 22 in __init__
  File "/opt/conda/lib/python3.8/site-packages/mani_skill2/utils/registration.py", line 34 in make
  File "/opt/conda/lib/python3.8/site-packages/mani_skill2/utils/registration.py", line 92 in make
  File "/opt/conda/lib/python3.8/site-packages/gym/envs/registration.py", line 87 in make
  File "/opt/conda/lib/python3.8/site-packages/gym/envs/registration.py", line 129 in make
  File "/opt/conda/lib/python3.8/site-packages/gym/envs/registration.py", line 235 in make
  File "<string>", line 1 in <module>
srun: error: steven: task 0: Segmentation fault (core dumped)

StoneT2000 commented 5 months ago

Unfortunately, Vulkan is required for ManiSkill 2. We are trying to remove that requirement in ManiSkill3 so that you can run purely state-based. That also means you can't render videos of progress, though, so you might have to execute the policy on a machine where rendering works.

@fbxiang can better debug this problem.

In the meantime, I will try to make another Docker container with a beta version of ManiSkill3 (it might include some general fixes that avoid some problems with Vulkan) and let you try running that to debug as well. I'll let you know when it is ready.

JankowskiChristopher commented 5 months ago

Thanks

fbxiang commented 5 months ago

I think the most common issue is that many servers have "headless drivers" installed. However, those drivers actually do more than run headless: they disable rendering completely (headless or not). The best solution is to correct the driver installation, but that may or may not be viable depending on who manages the cluster. In the future, we will provide a fallback to a CPU renderer.
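A rough way to check for this from inside the container is to read the Vulkan ICD manifest and try to load the driver library it names. This is only a sketch under assumptions: the manifest path is a guess at a typical location, and `icd_library_path` and `icd_library_loads` are hypothetical helpers. A successful load does not prove that rendering works, but a failed load suggests the rendering part of the driver is not installed or not visible in the container.

```python
import ctypes
import json
from pathlib import Path

# ASSUMPTION: a typical manifest location; pass a different path if yours differs.
ASSUMED_ICD_MANIFEST = "/usr/share/vulkan/icd.d/nvidia_icd.json"


def icd_library_path(manifest_path):
    """Read the driver library name from a Vulkan ICD manifest (JSON)."""
    manifest = json.loads(Path(manifest_path).read_text())
    return manifest["ICD"]["library_path"]


def icd_library_loads(manifest_path=ASSUMED_ICD_MANIFEST):
    """Return (library_path, loaded), where loaded says whether dlopen succeeded.

    A bare library name is resolved through the normal dynamic-linker
    search path, the same way the process would find it at runtime.
    """
    library = icd_library_path(manifest_path)
    try:
        ctypes.CDLL(library)
    except OSError:
        return library, False
    return library, True


if __name__ == "__main__":
    try:
        library, ok = icd_library_loads()
        print(f"{library}: {'loadable' if ok else 'NOT loadable'}")
    except (OSError, KeyError, ValueError) as exc:
        # Missing file, missing key, or invalid JSON in the manifest.
        print(f"could not read ICD manifest: {exc}")
```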

JankowskiChristopher commented 5 months ago

Great, thanks a lot. In the meantime, I will try a different cluster; maybe the drivers there will work better.

StoneT2000 commented 4 months ago

Any update on this, @JankowskiChristopher?

JankowskiChristopher commented 4 months ago

Hi, sorry for the late reply. Unfortunately, I haven't had the opportunity to check on another cluster. I will write as soon as I do.

StoneT2000 commented 4 months ago

No problem. It is also worth trying the new ManiSkill v3 instead of ManiSkill 2 to see whether the problems get resolved.