Open JankowskiChristopher opened 5 months ago
Stack trace:
Current thread 0x00007f0a2f60d4c0 (most recent call first):
File "/opt/conda/lib/python3.8/site-packages/mani_skill2/sensors/camera.py", line 143 in __init__
File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/sapien_env.py", line 407 in _setup_cameras
File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/sapien_env.py", line 360 in reconfigure
File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/sapien_env.py", line 473 in reset
File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/sapien_env.py", line 178 in __init__
File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/pick_and_place/base_env.py", line 27 in __init__
File "/opt/conda/lib/python3.8/site-packages/mani_skill2/envs/pick_and_place/pick_cube.py", line 22 in __init__
File "/opt/conda/lib/python3.8/site-packages/mani_skill2/utils/registration.py", line 34 in make
File "/opt/conda/lib/python3.8/site-packages/mani_skill2/utils/registration.py", line 92 in make
File "/opt/conda/lib/python3.8/site-packages/gym/envs/registration.py", line 87 in make
File "/opt/conda/lib/python3.8/site-packages/gym/envs/registration.py", line 129 in make
File "/opt/conda/lib/python3.8/site-packages/gym/envs/registration.py", line 235 in make
File "<string>", line 1 in <module>
srun: error: steven: task 0: Segmentation fault (core dumped)
Unfortunately, Vulkan is required for ManiSkill 2. We are trying to remove that requirement in ManiSkill3 so that you can run state-based training without it (this also means you can't render videos of progress, though; you might have to execute the policy on a machine where rendering works).
@fbxiang can better debug this problem.
In the meantime, I will try to make another Docker container with a beta version of ManiSkill3 (it might include some general fixes that avoid some problems with Vulkan) and let you try running that to debug as well. I'll let you know when it is ready.
Thanks
I think the most common issue is that many servers are installed with "headless drivers". However, those drivers do more than run headless: they disable rendering completely (headless or not). While the best solution is to correct the driver installation, that may or may not be viable depending on who manages the cluster. In the future, we will provide a fallback to a CPU renderer.
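As a quick first check before touching the drivers, it can help to confirm that the Vulkan/EGL JSON files are actually visible and parseable from inside the environment (or container) where the crash happens. Below is a minimal sketch. The three paths are the usual NVIDIA locations and are an assumption here, so substitute the ones listed in the troubleshooting section. `check_json_files` is a hypothetical helper, not part of ManiSkill.

```python
import json
from pathlib import Path

# ASSUMPTION: typical NVIDIA file locations; replace with the paths
# from the troubleshooting section if they differ on your system.
ASSUMED_VULKAN_FILES = [
    "/usr/share/vulkan/icd.d/nvidia_icd.json",
    "/usr/share/glvnd/egl_vendor.d/10_nvidia.json",
    "/etc/vulkan/implicit_layer.d/nvidia_layers.json",
]


def check_json_files(paths):
    """Return {path: status}; status is 'ok', 'missing', or 'unreadable or invalid JSON'."""
    report = {}
    for p in paths:
        path = Path(p)
        if not path.is_file():
            report[str(p)] = "missing"
            continue
        try:
            # Parsing catches truncated or corrupted files, not just absent ones.
            json.loads(path.read_text())
            report[str(p)] = "ok"
        except (OSError, ValueError):
            report[str(p)] = "unreadable or invalid JSON"
    return report


if __name__ == "__main__":
    for path, status in check_json_files(ASSUMED_VULKAN_FILES).items():
        print(f"{status:>26}  {path}")
```

If every file reports "ok" and the segfault persists, the files themselves are probably not the problem, which would point back toward the driver installation described above.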
Great, thanks a lot. In the meantime I will try a different cluster; maybe the drivers there will work better.
Any update on this @JankowskiChristopher ?
Hi, sorry for the late reply. Unfortunately I haven't had the opportunity to check on another cluster. I will write as soon as I do.
No problem. It is also worth trying the new ManiSkill v3 instead of ManiSkill 2 to see whether the problems get resolved.
Hi, I have a question similar to #102. I am trying to run ManiSkill2 on a Slurm cluster with headless GPUs. Because I do not have sudo, I cannot copy the recommended JSON files from the troubleshooting section, so I am relying on the official Docker image, run through Singularity, which is available on the cluster. I am trying to run a minimal example like this:
Unfortunately I get an error:
Running Singularity without the --nv flag gives me the same result. Do you know how I can fix it, or how I can test why it is not working? Maybe I should run this with Singularity differently? I tested this command on Titan V and RTX 2080 Ti GPUs. In the Singularity container I correctly have all 3 JSON files you mention in the troubleshooting section.
I would like to add that I am working on a reinforcement learning project with states, so I guess I do not need rendering. Maybe this is a workaround for Vulkan? I asked our admin to install Vulkan on our cluster, so it is available there, but unfortunately the JSON files you mention in troubleshooting are not available due to conflicts with other cluster functionality. That's why I wanted to rely on Singularity.
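One way to test why it is not working, without taking down the whole job, is to run the crashing call in a child interpreter with Python's built-in `faulthandler` enabled and inspect the exit status. The sketch below assumes a POSIX system, where a negative return code means the child was killed by that signal. `run_isolated` and `describe` are hypothetical helper names, and the `PickCube-v0` snippet is an assumed stand-in for the call that crashes.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=120):
    """Run `code` in a child Python with faulthandler on; return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def describe(returncode):
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with code {returncode}"


if __name__ == "__main__":
    # ASSUMPTION: replace this snippet with the exact call that segfaults for you.
    snippet = "import gym, mani_skill2.envs; gym.make('PickCube-v0', obs_mode='state')"
    rc, err = run_isolated(snippet)
    print(describe(rc))
    print(err[-2000:])  # tail of the faulthandler traceback, if any
```

Running this once with and once without --nv (and on each GPU type) gives a compact way to compare where the crash occurs; if the traceback always ends in the camera setup, the failure is in renderer initialization rather than in the task code.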