Closed KeAWang closed 5 years ago
Sorry for getting back to this late. Unfortunately, this might well be something that's dependent on the Nvidia driver version. Are you in a position to test it on an older driver? We usually run on driver version 390.87 here.
Also, it's possible that Conda might have an effect too. Would you be able to see if the issue persists outside of a Conda environment?
Sorry I couldn't be of more help immediately, but any additional data point would be really helpful for us here!
I've run into what I believe to be the same issue. Stacktrace below. I haven't figured out the logic of why this happens — I've run my code on two seemingly-identical DGX machines in the cluster (both Ubuntu 18.04.1, Driver Version: 410.79) and had it segfault on one but not the other.
File "/private/home/willwhitney/code/TD3-clones/manipulator_debug4_policy_nameTD3_seed0/main.py", line 12, in <module>
import dm_control2gym
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control2gym/__init__.py", line 3, in <module>
from dm_control2gym import wrapper
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control2gym/wrapper.py", line 2, in <module>
from dm_control import suite
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/suite/__init__.py", line 28, in <module>
from dm_control.suite import acrobot
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/suite/acrobot.py", line 24, in <module>
from dm_control import mujoco
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/mujoco/__init__.py", line 18, in <module>
from dm_control.mujoco.engine import action_spec
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/mujoco/engine.py", line 43, in <module>
from dm_control import render
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/render/__init__.py", line 70, in <module>
Renderer = import_func()
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/render/__init__.py", line 35, in _import_egl
from dm_control.render.pyopengl.egl_renderer import EGLContext
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/render/pyopengl/egl_renderer.py", line 62, in <module>
EGL_DISPLAY = create_initialized_headless_egl_display()
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/render/pyopengl/egl_renderer.py", line 54, in create_initialized_headless_egl_display
initialized = EGL.eglInitialize(display, None, None)
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/OpenGL/platform/baseplatform.py", line 402, in __call__
return self( *args, **named )
File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/OpenGL/error.py", line 232, in glCheckError
baseOperation = baseOperation,
OpenGL.error.GLError: GLError(
err = 12290,
baseOperation = eglInitialize,
cArguments = (
<OpenGL._opaque.EGLDisplay_pointer object at 0x7f619f17ebf8>,
None,
None,
),
result = 0
)
It might be nice to have the DISABLE_MUJOCO_RENDERING flag back as an escape hatch for this kind of problem.
@willwhitney Thank you for the stack trace.
12290 is 0x3002, which is EGL_BAD_ACCESS (https://www.khronos.org/registry/EGL/api/EGL/egl.h). This isn't documented as an error that is raised by eglInitialize (https://www.khronos.org/registry/EGL/sdk/docs/man/html/eglInitialize.xhtml). Will continue to investigate.
We can reintroduce the render disable flag -- that seems to be an oversight rather than a deliberate decision.
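For anyone decoding similar numbers from a GLError, the lookup above can be sketched in a few lines. The table copies the error constants from the Khronos egl.h header linked above; the helper name describe_egl_error is made up for this illustration and is not part of PyOpenGL or dm_control.

```python
# EGL error codes as defined in the Khronos egl.h header (0x3000 to 0x300E).
EGL_ERROR_NAMES = {
    0x3000: "EGL_SUCCESS",
    0x3001: "EGL_NOT_INITIALIZED",
    0x3002: "EGL_BAD_ACCESS",
    0x3003: "EGL_BAD_ALLOC",
    0x3004: "EGL_BAD_ATTRIBUTE",
    0x3005: "EGL_BAD_CONFIG",
    0x3006: "EGL_BAD_CONTEXT",
    0x3007: "EGL_BAD_CURRENT_SURFACE",
    0x3008: "EGL_BAD_DISPLAY",
    0x3009: "EGL_BAD_MATCH",
    0x300A: "EGL_BAD_NATIVE_PIXMAP",
    0x300B: "EGL_BAD_NATIVE_WINDOW",
    0x300C: "EGL_BAD_PARAMETER",
    0x300D: "EGL_BAD_SURFACE",
    0x300E: "EGL_CONTEXT_LOST",
}


def describe_egl_error(code: int) -> str:
    """Hypothetical helper: map a decimal EGL error code to 'NAME (0xHEX)'."""
    name = EGL_ERROR_NAMES.get(code, "unknown EGL error")
    return f"{name} (0x{code:04X})"


# The value reported as `err = 12290` in the stack trace above.
print(describe_egl_error(12290))  # EGL_BAD_ACCESS (0x3002)
```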
The error seems to happen stochastically both for Ubuntu 16.04.4 + Driver Version 396.51 and for Ubuntu 18.04.1 + Driver Version 410.79.
Similarly it happens unpredictably across V100 and GP100 cards.
The only pattern I've been able to see is that the segfault has only happened on jobs dispatched from Slurm, not on interactive jobs (even on the same machine). This could be chance though. LD_LIBRARY_PATH and PATH seem to be the same whether interactive or Slurm.
Any other experiments I can run to help understand what's going on?
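One experiment that might narrow this down is to compare the whole environment of an interactive job and a Slurm job, not only LD_LIBRARY_PATH and PATH. The sketch below is a minimal illustration of that idea; the function names and the JSON file names in the comments are invented for this example, and it assumes nothing about what the scheduler actually sets.

```python
import json
import os


def snapshot_env(path):
    """Write the current process environment to a JSON file (run once per job type)."""
    with open(path, "w") as f:
        json.dump(dict(os.environ), f, indent=2, sort_keys=True)


def diff_env(a, b):
    """Return {name: (value_in_a, value_in_b)} for every variable that differs."""
    names = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in names if a.get(k) != b.get(k)}


# Hypothetical usage: call snapshot_env("interactive.json") in an interactive
# session and snapshot_env("slurm.json") inside a batch job, then load both
# files with json.load and pass the two dicts to diff_env.
example = diff_env({"PATH": "/usr/bin", "A": "1"}, {"PATH": "/usr/bin", "A": "2", "B": "3"})
print(example)  # {'A': ('1', '2'), 'B': (None, '3')}
```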
@willwhitney As a workaround are you able to use OSMesa instead of EGL (MUJOCO_GL=osmesa)? Switching to software rendering seems preferable to disabling rendering altogether.
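Since the stack trace above shows the renderer being chosen while dm_control is imported, the variable has to be set before that import happens. A minimal sketch of doing this from Python rather than the shell follows; select_render_backend is a hypothetical helper, and the set of accepted values (glfw, egl, osmesa) is an assumption based on the backends mentioned in this thread.

```python
import os

# Assumed set of backend names; only egl, glfw and osmesa are mentioned in this thread.
ALLOWED_BACKENDS = {"glfw", "egl", "osmesa"}


def select_render_backend(backend="osmesa"):
    """Hypothetical helper: set MUJOCO_GL before dm_control is first imported."""
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    os.environ["MUJOCO_GL"] = backend
    return backend


select_render_backend("osmesa")

# Import dm_control only after the variable is set, for example:
# from dm_control import suite
```

Setting the variable in the job script (export MUJOCO_GL=osmesa) has the same effect and avoids any ordering concerns inside the Python code.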
@saran-t Hello, I encountered some errors while running EGL rendering. My environment is ubuntu18, NVIDIA driver 390, Python 3.6. I didn't have similar error messages when rendering using glfw. The following is a description of my problem.
zzyx@zzy-Vostro-5560:~/planet-master$ python3 '/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/explore.py'
Traceback (most recent call last):
File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/explore.py", line 23, in <module>
from dm_control import suite
File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/__init__.py", line 28, in <module>
from dm_control.suite import acrobot
File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/acrobot.py", line 24, in <module>
from dm_control import mujoco
File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/mujoco/__init__.py", line 18, in <module>
from dm_control.mujoco.engine import action_spec
File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/mujoco/engine.py", line 43, in <module>
from dm_control import _render
File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/_render/__init__.py", line 63, in <module>
Renderer = import_func() # pylint: disable=invalid-name
File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/_render/__init__.py", line 34, in _import_egl
from dm_control._render.pyopengl.egl_renderer import EGLContext
File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/_render/pyopengl/egl_renderer.py", line 64, in <module>
raise ImportError('Cannot initialize a headless EGL display.')
ImportError: Cannot initialize a headless EGL display.
I tried to call dm-control separately for testing. Running errors are the same.
I tried to track down where the error occurs. Since I don't know what the correct values should be, I'm not sure whether what follows is the root of the problem.
In create_initialized_headless_egl_display() (/dm_control/_render/pyopengl/egl_renderer.py), the return value is EGL.EGL_NO_DISPLAY, and the return value of EGL.eglQueryDevicesEXT() is an empty list.
Tracing further, inside EGL.eglQueryDevicesEXT() (/dm_control/_render/pyopengl/egl_ext.py), the value of num_devices = EGL.EGLint() is 0.
Finally, thank you for this software; it has been a great help in my simulation experiments.
@alimuldal I'd also had trouble getting OSMesa rendering working on my cluster, but I've got it going now. Slow rendering is definitely better than no rendering, but having a switch to totally bypass these issues seems useful too.
For my purposes not having any rendering for large-scale experiments on the cluster in exchange for not putzing about with dependencies is an OK tradeoff. I'm not going to watch videos from 100 experiments anyway and I can render as needed on a local machine.
I'm having the same problem (EGL_BAD_ACCESS, err = 12290) running several 1-GPU jobs on a machine with 8 GPUs. I think the reason it is happening is that one job blocks some EGL resource and the other jobs can't access it. Perhaps it is because MuJoCo is opening a context on one particular GPU (say GPU:0) instead of having separate contexts for each GPU.
@1nadequacy Your comment gives me a working hypothesis for this. The multi-GPU machines that we use implement device isolation. This loop here iterates through each device and returns the first one where a display can be created. In our setup, a job is guaranteed dedicated access if the display can be created.
One thing you could try is to modify that loop and force it to skip the first n devices (i.e. for display in displays[n:]) and see if any of them works. You might have to manually try with different n up to the number of devices present on your machine.
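The skip-the-first-n idea can be sketched independently of EGL. In the snippet below, first_usable_display and fake_open are hypothetical stand-ins, not dm_control functions: open_display represents whatever creates and initializes a display for one device, and it is assumed to either raise (as PyOpenGL did with GLError in the trace above) or return None when the device cannot be used.

```python
def first_usable_display(devices, open_display, skip=0):
    """Return (index, display) for the first device at or after `skip` that opens, else None."""
    for index in range(skip, len(devices)):
        try:
            display = open_display(devices[index])
        except Exception:
            # Treat a raised error (e.g. EGL_BAD_ACCESS) as "device not usable".
            continue
        if display is not None:
            return index, display
    return None


def fake_open(device):
    """Stand-in for display creation: gpu0 is busy, gpu1 yields no display."""
    if device == "gpu0":
        raise RuntimeError("EGL_BAD_ACCESS")
    return None if device == "gpu1" else f"display-for-{device}"


print(first_usable_display(["gpu0", "gpu1", "gpu2"], fake_open))  # (2, 'display-for-gpu2')
```

Returning the index alongside the display makes it easy to log which device a job ended up on while trying different values of skip.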
Great, I just made a fix like this, and it works!
- initialized = EGL.eglInitialize(display, None, None)
- if EGL.eglGetError() == EGL.EGL_SUCCESS and initialized == EGL.EGL_TRUE:
-   break
- else:
-   display = EGL.EGL_NO_DISPLAY
+ try:
+   initialized = EGL.eglInitialize(display, None, None)
+   if EGL.eglGetError() == EGL.EGL_SUCCESS and initialized == EGL.EGL_TRUE:
+     break
+ except:
+   pass
Let me know if you want me to create a PR?
Thanks for confirming! I've already submitted a fix in our internal repository. This will be made available when we do our next code push later this week.
I'm getting a segmentation fault as soon as I do from dm_control import suite.
The difference between my issue and https://github.com/deepmind/dm_control/issues/2 is that:
Some other details: I'm on glibc 2.23 and I'm using Python 3.6.8 in a conda environment.
The first few frames of a gdb backtrace:
This seems to suggest that it's an issue with libnvidia-glsi.so.396.37 or libEGL_nvidia.so.0.
For now, I've reverted to using libosmesa and MUJOCO_GL=osmesa.