google-deepmind / dm_control

Google DeepMind's software stack for physics-based simulation and Reinforcement Learning environments, using MuJoCo.
Apache License 2.0

Segmentation fault when loading suite #70

Closed KeAWang closed 5 years ago

KeAWang commented 5 years ago

I'm getting a segmentation fault as soon as I do from dm_control import suite. The differences between my issue and https://github.com/deepmind/dm_control/issues/2 are:

  1. I'm on Ubuntu 16.04.5 while the other issue is on Ubuntu 14.04
  2. I'm already using libstdc++.so.6
  3. I'm using a headless system with EGL (instead of GLFW)
  4. The segfault happens immediately when I import suite instead of loading an environment

Some other details: I'm on glibc 2.23 and I'm using Python 3.6.8 in a conda environment

The first few frames of a gdb backtrace:

#0  0x00007fffedd3afd7 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.396.37
#1  0x00007fffedffd8f6 in ?? () from /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#2  0x00007fffedfa0511 in ?? () from /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#3  0x00007fffedfa165c in ?? () from /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#4  0x00007fffedfb697a in ?? () from /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#5  0x00007fffc7dc0ec0 in ffi_call_unix64 () from $HOME/miniconda3/envs/dm_control/lib/python3.6/lib-dynload/../../libffi.so.6
#6  0x00007fffc7dc087d in ffi_call () from $HOME/miniconda3/envs/dm_control/lib/python3.6/lib-dynload/../../libffi.so.6
#7  0x00007fffc7fd6ede in _call_function_pointer (argcount=3, resmem=0x7fffffff5290, restype=<optimized out>, atypes=0x7fffffff5230, avalues=0x7fffffff5260, 
    pProc=0x7ffff3a5a440 <eglInitialize>, flags=4353) from $HOME/miniconda3/envs/dm_control/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
#8  _ctypes_callproc () at <artificial>:1195
#9  0x00007fffc7fd7915 in PyCFuncPtr_call () from $HOME/miniconda3/envs/dm_control/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so

This seems to suggest that it's an issue with libnvidia-glsi.so.396.37 or libEGL_nvidia.so.0.

For now, I've reverted to using libosmesa and MUJOCO_GL=osmesa

saran-t commented 5 years ago

Sorry for getting back to this late. Unfortunately, this may well depend on the Nvidia driver version. Are you in a position to test it on an older driver? We usually run on driver version 390.87 here.

It's also possible that Conda has an effect. Would you be able to check whether the issue persists outside of a Conda environment?

Sorry I couldn't be of more help immediately, but any additional data point would be really helpful for us here!

willwhitney commented 5 years ago

I've run into what I believe to be the same issue; stack trace below. I haven't figured out why this happens: I've run my code on two seemingly identical DGX machines in the cluster (both Ubuntu 18.04.1, Driver Version 410.79) and had it segfault on one but not the other.

 File "/private/home/willwhitney/code/TD3-clones/manipulator_debug4_policy_nameTD3_seed0/main.py", line 12, in <module>
    import dm_control2gym
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control2gym/__init__.py", line 3, in <module>
    from dm_control2gym import wrapper
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control2gym/wrapper.py", line 2, in <module>
    from dm_control import suite
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/suite/__init__.py", line 28, in <module>
    from dm_control.suite import acrobot
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/suite/acrobot.py", line 24, in <module>
    from dm_control import mujoco
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/mujoco/__init__.py", line 18, in <module>
    from dm_control.mujoco.engine import action_spec
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/mujoco/engine.py", line 43, in <module>
    from dm_control import render
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/render/__init__.py", line 70, in <module>
    Renderer = import_func()
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/render/__init__.py", line 35, in _import_egl
    from dm_control.render.pyopengl.egl_renderer import EGLContext
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/render/pyopengl/egl_renderer.py", line 62, in <module>
    EGL_DISPLAY = create_initialized_headless_egl_display()
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/dm_control/render/pyopengl/egl_renderer.py", line 54, in create_initialized_headless_egl_display
    initialized = EGL.eglInitialize(display, None, None)
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/OpenGL/platform/baseplatform.py", line 402, in __call__
    return self( *args, **named )
  File "/private/home/willwhitney/anaconda3/lib/python3.6/site-packages/OpenGL/error.py", line 232, in glCheckError
    baseOperation = baseOperation,
OpenGL.error.GLError: GLError(
    err = 12290,
    baseOperation = eglInitialize,
    cArguments = (
        <OpenGL._opaque.EGLDisplay_pointer object at 0x7f619f17ebf8>,
        None,
        None,
    ),
    result = 0
)

willwhitney commented 5 years ago

It might be nice to have the DISABLE_MUJOCO_RENDERING flag back as an escape hatch for this kind of problem.

saran-t commented 5 years ago

@willwhitney Thank you for the stack trace.

12290 is 0x3002, which is EGL_BAD_ACCESS (https://www.khronos.org/registry/EGL/api/EGL/egl.h). This isn't documented as an error that is raised by eglInitialize (https://www.khronos.org/registry/EGL/sdk/docs/man/html/eglInitialize.xhtml). Will continue to investigate.
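
For reference, the value can be checked directly against PyOpenGL's EGL constants (a quick sketch, assuming the PyOpenGL package that dm_control uses for rendering is installed):

# Decode the error code reported in the traceback above.
from OpenGL import EGL

err = 12290
print(hex(err))                   # 0x3002
print(err == EGL.EGL_BAD_ACCESS)  # True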

We can reintroduce the rendering-disable flag -- its removal seems to have been an oversight rather than a deliberate decision.

willwhitney commented 5 years ago

The error seems to happen stochastically both for Ubuntu 16.04.4 + Driver Version 396.51 and for Ubuntu 18.04.1 + Driver Version 410.79.

Similarly it happens unpredictably across V100 and GP100 cards.

The only pattern I've been able to see is that the segfault has only happened on jobs dispatched via Slurm, not on interactive jobs (even on the same machine). This could be chance, though. LD_LIBRARY_PATH and PATH seem to be the same whether the job is interactive or run through Slurm.
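
For comparison, a quick sketch for dumping the environment seen interactively versus under Slurm (any variables beyond LD_LIBRARY_PATH and PATH are just guesses at what might matter):

# Run once in an interactive shell and once as a Slurm job, then diff the output.
import os
for name in ('LD_LIBRARY_PATH', 'PATH', 'MUJOCO_GL', 'DISPLAY', 'CUDA_VISIBLE_DEVICES'):
  print(name, '=', os.environ.get(name))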

Any other experiments I can run to help understand what's going on?

alimuldal commented 5 years ago

@willwhitney As a workaround, are you able to use OSMesa instead of EGL (MUJOCO_GL=osmesa)? Switching to software rendering seems preferable to disabling rendering altogether.
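
For reference, the backend has to be selected before dm_control is imported, either by exporting the variable in the job script or at the very top of the Python entry point. A minimal sketch (the cartpole swingup task is just an example):

import os
os.environ['MUJOCO_GL'] = 'osmesa'  # must be set before dm_control is imported

from dm_control import suite
env = suite.load('cartpole', 'swingup')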

lunar24 commented 5 years ago

@saran-t Hello, I encountered some errors while running EGL rendering. My environment is Ubuntu 18, NVIDIA driver 390, Python 3.6. I didn't get similar error messages when rendering with GLFW. The following is a description of my problem.

Error message:
zzyx@zzy-Vostro-5560:~/planet-master$ python3 '/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/explore.py' 
Traceback (most recent call last):
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/explore.py", line 23, in 
    from dm_control import suite
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/__init__.py", line 28, in 
    from dm_control.suite import acrobot
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/acrobot.py", line 24, in 
    from dm_control import mujoco
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/mujoco/__init__.py", line 18, in 
    from dm_control.mujoco.engine import action_spec
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/mujoco/engine.py", line 43, in 
    from dm_control import _render
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/_render/__init__.py", line 63, in 
    Renderer = import_func()  # pylint: disable=invalid-name
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/_render/__init__.py", line 34, in _import_egl
    from dm_control._render.pyopengl.egl_renderer import EGLContext
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/_render/pyopengl/egl_renderer.py", line 64, in 
    raise ImportError('Cannot initialize a headless EGL display.')
ImportError: Cannot initialize a headless EGL display.

I tried calling dm_control separately for testing; the errors are the same. I also tried to track down where the error occurs. Since I don't know what the correct values should be, I'm not sure whether this is the key to the problem: in create_initialized_headless_egl_display() (/dm_control/_render/pyopengl/egl_renderer.py), my return value is EGL.EGL_NO_DISPLAY, and the return value of EGL.eglQueryDevicesEXT() is an empty list.

Tracing further, inside EGL.eglQueryDevicesEXT() (/dm_control/_render/pyopengl/egl_ext.py), my num_devices = EGL.EGLint() value is 0.
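
If it helps with debugging, the device query can also be exercised on its own. A rough sketch (importing this helper pulls in dm_control._render as well, so it may be easiest to run with MUJOCO_GL unset so that the EGL backend selection doesn't abort the import as in the traceback above):

from dm_control._render.pyopengl import egl_ext

devices = egl_ext.eglQueryDevicesEXT()
print('EGL devices reported by the driver:', len(devices))  # 0 in the failing case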

Finally, thank you; your software has been a great help with my simulation experiments.

willwhitney commented 5 years ago

@alimuldal I'd also had trouble getting OSMesa rendering working on my cluster, but I've got it going now. Slow rendering is definitely better than no rendering, but having a switch to totally bypass these issues seems useful too.

For my purposes, not having any rendering for large-scale experiments on the cluster, in exchange for not putzing about with dependencies, is an OK tradeoff. I'm not going to watch videos from 100 experiments anyway, and I can render as needed on a local machine.

denisyarats commented 5 years ago

I'm having the same problem (EGL_BAD_ACCESS, err = 12290) running several 1-GPU jobs on a machine with 8 GPUs. I think it's happening because one job blocks some EGL resource that the other jobs then can't access. Perhaps it's because MuJoCo opens a context on a particular GPU (say GPU 0) instead of a separate context for each GPU.

saran-t commented 5 years ago

@1nadequacy Your comment gives me a working hypothesis for this. The multi-GPU machines that we use implement device isolation. The loop in create_initialized_headless_egl_display iterates through each device and returns the first one for which a display can be created. In our setup, a job is guaranteed dedicated access to a device if the display can be created.

One thing you could try is to modify that loop to skip the first n devices (i.e. for display in displays[n:]) and see if any of them works. You might have to try different values of n manually, up to the number of devices on your machine.
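
A rough sketch of that change, paraphrasing the loop in /dm_control/_render/pyopengl/egl_renderer.py (the import and names are approximate, and skip_first_n is purely an illustrative parameter, not part of dm_control):

from dm_control._render.pyopengl import egl_ext as EGL  # as imported by egl_renderer.py

def create_initialized_headless_egl_display(skip_first_n=0):
  # Try each EGL device in turn, skipping the first `skip_first_n`, and
  # return the first display that initializes successfully.
  display = EGL.EGL_NO_DISPLAY
  for device in EGL.eglQueryDevicesEXT()[skip_first_n:]:
    display = EGL.eglGetPlatformDisplayEXT(
        EGL.EGL_PLATFORM_DEVICE_EXT, device, None)
    if display != EGL.EGL_NO_DISPLAY and EGL.eglGetError() == EGL.EGL_SUCCESS:
      initialized = EGL.eglInitialize(display, None, None)
      if EGL.eglGetError() == EGL.EGL_SUCCESS and initialized == EGL.EGL_TRUE:
        break
      else:
        display = EGL.EGL_NO_DISPLAY
  return display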

denisyarats commented 5 years ago

Great, I just made a fix like this, and it works!

-      initialized = EGL.eglInitialize(display, None, None)
-      if EGL.eglGetError() == EGL.EGL_SUCCESS and initialized == EGL.EGL_TRUE:
-        break
-      else:
-        display = EGL.EGL_NO_DISPLAY
+      try:
+        initialized = EGL.eglInitialize(display, None, None)
+        if EGL.eglGetError() == EGL.EGL_SUCCESS and initialized == EGL.EGL_TRUE:
+          break
+      except:
+        pass

Let me know if you want me to create a PR.

saran-t commented 5 years ago

Thanks for confirming! I've already submitted a fix in our internal repository. This will be made available when we do our next code push later this week.