Support selecting an `EGL_DEVICE` by UUID rather than by index

geyang commented 3 years ago

I am encountering an error when deploying dm_control in a managed HPC environment. Our admin decided to use UUIDs for the device names, which causes dm_control (and mujoco-py) to raise an error when parsing the available devices:

Traceback (most recent call last):
  File "/Users/ge/mit/dmc_gen/dmc_gen_analysis/__init__.py", line 164, in thunk
  File "/home/gridsan/geyang/jaynes-mount/dmc_gen/2021-03-05/085031.707344/dmc_gen/dmc_gen/train.py", line 58, in train
    image_size=image_size,
  File "/home/gridsan/geyang/jaynes-mount/dmc_gen/2021-03-05/085031.707344/dmc_gen/dmc_gen/wrappers.py", line 28, in make_env
    frame_skip=action_repeat
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dmc2gym/dmc2gym/__init__.py", line 55, in make
    return gym.make(env_id)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 145, in make
    return registry.make(id, **kwargs)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 90, in make
    env = spec.make(**kwargs)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 59, in make
    cls = load(self.entry_point)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 18, in load
    mod = importlib.import_module(mod_name)
  File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dmc2gym/dmc2gym/wrappers.py", line 2, in <module>
    from dm_control import suite
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/suite/__init__.py", line 28, in <module>
    from dm_control.suite import acrobot
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/suite/acrobot.py", line 24, in <module>
    from dm_control import mujoco
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/mujoco/__init__.py", line 18, in <module>
    from dm_control.mujoco.engine import action_spec
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/mujoco/engine.py", line 44, in <module>
    from dm_control import _render
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/_render/__init__.py", line 67, in <module>
    Renderer = import_func()  # pylint: disable=invalid-name
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/_render/__init__.py", line 36, in _import_egl
    from dm_control._render.pyopengl.egl_renderer import EGLContext
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/_render/pyopengl/egl_renderer.py", line 69, in <module>
    EGL_DISPLAY = create_initialized_headless_egl_display()
  File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/_render/pyopengl/egl_renderer.py", line 51, in create_initialized_headless_egl_display
    devices = [devices[int(os.environ["CUDA_VISIBLE_DEVICES"])]]
ValueError: invalid literal for int() with base 10: 'GPU-a15dc796-f172-2e06-2283-cea8159bf118'
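
The last frame of the traceback can be reproduced in isolation. The following is a minimal sketch, not code from dm_control or mujoco-py: `split_visible_devices` is a hypothetical helper, and it assumes `CUDA_VISIBLE_DEVICES` holds a comma-separated list whose entries are either integer indices or `GPU-...` UUID strings.

```python
# Illustrative sketch only: `split_visible_devices` is a hypothetical helper,
# not part of dm_control or mujoco-py.


def split_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES string into (integer indices, other IDs)."""
    indices, others = [], []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if entry.isascii() and entry.isdigit():
            indices.append(int(entry))
        else:
            # Anything that is not a plain non-negative integer (for example
            # a GPU UUID) is kept as a string instead of being passed to int().
            others.append(entry)
    return indices, others


uuid_value = "GPU-a15dc796-f172-2e06-2283-cea8159bf118"

# Reproduce the failure from the traceback: a UUID is not a base-10 integer.
try:
    int(uuid_value)
    int_parse_failed = False
except ValueError:
    int_parse_failed = True

indices, uuids = split_visible_devices(uuid_value)
```

Separating the two kinds of entry up front lets the calling code decide what to do with a UUID instead of crashing at import time.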

The reasoning behind using device UUIDs is explained in the following email (and issue):

FYI, this is a known error where dm_control assumes CUDA_VISIBLE_DEVICES is an integer. We’re using NVIDIA’s UUID API to set the device names to the UUID, rather than the default. The problem with the default naming scheme (0, 1, etc.) is that it is not consistent. What’s listed as GPU 0 might change even within a job, which you can imagine would cause major problems if you have two people on a node, each allocated one GPU. This sort of alludes to what I’m talking about, but doesn’t get into using the UUIDs instead: https://stackoverflow.com/questions/26123252/inconsistency-of-ids-between-nvidia-smi-l-and-cudevicegetname. It’s a big oversight on Ray’s part to assume that the GPU names are integers; both TensorFlow and PyTorch don’t seem to have a problem with it. I think what they need to understand is that in a shared environment you have to make sure people use only the GPU that’s been allocated to them, and the way to do that is to use the UUID.

alimuldal commented 3 years ago

I'm not totally sure what this has to do with dm_control, since we don't reference CUDA_VISIBLE_DEVICES anywhere in our code - is this a local modification you've made?

geyang commented 3 years ago

Hey Alistair! It is great to hear back from you!

When I request a single GPU via gres=gpu:volta:1 on Slurm, only one device is available when we inspect it with nvidia-smi, which is why I set CUDA_VISIBLE_DEVICES=0 in my run script.

The same error also arises on master of the dm_control code base; I was on an older version.

site-packages/dm_control/dm_control/_render/pyopengl/egl_renderer.py", line 52, in create_initialized_headless_egl_display
  else:
    device_idx = int(selected_device)
    if not 0 <= device_idx < len(all_devices):
      raise RuntimeError(
          f'EGL_DEVICE_ID must be an integer between 0 and '
          f'{len(all_devices) - 1} (inclusive), got {device_idx}.')
    candidates = all_devices[device_idx:device_idx + 1]

The relevant code is here: https://github.com/deepmind/dm_control/blob/master/dm_control/_render/pyopengl/egl_renderer.py#L50

def create_initialized_headless_egl_display():
  """Creates an initialized EGL display directly on a device."""
  all_devices = EGL.eglQueryDevicesEXT()
  selected_device = os.environ.get('EGL_DEVICE_ID', None)
  if selected_device is None:
    candidates = all_devices
  else:
    device_idx = int(selected_device)
    if not 0 <= device_idx < len(all_devices):
      raise RuntimeError(
          f'EGL_DEVICE_ID must be an integer between 0 and '
          f'{len(all_devices) - 1} (inclusive), got {device_idx}.')
    candidates = all_devices[device_idx:device_idx + 1]

The explanation given to us by the MIT SuperCloud admin is that they use non-integer device IDs because integer device IDs can change during a job when a single node is shared between jobs. I have included their response above.

This is not something we can change as users, and their reasons seem sound, so we are trying to figure out whether anything can be done to remove the requirement that device IDs be integers.
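
One way the integer requirement could be relaxed is sketched below. This is not what dm_control does: `select_candidates` is a hypothetical helper that mirrors the index-selection logic quoted above but, as an assumption made for illustration, falls back to trying every device when the value is not an integer. Plain Python lists stand in for the result of `EGL.eglQueryDevicesEXT()` so the sketch runs without EGL.

```python
# Illustrative sketch only: `select_candidates` is a hypothetical helper and
# the fallback for non-integer values is an assumption, not upstream behaviour.


def select_candidates(all_devices, selected_device):
    """Return the devices to try, tolerating a non-integer device ID."""
    if selected_device is None:
        # Same as the quoted code: no selection means every device is tried.
        return list(all_devices)
    try:
        device_idx = int(selected_device)
    except ValueError:
        # A UUID such as 'GPU-a15dc796-...' lands here. Rather than raising,
        # fall back to trying every enumerated device in order.
        return list(all_devices)
    if not 0 <= device_idx < len(all_devices):
        raise RuntimeError(
            f'EGL_DEVICE_ID must be an integer between 0 and '
            f'{len(all_devices) - 1} (inclusive), got {device_idx}.')
    return list(all_devices[device_idx:device_idx + 1])


# Placeholder strings stand in for EGL device handles.
fake_devices = ["device0", "device1"]
by_index = select_candidates(fake_devices, "1")
by_uuid = select_candidates(
    fake_devices, "GPU-a15dc796-f172-2e06-2283-cea8159bf118")
```

The trade-off is that a UUID no longer pins rendering to one specific device; it only avoids the crash, so isolation would still have to come from the scheduler.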

alimuldal commented 3 years ago

I see. Is there a reason you need to specify a particular device ID to use for rendering? The default behaviour is to use the first device that can be successfully initialised.

As you can see from the code linked above, we use eglQueryDevicesEXT to enumerate the available devices. I'm not aware of any API methods that would allow us to obtain a display device by UUID, so I'm not sure what we can do about this from our end.
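
A possible user-side workaround is to translate the UUID into an integer index before setting `EGL_DEVICE_ID`. The sketch below parses the output of `nvidia-smi --query-gpu=index,uuid --format=csv,noheader`; the helper names are hypothetical, and it rests on an unverified assumption that the indices reported by `nvidia-smi` line up with the order returned by `eglQueryDevicesEXT`, which may not hold on a given system.

```python
# Illustrative sketch only: helper names are hypothetical, and the mapping
# from nvidia-smi indices to EGL device indices is an unverified assumption.
import subprocess


def parse_uuid_index_map(csv_text):
    """Parse 'index, uuid' lines into a {uuid: index} dictionary."""
    mapping = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index_str, uuid = (field.strip() for field in line.split(",", 1))
        mapping[uuid] = int(index_str)
    return mapping


def query_uuid_index_map():
    """Run nvidia-smi and return its {uuid: index} mapping."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
        check=True, capture_output=True, text=True).stdout
    return parse_uuid_index_map(output)


# Made-up sample text in the 'index, uuid' shape, so the parser can be
# exercised without a GPU; the second UUID is invented for illustration.
sample_output = (
    "0, GPU-a15dc796-f172-2e06-2283-cea8159bf118\n"
    "1, GPU-00000000-0000-0000-0000-000000000000\n"
)
sample_map = parse_uuid_index_map(sample_output)
```

If the assumption about ordering holds, the looked-up index could be exported as `EGL_DEVICE_ID` in the run script before dm_control is imported.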

geyang commented 3 years ago

Hi Alistair, let me investigate a bit and I will get back to you!

Support selecting an `EGL_DEVICE` by UUID rather than by index #175