Open geyang opened 3 years ago
I'm not totally sure what this has to do with dm_control, since we don't reference CUDA_VISIBLE_DEVICES anywhere in our code - is this a local modification you've made?
Hey Alistair! It is great to hear back from you!
When I request a single GPU via gres=gpu:volta:1 on Slurm, only one device is available if we inspect via nvidia-smi, which is why I set CUDA_VISIBLE_DEVICES=0 in my run script.
The same error actually arises on master of the dm_control code base; I was on an older version.
site-packages/dm_control/dm_control/_render/pyopengl/egl_renderer.py", line 52, in create_initialized_headless_egl_display
  else:
    device_idx = int(selected_device)
    if not 0 <= device_idx < len(all_devices):
      raise RuntimeError(
          f'EGL_DEVICE_ID must be an integer between 0 and '
          f'{len(all_devices) - 1} (inclusive), got {device_idx}.')
    candidates = all_devices[device_idx:device_idx + 1]
The relevant code is here: https://github.com/deepmind/dm_control/blob/master/dm_control/_render/pyopengl/egl_renderer.py#L50
def create_initialized_headless_egl_display():
  """Creates an initialized EGL display directly on a device."""
  all_devices = EGL.eglQueryDevicesEXT()
  selected_device = os.environ.get('EGL_DEVICE_ID', None)
  if selected_device is None:
    candidates = all_devices
  else:
    device_idx = int(selected_device)
    if not 0 <= device_idx < len(all_devices):
      raise RuntimeError(
          f'EGL_DEVICE_ID must be an integer between 0 and '
          f'{len(all_devices) - 1} (inclusive), got {device_idx}.')
    candidates = all_devices[device_idx:device_idx + 1]
The reason given to us by the MIT SuperCloud admins is that they need non-integer device IDs, because the device ID can change during the same job when a single node is shared between jobs. I have attached their response above.
This is not something we can change as users, and their reasons seem sound, so we are trying to figure out whether anything can be done to remove the requirement that device IDs be integers.
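To make the request concrete, here is a rough sketch of one possible relaxation, not a proposal for the actual fix. It mirrors the selection logic quoted above but treats a value that does not parse as an integer (for example a GPU UUID) as "no preference" instead of raising. `select_candidates` is a hypothetical helper, not part of dm_control, and the fallback policy is my assumption. It takes the device list as a plain argument so it runs without EGL.

```python
def select_candidates(all_devices, selected_device=None):
  """Hypothetical variant of the device selection quoted above.

  Not dm_control API. A value that does not parse as an integer
  (e.g. a GPU UUID) falls back to trying every device.
  """
  if selected_device is None:
    return list(all_devices)
  try:
    device_idx = int(selected_device)
  except ValueError:
    # Assumption: a UUID-style value carries no usable index,
    # so behave as if no device had been requested.
    return list(all_devices)
  if not 0 <= device_idx < len(all_devices):
    raise RuntimeError(
        f'EGL_DEVICE_ID must be an integer between 0 and '
        f'{len(all_devices) - 1} (inclusive), got {device_idx}.')
  return list(all_devices[device_idx:device_idx + 1])
```

With this, an integer ID still selects exactly one device and an out-of-range integer still raises, so only the non-integer case changes.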
I see. Is there a reason you need to specify a particular device ID to use for rendering? The default behaviour is to use the first device that can be successfully initialised.
As you can see from the code linked above, we use eglQueryDevicesEXT to enumerate the available devices. I'm not aware of any API methods that would allow us to obtain a display device by UUID, so I'm not sure what we can do about this from our end.
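For illustration, the default behaviour mentioned above (use the first device that can be successfully initialised) can be pictured roughly as the loop below. This is a simplified sketch, not the actual dm_control implementation: `first_initialized_display` and the `try_initialize` callback are hypothetical names standing in for the real EGL display setup, which is omitted here.

```python
def first_initialized_display(candidates, try_initialize):
  """Returns the first display that initialises, or None.

  `try_initialize` is a hypothetical callable that takes a device and
  returns a display object on success or None on failure.
  """
  for device in candidates:
    display = try_initialize(device)
    if display is not None:
      return display
  return None
```

If no device ID is requested, every enumerated device is a candidate, so a job that can only see one usable GPU would still end up on that GPU.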
Hi Alistair, let me investigate a bit, will get back to you!
I am encountering an error when deploying dm_control in a managed HPC environment. Our admin decided to use UUIDs for the device names, which causes dm_control (and mujoco-py) to raise an error when parsing the available devices:
The reasoning behind these device UUIDs is explained in the following email (and issue)