microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] CUDA_VISIBLE_DEVICES is not parsed correctly #5278

Open snowyday opened 5 months ago

snowyday commented 5 months ago

When the CUDA_VISIBLE_DEVICES environment variable is set, the DeepSpeed launcher expects every entry to be an integer. However, according to the NVIDIA CUDA documentation (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars), CUDA_VISIBLE_DEVICES can also contain GPU UUIDs, such as CUDA_VISIBLE_DEVICES=GPU-8932f937.

I encountered an error with the DeepSpeed launcher script when I set CUDA_VISIBLE_DEVICES to a UUID value. The relevant code in the launcher script is located at https://github.com/microsoft/DeepSpeed/blob/b112c99ea8e09eb06ada0d60a3687983cb8c4bd0/deepspeed/launcher/runner.py#L295 and the error is as follows:

File "/XXX/deepspeed/launcher/runner.py", line 295, in <listcomp>
    slots = [int(x) for x in slots.split(SLOT_SEP)]
ValueError: invalid literal for int() with base 10: 'GPU-8932f937'
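
The failure can be reproduced without DeepSpeed. This is a minimal sketch, assuming SLOT_SEP is a comma (the traceback does not show its value); the wrapper name parse_slots_as_ints is hypothetical and only wraps the list comprehension from the traceback:

```python
from typing import List

SLOT_SEP = ","  # assumption: the traceback does not show the actual value


def parse_slots_as_ints(slots: str) -> List[int]:
    # Same list comprehension as the failing line in the traceback.
    return [int(x) for x in slots.split(SLOT_SEP)]


# Integer indices parse without error.
print(parse_slots_as_ints("0,1"))

# A UUID entry raises ValueError, as in the report.
try:
    parse_slots_as_ints("GPU-8932f937")
except ValueError as err:
    print(f"ValueError: {err}")
```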

For context: in distributed computing environments, such as those managed by OpenPBS, the use of the command chgroup may result in CUDA_VISIBLE_DEVICES being populated with UUIDs.

I am reporting this as an issue because the launcher script appears to support only integer values for CUDA_VISIBLE_DEVICES, which is narrower than what the NVIDIA documentation allows. A similar problem has already been reported for PyTorch at https://github.com/pytorch/pytorch/issues/90543.

It would be beneficial to update the DeepSpeed launcher script to accommodate UUIDs in addition to integers for the CUDA_VISIBLE_DEVICES environment variable to ensure compatibility and prevent such errors from occurring.
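
One possible shape for such a change is sketched below. It is not DeepSpeed's code: parse_device_ids is a hypothetical helper, the comma separator is an assumption, and whether the rest of the launcher can work with string identifiers is not confirmed here. Numeric entries become integers and anything else (for example a UUID) is kept as a string instead of raising:

```python
from typing import List, Union

SLOT_SEP = ","  # assumption: comma-separated entries


def parse_device_ids(value: str) -> List[Union[int, str]]:
    """Hypothetical helper: ints for numeric entries, strings for the rest."""
    ids: List[Union[int, str]] = []
    for item in value.split(SLOT_SEP):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        try:
            ids.append(int(item))
        except ValueError:
            ids.append(item)  # keep UUID-style entries verbatim
    return ids


print(parse_device_ids("0,1"))
print(parse_device_ids("GPU-8932f937"))
```

Code that later uses these values as list indices or ranks would still need to handle the string case.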

samihormi commented 1 month ago

Any updates on this?

samihormi commented 1 month ago

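Temporary workaround: run the following in your bash session before launching to replace the UUID with its integer index. Note that it converts only the first entry of CUDA_VISIBLE_DEVICES.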

# Get the first UUID from CUDA_VISIBLE_DEVICES
UUID=$(echo "$CUDA_VISIBLE_DEVICES" | cut -d',' -f1)
echo "DEVICES $CUDA_VISIBLE_DEVICES"
echo "UUID $UUID"

# Use nvidia-smi to find the corresponding integer ID
ID=$(nvidia-smi --id="$UUID" --query-gpu=index --format=csv,noheader)
echo "ID $ID"

# Set CUDA_VISIBLE_DEVICES to the integer ID
export CUDA_VISIBLE_DEVICES="$ID"
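
The same idea can be extended to every entry in CUDA_VISIBLE_DEVICES. The following is a rough sketch rather than a tested fix: the function names are hypothetical, it assumes nvidia-smi is on PATH, and it assumes that an abbreviated UUID should match any full UUID it is a prefix of. The parsing is kept separate from the nvidia-smi call so it can be checked without a GPU:

```python
import subprocess
from typing import Dict, List


def query_nvidia_smi() -> str:
    # Returns lines such as "0, GPU-<uuid>"; requires nvidia-smi on PATH.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_index_uuid_table(csv_text: str) -> Dict[str, int]:
    """Turn 'index, uuid' lines into a {uuid: index} mapping."""
    table: Dict[str, int] = {}
    for line in csv_text.strip().splitlines():
        index, uuid = (field.strip() for field in line.split(",", 1))
        table[uuid] = int(index)
    return table


def uuids_to_indices(visible: str, table: Dict[str, int]) -> List[int]:
    """Map each entry of a CUDA_VISIBLE_DEVICES string to an integer index."""
    indices: List[int] = []
    for entry in visible.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if entry.isdecimal():
            indices.append(int(entry))  # already an integer index
            continue
        # Assumption: an abbreviated UUID matches by prefix.
        matches = [idx for uuid, idx in table.items() if uuid.startswith(entry)]
        if len(matches) != 1:
            raise ValueError(f"cannot resolve {entry!r} to exactly one GPU")
        indices.append(matches[0])
    return indices


# Example use on a machine with GPUs (not executed here):
#   import os
#   table = parse_index_uuid_table(query_nvidia_smi())
#   ids = uuids_to_indices(os.environ["CUDA_VISIBLE_DEVICES"], table)
#   os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, ids))
```

One caveat: nvidia-smi numbers GPUs in PCI bus order, so the resulting indices should match CUDA's numbering only if CUDA_DEVICE_ORDER=PCI_BUS_ID is also set.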