Open snowyday opened 5 months ago
Any updates on this?
Temporary fix: add the following to your bash session to extract the integer ID from the UUID.
# Get the first UUID from CUDA_VISIBLE_DEVICES (any further entries are ignored)
UUID=$(echo "$CUDA_VISIBLE_DEVICES" | cut -d',' -f1)
echo "DEVICES $CUDA_VISIBLE_DEVICES"
echo "UUID $UUID"
# Use nvidia-smi to find the corresponding integer ID
ID=$(nvidia-smi --id="$UUID" --query-gpu=index --format=csv,noheader)
echo "ID $ID"
# Set CUDA_VISIBLE_DEVICES to the integer ID
export CUDA_VISIBLE_DEVICES="$ID"
It has come to my attention that when setting the CUDA_VISIBLE_DEVICES environment variable, there is an expectation for the value to be an integer. However, according to the NVIDIA CUDA documentation (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars), CUDA_VISIBLE_DEVICES can also accept UUIDs, such as CUDA_VISIBLE_DEVICES=GPU-8932f937.
I encountered an error with the DeepSpeed launcher script when I set CUDA_VISIBLE_DEVICES to a UUID value. The relevant code in the launcher script is located at https://github.com/microsoft/DeepSpeed/blob/b112c99ea8e09eb06ada0d60a3687983cb8c4bd0/deepspeed/launcher/runner.py#L295 and the error is as follows:
I would like to add some context to the previously mentioned issue regarding the handling of the CUDA_VISIBLE_DEVICES environment variable. In distributed computing environments, such as those managed by OpenPBS, the use of the chgroup command may result in CUDA_VISIBLE_DEVICES being populated with UUIDs.
I would like to report this as an issue since it appears that the current script only supports integer values for CUDA_VISIBLE_DEVICES, which contradicts the flexibility described in the NVIDIA documentation. This problem has already been observed in a PyTorch context, as seen in a similar issue reported at https://github.com/pytorch/pytorch/issues/90543.
It would be beneficial to update the DeepSpeed launcher script to accommodate UUIDs in addition to integers for the CUDA_VISIBLE_DEVICES environment variable, to ensure compatibility and prevent such errors from occurring.
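To make the request more concrete, here is a minimal sketch of how a launcher could accept both forms. This is not DeepSpeed's actual code: the function names `query_gpu_uuids` and `resolve_visible_devices` are hypothetical, and the sketch assumes `nvidia-smi` is on the PATH. It treats a non-integer entry as a UUID prefix, since the NVIDIA documentation allows abbreviated UUIDs as long as they identify a single GPU. MIG identifiers are not handled.

```python
import subprocess


def query_gpu_uuids():
    """Return {uuid: index} as reported by nvidia-smi (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    mapping = {}
    for line in out.strip().splitlines():
        # Each line looks like "0, GPU-8932f937-d72c-4106-c12f-20bd9faed9f6"
        index, uuid = (field.strip() for field in line.split(","))
        mapping[uuid] = int(index)
    return mapping


def resolve_visible_devices(value, uuid_to_index):
    """Translate a CUDA_VISIBLE_DEVICES string into a list of integer indices.

    Hypothetical helper: integer entries pass through unchanged, anything
    else is matched as a UUID prefix against uuid_to_index.
    """
    indices = []
    for token in (t.strip() for t in value.split(",")):
        if not token:
            continue  # skip empty entries such as a trailing comma
        if token.isdigit():
            indices.append(int(token))
            continue
        matches = [idx for uuid, idx in uuid_to_index.items()
                   if uuid.startswith(token)]
        if len(matches) != 1:
            raise ValueError(
                f"{token!r} matches {len(matches)} GPUs, expected exactly 1")
        indices.append(matches[0])
    return indices
```

On a GPU node this could be called as `resolve_visible_devices(os.environ["CUDA_VISIBLE_DEVICES"], query_gpu_uuids())`. One caveat: `nvidia-smi` indices follow PCI bus order, which may differ from CUDA's default device ordering unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set, so a real fix would need to account for that.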