Thanks. I think there is no hard reason we cannot do that. Just need to do proper parsing since nvidia-smi prints human-readable output.
As a tangential issue, I think it should also be possible to prevent Ray from making any GPU property checks via a flag that disables GPUs (i.e. sets their desired number to zero). After all, the user may have a crippled / hardened system without the nvidia-smi utility available (or on the system path).
Currently neither server startup option [1] prevents Ray from trying to access the root-owned folder (/proc/driver/nvidia/gpus), so the error [2] is still (wrongly) reported, even though setting the number of GPUs to 0 does allow the Ray servers to start correctly (which I verified using both available server startup methods).
[1] the --num-gpus switch of ray start --head and the num_gpus arg of the ray.init() Python method
[2] PermissionError: [Errno 13] Permission denied: '/proc/driver/nvidia/gpus'
Hmmm yeah, if num_gpus is hardcoded we shouldn't do any of the autodetection logic; that's not good.
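For illustration, a minimal sketch of that guard (hypothetical helper name, not Ray's actual code): skip autodetection entirely whenever num_gpus is given.

def resolve_num_gpus(num_gpus=None, autodetect=lambda: 0):
    if num_gpus is not None:
        return num_gpus  # user override: never probe the driver or /proc
    return autodetect()  # e.g. Ray's _autodetect_num_gpus()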
A closely related issue with the root requirement affects the dashboard server's access to the / folder, which is not necessarily accessible to the user running Ray Core (unlike ~/, which is much more likely to be accessible).
This prevents the Ray Dashboard (in the new UI) from measuring available disk resources, so zeros are currently (incorrectly) displayed in the Disk (/) and Disk(root) fields, because / is root-owned and thus appears to have no free disk space.
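For illustration, a hedged sketch of measuring disk usage against a user-accessible path instead of hard-coding / (illustrative only, not the dashboard's actual code):

import os
import shutil

# Report capacity for a directory the Ray user can actually access
# (e.g. the home or session temp directory) rather than "/".
usage = shutil.disk_usage(os.path.expanduser("~"))
print(usage.total, usage.used, usage.free)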
Fortunately, the task of parsing nvidia-smi output has already been accomplished in the GPUtil package (see its GitHub page), which ray already uses for this very purpose, but it fails silently if the package is not installed, without any warning or installation recommendation (which should probably be added to requirements.txt and definitely described in the docs):
if importlib.util.find_spec("GPUtil"):
    gpu_list = GPUtil.getGPUs()
    result = len(gpu_list)
    # SILENTLY FAILS ON MISSING GPUtil...
    # TODO: USE `else` AND PRINT WARNING AND RECOMMEND GPUtil INSTALLATION
elif sys.platform.startswith("linux"):
    # TRIES TO ACCESS ROOT-OWNED FOLDER HERE, THUS PRODUCING ERROR DESCRIBED HERE IN #28064
    [..]
elif sys.platform == "win32":
    [..]
return result
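For illustration, a rough sketch of the warning branch suggested in the TODO above (hypothetical wording and structure, not Ray's actual code):

import importlib.util
import logging
import sys

logger = logging.getLogger(__name__)

def _autodetect_num_gpus_sketch():
    # Prefer GPUtil when it is installed.
    if importlib.util.find_spec("GPUtil"):
        import GPUtil
        return len(GPUtil.getGPUs())
    # Otherwise warn instead of failing silently, then fall back to the
    # platform-specific probing that currently triggers the permission error.
    logger.warning(
        "GPUtil is not installed; falling back to /proc-based GPU detection, "
        "which may fail on hardened systems. Consider `pip install gputil`."
    )
    if sys.platform.startswith("linux"):
        ...  # existing /proc/driver/nvidia/gpus logic
    elif sys.platform == "win32":
        ...  # existing Windows logic
    return 0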
More info
Here's more color on how to safely and reliably detect the number of available GPUs, which can be used to improve the existing solution in _autodetect_num_gpus():
>>> import GPUtil
>>> GPUtil.getGPUs()
[<GPUtil.GPUtil.GPU object at 0x7f8ab7907f10>]
>>> gpus = GPUtil.getGPUs()

# a very liberal check, which will always succeed given the physical presence of a GPU, including cards fully utilized by noisy neighbors (no free VRAM and 100% load)
>>> gpu_avail = GPUtil.getAvailability(gpus, maxLoad=1.0, maxMemory=1.0, includeNan=False, excludeID=[], excludeUUID=[])
>>> gpu_avail
[1]

# versus an excessively conservative check, which will likely never succeed (there is always some minimal VRAM usage, even on headless servers)
>>> gpu_avail = GPUtil.getAvailability(gpus, maxLoad=0.0, maxMemory=0.0, includeNan=False, excludeID=[], excludeUUID=[])
>>> gpu_avail
[0]
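Putting the two checks together, a hedged sketch (hypothetical helper name, not Ray's actual code) of how the liberal availability check could back an autodetection helper:

import GPUtil

def autodetect_num_gpus_via_gputil():
    gpus = GPUtil.getGPUs()
    if not gpus:
        return 0
    # Count cards that are physically present, regardless of current load or VRAM use.
    availability = GPUtil.getAvailability(gpus, maxLoad=1.0, maxMemory=1.0)
    return sum(availability)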
Just need to do proper parsing since nvidia-smi prints human-readable output.
@mirekphd are you able to help contribute the solution to this? I think we're concerned about putting a small dependency like GPUtil into ray's core dependencies, but vendoring the library seems like a reasonable approach.
@wuisawesome sure, I can try a PR later this week. I even started drafting one, but lost momentum, seeing how well GPUtil
worked (detecting datacenter GPUs spanning two generations) and realizing it may become re-included as a requirement. But I agree it is rather too old and unmaintained to be included as such.
As a starting point I could try to re-use the existing solution from the UtilMonitor
class:
https://github.com/ray-project/ray/blob/b6765bb4f36dea63e1769bb15ad001586830379d/python/ray/tune/utils/util.py#L64-L65
https://github.com/ray-project/ray/blob/b6765bb4f36dea63e1769bb15ad001586830379d/python/ray/tune/utils/util.py#L36-L41
I think we're concerned about putting a small dependency like GPUtil into ray's core dependencies, but vendoring the library seems like a reasonable approach.
This would resolve the issue:
See also: https://github.com/ray-project/ray/issues/17914#issuecomment-1236095136
+1
Note that gputil fails if nvidia-smi exists and can be run, but the user is not admin or there is no nvidia card. Since ray already depends on gpustat, could we use that instead?
Hmm, it seems gpustat also does not work on windows if the card is not found. wookayin/gpustat#142
fwiw, https://github.com/wookayin/gpustat/issues/142 was solved.
@mattip what's the recommended action item here? update our gpustat version and change the implementation to use that? (also do you have a sense of how difficult this would be?)
The gpustat maintainer has been cooperative and it does not make sense to be using two different libraries in two parts of ray. So I would suggest using gpustat where available. Note gpustat only works with NVIDIA cards at the moment. I would think the implementation would be similar to other code already existing in ray.
That makes sense to me. +1 to doing it (assuming you have the bandwidth, because I don't at the moment :p)
OK
@mirekphd could you confirm that gpustat
is already in your ray environment (it should be since it is in the dependencies): python -c "import gpustat"
?
Edit: fixed command to check import
Ran into this issue. Just needed a note to run pip install GPUtil or add it to requirements, and this error goes away.
However, in my case this appears to result in the Ray instance just hanging indefinitely on:
2023-08-25 18:24:03,231 INFO worker.py:1636 -- Started a local Ray instance.
Which I assume is independent of vLLM, so I have asked about this in the Ray GitHub.
A better solution to this is PR #35581 (in my opinion, since I made the PR :) ) which removes GPUtil in favor of gpustat which is already used elsewhere in ray. Unfortunately that PR has not been reviewed.
@XuehaiPan
I'm catching up here. What's your recommended way of auto detecting the number of gpus that works on all platforms: we have GPUtil, gpustat, nvidia-ml-py, and your nvitop/libcuda.py.
I can think of these two "simplest" ways to get the count of GPUs without depending on a high-level library such as gpustat or nvitop (be aware it's GPLv3):

1. nvidia-smi -L | wc -l (or counting the # of lines from nvidia-smi -L on Windows), though this depends on an external binary. It might be possible that the GPU drivers are working but this external tool nvidia-smi is just unavailable.
2. pynvml: pynvml.nvmlInit() + pynvml.nvmlDeviceGetCount(). Depends on an external Python library, nvidia-ml-py.

If you prefer neither of these, you can write a native extension that uses nvml.h and the shared library from the NVIDIA driver to directly call the C API nvmlDeviceGetCount_v2. No dependencies, but at the cost of all the build hassles. (A combined sketch of the first two approaches is shown below.)
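A small combined sketch of the two approaches above (assuming an NVIDIA driver is installed; the helper name is made up):

import shutil
import subprocess

def count_gpus():
    # 1) Prefer NVML via the nvidia-ml-py package, if importable.
    try:
        import pynvml
        pynvml.nvmlInit()
        try:
            return pynvml.nvmlDeviceGetCount()
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        pass  # NVML unavailable or failed to initialize
    # 2) Fall back to counting the device lines printed by `nvidia-smi -L`.
    if shutil.which("nvidia-smi") is None:
        return 0
    try:
        output = subprocess.check_output(["nvidia-smi", "-L"], text=True)
    except subprocess.CalledProcessError:
        return 0
    return sum(1 for line in output.splitlines() if line.strip())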
What's your recommended way of auto detecting the number of gpus that works on all platforms?
@jjyao This depends on whether we are detecting the number of GPUs on the system or the number of CUDA visible devices (respecting the CUDA_VISIBLE_DEVICES environment variable). If I understand correctly, we are interested in CUDA devices rather than physical GPU devices.
To get the number of GPUs on the system:

1. nvidia-smi + subprocess (NB: GPUtil is a wrapper around nvidia-smi):

import subprocess

# May need error handling logic
output = subprocess.check_output(['nvidia-smi', '-L']).decode()
num_gpus = output.count('UUID: GPU-')
num_migs = output.count('UUID: MIG-')
2. gpustat:

import gpustat

num_gpus = gpustat.gpu_count()
# num_migs not supported
3. nvidia-ml-py:

import pynvml

# May need error handling logic
pynvml.nvmlInit()
num_gpus = pynvml.nvmlDeviceGetCount()
num_migs = 0
for index in range(num_gpus):
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    try:
        max_mig_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)
    except pynvml.NVMLError_NotSupported:
        continue
    for mig_index in range(max_mig_count):
        try:
            mig_handle = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, mig_index)
        except pynvml.NVMLError_NotFound:
            break
        num_migs += 1
pynvml.nvmlShutdown()
4. nvitop (add a new dependency):

from nvitop import Device, MigDevice

num_gpus = Device.count()
num_migs = MigDevice.count()
5. Write a native extension that includes nvml.h and links to libnvidia-ml.so.1, as @wookayin suggested.

To get the number of CUDA visible devices (respecting the CUDA_VISIBLE_DEVICES environment variable):

1. nvidia-smi, nvidia-ml-py, gpustat:
- need extra parsing logic for the CUDA_VISIBLE_DEVICES environment variable.
- need to check that the identifiers in CUDA_VISIBLE_DEVICES are valid.

2. torch (add a huge dependency):

import torch

# Note that torch also uses `nvidia-ml-py + parsing logic` to detect CUDA visible devices.
# This result may not always be correct.
num_cuda_visible_devices = torch.cuda.device_count()
3. tensorflow (add a huge dependency):

import tensorflow as tf

num_cuda_visible_devices = len(tf.config.list_physical_devices('GPU'))
4. jax (add a huge dependency):

import jax

num_cuda_visible_devices = jax.device_count('gpu')
5. nvitop (add a new dependency):

from nvitop import CudaDevice

num_cuda_visible_devices = CudaDevice.count()

# or
from nvitop import Device

num_cuda_visible_devices = Device.cuda.count()

# or
num_cuda_visible_devices = len(Device.parse_cuda_visible_devices())

nvitop also provides a utility function to parse the CUDA_VISIBLE_DEVICES environment variable:
>>> import os
>>> from nvitop import parse_cuda_visible_devices
>>> os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '6,5'
>>> parse_cuda_visible_devices() # parse the `CUDA_VISIBLE_DEVICES` environment variable to NVML indices
[6, 5]
>>> parse_cuda_visible_devices('0,4') # pass the `CUDA_VISIBLE_DEVICES` value explicitly
[0, 4]
>>> parse_cuda_visible_devices('GPU-18ef14e9,GPU-849d5a8d') # accept abbreviated UUIDs
[5, 6]
>>> parse_cuda_visible_devices(None) # get all devices when the `CUDA_VISIBLE_DEVICES` environment variable unset
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> parse_cuda_visible_devices('MIG-d184f67c-c95f-5ef2-a935-195bd0094fbd') # MIG device support (MIG UUID)
[(0, 0)]
>>> parse_cuda_visible_devices('MIG-GPU-3eb79704-1571-707c-aee8-f43ce747313d/13/0') # MIG device support (GPU UUID)
[(0, 1)]
>>> parse_cuda_visible_devices('MIG-GPU-3eb79704/13/0') # MIG device support (abbreviated GPU UUID)
[(0, 1)]
>>> parse_cuda_visible_devices('') # empty string
[]
>>> parse_cuda_visible_devices('0,0') # invalid `CUDA_VISIBLE_DEVICES` (duplicate device ordinal)
[]
>>> parse_cuda_visible_devices('16') # invalid `CUDA_VISIBLE_DEVICES` (device ordinal out of range)
[]
6. Write a native extension that includes cuda.h and links to libcuda.so.1.
I can think of these two "simplest" ways to get the count of GPUs without depending on a high-level library such as gpustat or nvitop (be aware it's GPLv3)
@wookayin About the concern of nvitop's license: nvitop is released under a dual license, Apache-2.0 + GPL-3.0. The CLI part of nvitop is released under the GPL-3.0 license, while the API part is released under the Apache-2.0 license. See nvitop: copyright-notice.
import nvitop # Apache-2.0
from nvitop import * # Apache-2.0
from nvitop import Device # Apache-2.0
from nvitop import gui # GPL-3.0
If I understand correctly, we are interested in CUDA devices rather than physical GPU devices.
@XuehaiPan, currently Ray uses GPUtil to detect the number of physical GPU devices and also looks at CUDA_VISIBLE_DEVICES; the min of the two is the number of GPU resources available to Ray.
currently Ray uses GPUtil to detect the number of physical GPU devices and also looks at CUDA_VISIBLE_DEVICES; the min of the two is the number of GPU resources available to Ray.
@jjyao That's the number of CUDA visible devices.
If we only count the number of items in CUDA_VISIBLE_DEVICES
, it could lead to wrong results. For example, if you have 8 GPUs on the system but set CUDA_VISIBLE_DEVICES="0,1,2,3,0"
, you will get 0 CUDA devices rather than 5.
Parsing the CUDA_VISIBLE_DEVICES variable via NVML APIs is non-trivial. There are many corner cases:

- MIG device identifiers come in two formats: MIG-GPU-<GPU-UUID>/<GI>/<CI> (pre-R470 drivers) and MIG-<MIG-UUID> (R470+).
- if there are both MIG-enabled and MIG-disabled GPUs on the system, CUDA will enumerate the MIG devices first (when CUDA_VISIBLE_DEVICES is unset or set with MIG-enabled GPUs)
- ...

IMO, the best practice is to use the CUDA driver library libcuda or the CUDA Runtime library libcudart to do this.
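For example, a hedged sketch (Linux only, assuming libcuda.so.1 from the NVIDIA driver is present; not the approach Ray adopted) of counting CUDA visible devices through the driver API with ctypes:

import ctypes

def cuda_visible_device_count():
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return 0  # NVIDIA driver not installed
    # CUDA_SUCCESS == 0; cuInit already applies CUDA_VISIBLE_DEVICES filtering.
    if libcuda.cuInit(0) != 0:
        return 0
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return 0
    return count.value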
If we only count the number of items in CUDA_VISIBLE_DEVICES, it could lead to wrong results. For example, if you have 8 GPUs on the system but set CUDA_VISIBLE_DEVICES="0,1,2,3,0", you will get 0 CUDA devices rather than 5.
@XuehaiPan, why will I get 0 CUDA devices? Currently Ray simply does a split(",") with no validation at all.
@jjyao Because there is a duplicate identifier 0 in CUDA_VISIBLE_DEVICES. Such an invalid CUDA_VISIBLE_DEVICES value causes CUDA programs to see no CUDA devices at all.
As I commented in https://github.com/ray-project/ray/issues/28064#issuecomment-1793365510, simply splitting the comma-separated list is insufficient to validate the environment variable.
CUDA_VISIBLE_DEVICES = os.getenv('CUDA_VISIBLE_DEVICES', ','.join(map(str, range(num_physical_gpus))))
num_cuda_devices = min(num_physical_gpus, len(CUDA_VISIBLE_DEVICES.split(',')))
This simple counting goes wrong in several cases:

- duplicated identifiers
  - CUDA_VISIBLE_DEVICES="0,1,2,3,0" will get 0 CUDA devices rather than 5.
- invalid identifiers
  - CUDA_VISIBLE_DEVICES="8" will get 0 CUDA devices on an 8-GPU system.
  - CUDA_VISIBLE_DEVICES="0,1,8,2,3" will get 2 CUDA devices rather than 5 on an 8-GPU system.
- mixing integers and UUIDs
  - CUDA_VISIBLE_DEVICES="0,1,GPU-a1b2c3d4-e5f6,3" will get 2 CUDA devices rather than 4.
  - CUDA_VISIBLE_DEVICES="GPU-a1b2c3d4-e5f6,0,1,3" will get 1 CUDA device rather than 4.
- MIG device enumeration (for now, at most one MIG device can be used by a CUDA program)
  - CUDA_VISIBLE_DEVICES="MIG-a1b2c3d4,MIG-e5f6a7b8" will get 1 CUDA device rather than 2.
  - if there are both MIG-enabled and MIG-disabled GPUs on the system, CUDA will enumerate the MIG devices first (when CUDA_VISIBLE_DEVICES is unset or set with MIG-enabled GPUs)
You can verify the above cases via normalize_cuda_visible_devices:
In [1]: from nvitop import normalize_cuda_visible_devices
In [2]: normalize_cuda_visible_devices('0,1,2,3,0')
Out[2]: ''
In [3]: normalize_cuda_visible_devices('8')
Out[3]: ''
In [4]: normalize_cuda_visible_devices('0,1,8,2,3')
Out[4]: 'GPU-d8f503ec-bb34-4304-0053-5d9e62044184,GPU-7758abd0-62e7-a1db-c57e-084f6bc96b11'
In [5]: normalize_cuda_visible_devices('0,1,GPU-1b9855b7,3')
Out[5]: 'GPU-d8f503ec-bb34-4304-0053-5d9e62044184,GPU-7758abd0-62e7-a1db-c57e-084f6bc96b11'
In [6]: normalize_cuda_visible_devices('GPU-1b9855b7,0,1,3')
Out[6]: 'GPU-1b9855b7-640c-7c18-1b53-3d69a2ea51c4'
@XuehaiPan do you know the difference between https://pypi.org/project/nvidia-ml-py/ and https://pypi.org/project/pynvml/. Which is the official binding provided by Nvidia?
The pynvml package says
As of version 11.0.0, the NVML-wrappers used in pynvml are identical to those published through nvidia-ml-py.
PR #41020 uses pynvml instead of gpustat.
https://github.com/ray-project/ray/pull/41020 is merged, which now auto-detects GPUs with the nvidia-ml lib.
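For reference, a rough illustration of the overall approach discussed in this thread (not the exact code from the merged PR): detect physical GPUs with NVML and cap the result by the number of entries in CUDA_VISIBLE_DEVICES, if set.

import os
import pynvml

def num_gpu_resources():
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return 0  # no NVIDIA driver / no GPUs
    try:
        num_physical = pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return num_physical
    if not visible.strip():
        return 0
    # Naive comma-split; see the corner cases discussed above.
    return min(num_physical, len(visible.split(",")))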
What happened + What you expected to happen
I wanted to run ray.init() in a Jupyter Notebook under security-hardened Openshift 3.11 (on RHEL 7.x) on a node with GPUs. The Python script was previously tested to work fine on a dev server under plain docker (on Centos Stream 8) on a machine without any GPUs (with default docker capabilities but a custom UID).

Error message:
Expected: Remove any code lines reading from root-owned folders, like /proc/driver/nvidia/gpus, at least in:
- _get_gpu_info_string
- _autodetect_num_gpus

I'm almost certain you can find the info you need using the NVIDIA utility nvidia-smi (see the available info using its -h switch).

Versions / Dependencies
Reproduction script

Run ray.init() as a non-root user on a linux machine with a GPU and its appropriate driver installed. For example, using this code snippet:
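A minimal snippet consistent with the description above (the exact original snippet is not reproduced in this extract):

import ray

ray.init()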
This was reproduced in our GPU-enabled Jupyter Notebook container (mirekphd/ml-gpu-py38-cuda112-cust:latest) under Openshift 3.11 (which runs containers under non-root users with random UIDs).

Issue Severity
High: It blocks me from completing my task.