ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray cannot access GPUs under a non-root user (failed access of ray.init() to root-owned `/proc/driver/nvidia/gpus`) #28064

Closed: mirekphd closed this issue 10 months ago

mirekphd commented 2 years ago

What happened + What you expected to happen

I wanted to run ray.init() in a Jupyter Notebook under security-hardened OpenShift 3.11 (on RHEL 7.x) on a node with GPUs. The Python script had previously been tested to work fine on a dev server under plain Docker (on CentOS Stream 8) on a machine without any GPUs (with default Docker capabilities but a custom UID).

Error message:

2022-08-23 10:20:32,243 ERROR resource_spec.py:193 -- Could not parse gpu information.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/resource_spec.py", line 189, in resolve
    info_string = _get_gpu_info_string()
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/resource_spec.py", line 351, in _get_gpu_info_string
    gpu_dirs = os.listdir(proc_gpus_path)
PermissionError: [Errno 13] Permission denied: '/proc/driver/nvidia/gpus'
2022-08-23 10:20:32,246 WARNING services.py:1882 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2022-08-23 10:20:32,402 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://<redacted>:8265 

Expected: Remove any code that reads from root-owned folders like /proc/driver/nvidia/gpus, at least in the GPU autodetection path shown in the traceback above.

I'm almost certain you can find the info you need using the NVIDIA nvidia-smi utility (see the available options with its -h switch).

Versions / Dependencies

$ pip freeze | grep ray
lightgbm-ray==0.1.5
ray==2.0.0
xgboost-ray==0.1.10

Reproduction script

Run ray.init() as a non-root user on a linux machine with a GPU and its appropriate driver installed.

For example using this code snippet:

import ray

ray_cluster_num_cpus = 32
num_gpus = 1

ray_dashboard_host="0.0.0.0" # external
ray_dashboard_port=8265

ray.shutdown()

ray.init(num_cpus=ray_cluster_num_cpus,
         num_gpus=num_gpus,
         include_dashboard=True,
         dashboard_host=ray_dashboard_host,
         dashboard_port=ray_dashboard_port)

This was reproduced in our GPU-enabled Jupyter Notebook container (mirekphd/ml-gpu-py38-cuda112-cust:latest) under Openshift 3.11 (which runs containers under non-root users with random UIDs):

# we are running as user with high ID:
$ id
uid=1000150000(jovyan) gid=100(users) groups=100(users),1000150000

#... but the folder Ray tries to access (/proc/driver/nvidia/gpus) is root-owned
$ ls -lant /proc/driver/
total 0
dr-xr-xr-x.    2 0 0   0 Aug 23 10:22 nvidia-caps
dr-xr-xr-x.    3 0 0   0 Aug 23 10:22 nvidia-nvlink
dr-xr-xr-x.    3 0 0   0 Aug 23 10:22 nvidia-nvswitch
dr-xr-xr-x.    4 0 0   0 Aug 23 10:22 nvidia-uvm
-r--r--r--.    1 0 0   0 Aug 23 10:22 nvram
-r--r--r--.    1 0 0   0 Aug 23 10:22 rtc
dr-xr-xr-x.    3 0 0 120 Aug 23 09:26 nvidia
dr-xr-xr-x.    7 0 0   0 Aug 23 09:26 .
dr-xr-xr-x. 1684 0 0   0 Aug 23 09:26 ..

Issue Severity

High: It blocks me from completing my task.

xwjiang2010 commented 2 years ago

Thanks. I think there is no hard reason we cannot do that. We just need to do proper parsing, since nvidia-smi prints human-readable output.

mirekphd commented 2 years ago

As a tangential issue, I think it should also be possible to prevent Ray from making any GPU property checks by using a flag that disables GPUs (sets their desired number to zero). After all, the user may have a stripped-down or hardened system without the nvidia-smi utility available (or not on the system path).

Currently neither server startup option [1] prevents Ray from trying to access the root-owned folder (/proc/driver/nvidia/gpus), so the error [2] is reported spuriously: setting the number of GPUs to 0 does allow the Ray servers to start correctly (which I verified using both available server startup methods).

[1] the --num-gpus switch of `ray start --head` and the num_gpus argument of the `ray.init()` Python method
[2] PermissionError: [Errno 13] Permission denied: '/proc/driver/nvidia/gpus'

wuisawesome commented 2 years ago

Hmmm yeah, if num_gpus is hardcoded we shouldn't do any of the autodetection logic, that's not good
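
A minimal sketch of that guard, for illustration only (the resolve_num_gpus wrapper is hypothetical; _autodetect_num_gpus() stands in for Ray's existing detection helper):

def _autodetect_num_gpus():
    # Placeholder for Ray's real detection logic (GPUtil / platform probing).
    return 0

def resolve_num_gpus(requested_num_gpus):
    if requested_num_gpus is not None:
        # Hardcoded by the user via ray.init(num_gpus=...) or `ray start --num-gpus`
        # (including 0): never probe nvidia-smi or /proc/driver/nvidia/gpus.
        return requested_num_gpus
    # Fall back to autodetection only when the user did not pin a GPU count.
    return _autodetect_num_gpus()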

mirekphd commented 2 years ago

A closely related root-requirement issue affects the dashboard server's access to the / folder, which is not necessarily accessible to the user running Ray Core (unlike ~/, which is much more likely to be accessible).

This prevents the Ray Dashboard (in the new UI) from measuring available disk resources, so zeros are currently (and incorrectly) displayed in the Disk(/) and Disk(root) fields: because / is root-owned, it appears to have no free disk space at all.
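
A rough sketch of the fallback being suggested here (the helper name is made up; this is not the dashboard's actual code): measure disk usage at the user's home directory when / cannot be read.

import os
import shutil

def measure_disk_usage():
    # Try the root filesystem first, then fall back to the (usually accessible) home dir.
    for path in ("/", os.path.expanduser("~")):
        try:
            return path, shutil.disk_usage(path)  # (total, used, free) in bytes
        except OSError:
            continue
    return None, None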

mirekphd commented 2 years ago

Fortunately, the task of parsing nvidia-smi output has already been accomplished by the GPUtil package (see its GitHub page), which Ray already uses for this very purpose. However, Ray fails silently if the package is not installed, without any warning or installation recommendation (GPUtil should probably be added to requirements.txt, and definitely described in the docs):

    if importlib.util.find_spec("GPUtil"):
        gpu_list = GPUtil.getGPUs()
        result = len(gpu_list)
    # SILENTLY FAILS ON MISSING GPUtil...
    # TODO: USE `else` AND PRINT WARNING AND RECOMMEND GPUtil INSTALLATION
    elif sys.platform.startswith("linux"):
        # TRIES TO ACCESS ROOT-OWNED FOLDER HERE, THUS PRODUCING THE ERROR DESCRIBED IN #28064
        ...
    elif sys.platform == "win32":
        ...
    return result

[ excerpt from _autodetect_num_gpus() ]
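
A sketch of the `else` branch with the warning that the comments above ask for (not Ray's actual code, just the shape of the suggested fix):

import importlib.util
import sys
import warnings

def autodetect_num_gpus_sketch():
    result = 0
    if importlib.util.find_spec("GPUtil"):
        import GPUtil
        result = len(GPUtil.getGPUs())
    else:
        warnings.warn(
            "GPUtil is not installed; falling back to platform-specific GPU "
            "detection, which may fail on hardened systems. "
            "Consider `pip install GPUtil`."
        )
        if sys.platform.startswith("linux"):
            # The branch that currently reads the root-owned
            # /proc/driver/nvidia/gpus folder would go here.
            pass
    return result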

More info

Here's more color on how to safely and reliably detect the number of available GPUs, which can be used to improve the existing solution in _autodetect_num_gpus():

>>> import GPUtil

>>> GPUtil.getGPUs()
[<GPUtil.GPUtil.GPU object at 0x7f8ab7907f10>]

>>> gpus = GPUtil.getGPUs()

# a very liberal check, which will always succeed in the physical presence of a GPU, including cards fully utilized by noisy neighbors (no free VRAM and 100% load)
>>> gpu_avail = GPUtil.getAvailability(gpus, maxLoad=1.0, maxMemory=1.0, includeNan=False, excludeID=[], excludeUUID=[])
>>> gpu_avail
[1]

# versus excessively conservative check, which will likely never succeed (there is always some minimal VRAM usage even in headless servers)
>>> gpu_avail = GPUtil.getAvailability(gpus, maxLoad=0.0, maxMemory=0.0, includeNan=False, excludeID=[], excludeUUID=[])
>>> gpu_avail
[0]

We just need to do proper parsing, since nvidia-smi prints human-readable output.

wuisawesome commented 2 years ago

@mirekphd are you able to help contribute the solution to this? I think we're concerned about putting a small dependency like GPUtil into ray's core dependencies, but vendoring the library seems like a reasonable approach.

mirekphd commented 2 years ago

@wuisawesome sure, I can try a PR later this week. I even started drafting one, but lost momentum after seeing how well GPUtil worked (it detected datacenter GPUs spanning two generations) and realizing it might be re-included as a requirement. But I agree it is rather too old and unmaintained to be included as such.

As a starting point I could try to re-use the existing solution from the UtilMonitor class: https://github.com/ray-project/ray/blob/b6765bb4f36dea63e1769bb15ad001586830379d/python/ray/tune/utils/util.py#L64-L65 https://github.com/ray-project/ray/blob/b6765bb4f36dea63e1769bb15ad001586830379d/python/ray/tune/utils/util.py#L36-L41

XuehaiPan commented 2 years ago

I think we're concerned about putting a small dependency like GPUtil into ray's core dependencies, but vendoring the library seems like a reasonable approach.

This would resolve the issue.

See also: https://github.com/ray-project/ray/issues/17914#issuecomment-1236095136

asm582 commented 1 year ago

+1

mattip commented 1 year ago

Note that GPUtil fails if nvidia-smi exists and can be run but the user is not admin or there is no NVIDIA card. Since Ray already depends on gpustat, could we use that instead?
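
Something along these lines, assuming gpustat's new_query() interface (a sketch, not a tested implementation): count GPUs with gpustat and treat any failure, such as no driver, no card, or no permission, as zero GPUs.

import gpustat

def count_gpus_via_gpustat():
    try:
        return len(gpustat.new_query().gpus)
    except Exception:
        # NVML / nvidia-smi unavailable, no NVIDIA card, or insufficient permissions.
        return 0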

mattip commented 1 year ago

Hmm, it seems gpustat also does not work on Windows if the card is not found. wookayin/gpustat#142

mattip commented 1 year ago

fwiw, https://github.com/wookayin/gpustat/issues/142 was solved.

wuisawesome commented 1 year ago

@mattip what's the recommended action item here? update our gpustat version and change the implementation to use that? (also do you have a sense of how difficult this would be?)

mattip commented 1 year ago

The gpustat maintainer has been cooperative and it does not make sense to be using two different libraries in two parts of ray. So I would suggest using gpustat where available. Note gpustat only works with NVIDIA cards at the moment. I would think the implementation would be similar to other code already existing in ray.

wuisawesome commented 1 year ago

That makes sense to me. +1 to doing it (assuming you have the bandwidth, because I don't at the moment :p)

mattip commented 1 year ago

OK

mattip commented 1 year ago

@mirekphd could you confirm that gpustat is already in your ray environment (it should be since it is in the dependencies): python -c "import gpustat"?

Edit: fixed command to check import

AlexanderOllman commented 1 year ago

Ran into this issue. Just a note: running pip install GPUtil (or adding it to requirements) makes this error go away.

However, in my case this appears to result in the Ray instance just hanging indefinitely on:

2023-08-25 18:24:03,231 INFO worker.py:1636 -- Started a local Ray instance.

I assume this is independent of vLLM, so I have asked about it here in the Ray GitHub.

mattip commented 1 year ago

A better solution to this is PR #35581 (in my opinion, since I made the PR :) ), which removes GPUtil in favor of gpustat, which is already used elsewhere in Ray. Unfortunately, that PR has not been reviewed.

jjyao commented 10 months ago

@XuehaiPan

I'm catching up here. What's your recommended way of auto-detecting the number of GPUs that works on all platforms? We have GPUtil, gpustat, nvidia-ml-py, and your nvitop/libcuda.py.

wookayin commented 10 months ago

I can think of these two "simplest" ways to get the count of GPUs without depending on a high-level library such as gpustat or nvitop (be aware it's GPLv3):

If you prefer neither of these, you can write a native extension that uses nvml.h and the shared library from the nvidia driver to directly call the C API nvmlDeviceGetCount_v2. No dependencies, but at the cost of all the build hassles.
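
For reference, the same NVML call can also be reached without a build step via ctypes; this is only a sketch and assumes libnvidia-ml.so.1 from the NVIDIA driver is present (Linux):

import ctypes

def nvml_device_count():
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        return 0  # driver library not installed
    if nvml.nvmlInit_v2() != 0:  # 0 == NVML_SUCCESS
        return 0
    try:
        count = ctypes.c_uint()
        if nvml.nvmlDeviceGetCount_v2(ctypes.byref(count)) != 0:
            return 0
        return count.value
    finally:
        nvml.nvmlShutdown()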

XuehaiPan commented 10 months ago

What's your recommended way of auto-detecting the number of GPUs that works on all platforms?

@jjyao This depends on whether we are detecting the number of GPUs on the system or the number of CUDA visible devices (respecting the CUDA_VISIBLE_DEVICES environment variable). If I understand correctly, we are interested in CUDA devices rather than physical GPU devices.

To detect GPU devices installed on the system

  1. nvidia-smi + subprocess (NB: GPUtil is a wrapper around nvidia-smi):
import subprocess

# May need error handling logic
output = subprocess.check_output(['nvidia-smi', '-L']).decode()
num_gpus = output.count('UUID: GPU-')
num_migs = output.count('UUID: MIG-')

  2. gpustat:
import gpustat

num_gpus = gpustat.gpu_count()
# num_migs not supported

  3. nvidia-ml-py:
import pynvml

# May need error handling logic
pynvml.nvmlInit()
num_gpus = pynvml.nvmlDeviceGetCount()

num_migs = 0
for index in range(num_gpus):
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    try:
        max_mig_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)
    except pynvml.NVMLError_NotSupported:
        continue
    for mig_index in range(max_mig_count):
        try:
            mig_handle = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, mig_index)
        except pynvml.NVMLError_NotFound:
            break
        num_migs += 1

pynvml.nvmlShutdown()

  4. nvitop (add a new dependency):
from nvitop import Device, MigDevice

num_gpus = Device.count()
num_migs = MigDevice.count()

  5. Write a C extension with nvml.h and link to libnvidia-ml.so.1, as @wookayin suggested.

To detect CUDA visible devices

  1. nvidia-smi, nvidia-ml-py, gpustat:

    1. Detect the number of GPUs on the system.
    2. Parse the CUDA_VISIBLE_DEVICES environment variable.
    3. Verify that the devices specified in CUDA_VISIBLE_DEVICES are valid.
  2. torch (add a huge dependency):

import torch

# Note that torch also uses `nvidia-ml-py + parsing logic` to detect CUDA visible devices.
# This result may not always be correct.
num_cuda_visible_devices = torch.cuda.device_count()

  3. tensorflow (add a huge dependency):
import tensorflow as tf

num_cuda_visible_devices = len(tf.config.list_physical_devices('GPU'))

  4. jax (add a huge dependency):
import jax

num_cuda_visible_devices = jax.device_count('gpu')

  5. nvitop (add a new dependency):
from nvitop import CudaDevice

num_cuda_visible_devices = CudaDevice.count()

# or
from nvitop import Device

num_cuda_visible_devices = Device.cuda.count()

# or
num_cuda_visible_devices = len(Device.parse_cuda_visible_devices())

nvitop also provides a utility function to parse the CUDA_VISIBLE_DEVICES environment variable:

>>> import os
>>> os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '6,5'
>>> parse_cuda_visible_devices()        # parse the `CUDA_VISIBLE_DEVICES` environment variable to NVML indices
[6, 5]

>>> parse_cuda_visible_devices('0,4')   # pass the `CUDA_VISIBLE_DEVICES` value explicitly
[0, 4]

>>> parse_cuda_visible_devices('GPU-18ef14e9,GPU-849d5a8d')  # accept abbreviated UUIDs
[5, 6]

>>> parse_cuda_visible_devices(None)    # get all devices when the `CUDA_VISIBLE_DEVICES` environment variable unset
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> parse_cuda_visible_devices('MIG-d184f67c-c95f-5ef2-a935-195bd0094fbd')           # MIG device support (MIG UUID)
[(0, 0)]
>>> parse_cuda_visible_devices('MIG-GPU-3eb79704-1571-707c-aee8-f43ce747313d/13/0')  # MIG device support (GPU UUID)
[(0, 1)]
>>> parse_cuda_visible_devices('MIG-GPU-3eb79704/13/0')                              # MIG device support (abbreviated GPU UUID)
[(0, 1)]

>>> parse_cuda_visible_devices('')      # empty string
[]
>>> parse_cuda_visible_devices('0,0')   # invalid `CUDA_VISIBLE_DEVICES` (duplicate device ordinal)
[]
>>> parse_cuda_visible_devices('16')    # invalid `CUDA_VISIBLE_DEVICES` (device ordinal out of range)
[]

  6. Write a C extension with cuda.h and link to libcuda.so.1.

I can think of these two "simplest" ways to get the count of GPUs without depending on a high-level library such as gpustat or nvitop (be aware it's GPLv3)

@wookayin Regarding the concern about nvitop's license: nvitop is released under a dual license, Apache-2.0 + GPL-3.0. The CLI part of nvitop is released under GPL-3.0, while the API part is released under Apache-2.0. See nvitop: copyright-notice.

import nvitop              # Apache-2.0
from nvitop import *       # Apache-2.0
from nvitop import Device  # Apache-2.0

from nvitop import gui     # GPL-3.0

jjyao commented 10 months ago

If I understand correctly, we are interested in CUDA devices rather than physical GPU devices.

@XuehaiPan, currently Ray uses GPUtil to detect the number of physical GPU devices and also looks at CUDA_VISIBLE_DEVICES; the minimum of the two is the number of GPU resources available to Ray.

XuehaiPan commented 10 months ago

currently Ray uses GPUtil to detect the number of physical GPU devices and also looks at CUDA_VISIBLE_DEVICES; the minimum of the two is the number of GPU resources available to Ray.

@jjyao That's the number of CUDA visible devices.

If we only count the number of items in CUDA_VISIBLE_DEVICES, it could lead to wrong results. For example, if you have 8 GPUs on the system but set CUDA_VISIBLE_DEVICES="0,1,2,3,0", you will get 0 CUDA devices rather than 5.

Parsing the CUDA_VISIBLE_DEVICES via NVML APIs is non-trivial. There are many corner cases:

  1. duplicated identifiers
  2. invalid identifiers
  3. mixing integers and UUIDs
  4. UUID abbreviation
  5. MIG device enumeration (currently, at most one MIG device can be used by a CUDA program)
  6. two UUID formats for MIG devices: MIG-GPU-<GPU-UUID>/<GI>/<CI> pre-R470 and MIG-<MIG-UUID> (R470+)
  7. if there are both MIG-enabled and MIG-disabled GPUs on the system, CUDA will enumerate the MIG device first (when CUDA_VISIBLE_DEVICES is unset or set with MIG-enabled GPUs)

    ...

IMO, the best practice is to use the CUDA driver library libcuda or the CUDA Runtime library libcudart to do this.
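
For example, a minimal ctypes sketch against the CUDA driver library (assumes libcuda.so.1 is installed; the driver honours CUDA_VISIBLE_DEVICES during enumeration, so this returns the CUDA-visible count rather than the physical count):

import ctypes

def cuda_visible_device_count():
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return 0  # NVIDIA driver not installed
    if libcuda.cuInit(0) != 0:  # 0 == CUDA_SUCCESS
        return 0
    count = ctypes.c_int()
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return 0
    return count.value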

jjyao commented 10 months ago

If we only count the number of items in CUDA_VISIBLE_DEVICES, it could lead to wrong results. For example, if you have 8 GPUs on the system but set CUDA_VISIBLE_DEVICES="0,1,2,3,0", you will get 0 CUDA devices rather than 5.

@XuehaiPan, why would I get 0 CUDA devices? Currently Ray simply does a split(",") with no validation at all.

XuehaiPan commented 10 months ago

If we only count the number of items in CUDA_VISIBLE_DEVICES, it could lead to wrong results. For example, if you have 8 GPUs on the system but set CUDA_VISIBLE_DEVICES="0,1,2,3,0", you will get 0 CUDA devices rather than 5.

@XuehaiPan, why would I get 0 CUDA devices? Currently Ray simply does a split(",") with no validation at all.

@jjyao Because there is a duplicate identifier 0 in CUDA_VISIBLE_DEVICES. An invalid CUDA_VISIBLE_DEVICES value results in no CUDA devices being visible to CUDA programs.

As I commented in https://github.com/ray-project/ray/issues/28064#issuecomment-1793365510, simply splitting the comma-separated list is insufficient to validate the environment variable.

CUDA_VISIBLE_DEVICES = os.getenv('CUDA_VISIBLE_DEVICES', ','.join(map(str, range(num_physical_gpus))))
num_cuda_devices = min(num_physical_gpus, len(CUDA_VISIBLE_DEVICES.split(',')))

  1. duplicated identifiers

    • CUDA_VISIBLE_DEVICES="0,1,2,3,0" will get 0 CUDA devices rather than 5.
  2. invalid identifiers

    • CUDA_VISIBLE_DEVICES="8" will get 0 CUDA devices on an 8-GPU system.
    • CUDA_VISIBLE_DEVICES="0,1,8,2,3" will get 2 CUDA devices rather than 5 on an 8-GPU system.
  3. mixing integers and UUIDs

    • CUDA_VISIBLE_DEVICES="0,1,GPU-a1b2c3d4-e5f6,3" will get 2 CUDA devices rather than 4.
    • CUDA_VISIBLE_DEVICES="GPU-a1b2c3d4-e5f6,0,1,3" will get 1 CUDA device rather than 4.
  4. MIG device enumeration (currently, at most one MIG device can be used by a CUDA program)

    • CUDA_VISIBLE_DEVICES="MIG-a1b2c3d4,MIG-e5f6a7b8" will get 1 CUDA device rather than 2.
  5. if there are both MIG-enabled and MIG-disabled GPUs on the system, CUDA will enumerate the MIG device first (when CUDA_VISIBLE_DEVICES is unset or set with MIG-enabled GPUs)


You can verify the above cases via normalize_cuda_visible_devices:

In [1]: from nvitop import normalize_cuda_visible_devices

In [2]: normalize_cuda_visible_devices('0,1,2,3,0')
Out[2]: ''

In [3]: normalize_cuda_visible_devices('8')
Out[3]: ''

In [4]: normalize_cuda_visible_devices('0,1,8,2,3')
Out[4]: 'GPU-d8f503ec-bb34-4304-0053-5d9e62044184,GPU-7758abd0-62e7-a1db-c57e-084f6bc96b11'

In [5]: normalize_cuda_visible_devices('0,1,GPU-1b9855b7,3')
Out[5]: 'GPU-d8f503ec-bb34-4304-0053-5d9e62044184,GPU-7758abd0-62e7-a1db-c57e-084f6bc96b11'

In [6]: normalize_cuda_visible_devices('GPU-1b9855b7,0,1,3')
Out[6]: 'GPU-1b9855b7-640c-7c18-1b53-3d69a2ea51c4'

jjyao commented 10 months ago

@XuehaiPan do you know the difference between https://pypi.org/project/nvidia-ml-py/ and https://pypi.org/project/pynvml/? Which is the official binding provided by NVIDIA?

mattip commented 10 months ago

The pynvml package says

As of version 11.0.0, the NVML-wrappers used in pynvml are identical to those published through nvidia-ml-py.

PR #41020 uses pynvml instead of gpustat.

jonathan-anyscale commented 10 months ago

https://github.com/ray-project/ray/pull/41020 is merged, which now auto-detects GPUs with the nvidia-ml library.