ray-project/ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[<Ray component: Core>] num_gpus not working with ROCM devices #46563

Open · erichsu0527 opened this issue 1 month ago

erichsu0527 commented 1 month ago

What happened + What you expected to happen

I'm testing the num_gpus feature on a machine with 4 MI210 GPUs. When running test_amd.py with ROCR_VISIBLE_DEVICES=0,1 and num_gpus=1 on the task and the actor, I expect each of them to see exactly 1 GPU. However, the actor still gets ROCR_VISIBLE_DEVICES=0,1, torch sees 4 devices, and it crashes while trying to get device properties.

When running the same code on an Nvidia machine (changing ROCR_VISIBLE_DEVICES to CUDA_VISIBLE_DEVICES), the actor and the task each see 1 GPU as expected.
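
For reference, the expected behavior can be condensed into a quick check like the one below. This is a hypothetical, minimal variant of the full reproduction script further down (the function name check_gpu_mask is my own); it passes on the Nvidia machine but fails on the AMD machine:

import os
import ray
import torch

ray.init(num_gpus=2)

@ray.remote(num_gpus=1)
def check_gpu_mask():
    # The single device ID Ray assigned to this task, e.g. ['0'].
    ids = ray.get_runtime_context().get_accelerator_ids()["GPU"]
    # The mask Ray is supposed to narrow for this worker process.
    mask = os.environ.get("ROCR_VISIBLE_DEVICES") or os.environ.get("CUDA_VISIBLE_DEVICES")
    # Expectation: exactly one assigned ID and exactly one visible device.
    assert len(ids) == 1 and torch.cuda.device_count() == 1, (ids, mask)
    return ids, mask

print(ray.get(check_gpu_mask.remote()))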

Output from AMD machine:

(GPUActor pid=3134225) ##### actor
(GPUActor pid=3134225) GPU IDs: ['0']
(GPUActor pid=3134225) ROCR_VISIBLE_DEVICES: 0,1
(GPUActor pid=3134225) Number of GPUs: 4
(GPUActor pid=3134225) Device 0: _CudaDeviceProperties(name='AMD Instinct MI210', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
(GPUActor pid=3134225) Device 1: _CudaDeviceProperties(name='AMD Instinct MI210', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
Traceback (most recent call last):
  File "/media/disk1/eric/llm-server/llm-server/a.py", line 111, in <module>
    ray.get(gpu_actor.ping.remote())
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/ray/_private/worker.py", line 2639, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/ray/_private/worker.py", line 864, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::GPUActor.ping() (pid=3134225, ip=192.168.112.24, actor_id=c063fc0f56659c29bc775b3d01000000, repr=<a.GPUActor object at 0x742d1033dcd0>)
  File "/media/disk1/eric/llm-server/llm-server/a.py", line 92, in ping
    print(f"Device {i}: {torch.cuda.get_device_properties(i)}")
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/cuda/__init__.py", line 466, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/hip/HIPContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

Output from Nvidia machine:

(GPUActor pid=534668) ##### actor
(GPUActor pid=534668) GPU IDs: ['0']
(GPUActor pid=534668) CUDA_VISIBLE_DEVICES: 0
(GPUActor pid=534668) Number of GPUs: 1
(GPUActor pid=534668) Device 0: _CudaDeviceProperties(name='NVIDIA A100 80GB PCIe', major=8, minor=0, total_memory=81050MB, multi_processor_count=108)
(GPUActor pid=534668) #####
(gpu_task pid=534840) ##### task
(gpu_task pid=534840) GPU IDs: ['1']
(gpu_task pid=534840) CUDA_VISIBLE_DEVICES: 1
(gpu_task pid=534840) Number of GPUs: 1
(gpu_task pid=534840) Device 0: _CudaDeviceProperties(name='NVIDIA A100 80GB PCIe', major=8, minor=0, total_memory=81050MB, multi_processor_count=108)
(gpu_task pid=534840) #####

Versions / Dependencies

Docker image: https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm
Python: 3.9 (conda env built from the image; torch and ray are already installed)
torch: 2.4.0.dev20240612+rocm6.1
ray: 2.31.0
ROCm: 6.1.2.60102-119~20.04
GPU: MI210

Reproduction script

test_amd.py

import os
import ray
import torch

ray.init(num_gpus=2)

@ray.remote(num_gpus=1)
class GPUActor:
    def ping(self):
        print('##### actor')
        print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"]))
        print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"]))
        if torch.cuda.is_available():
            print(f"Number of GPUs: {torch.cuda.device_count()}")
            for i in range(torch.cuda.device_count()):
                print(f"Device {i}: {torch.cuda.get_device_properties(i)}")
        else:
            print("No GPUs available")
        print('#####')

@ray.remote(num_gpus=1)
def gpu_task():
    print('##### task')
    print("GPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["GPU"]))
    print("ROCR_VISIBLE_DEVICES: {}".format(os.environ["ROCR_VISIBLE_DEVICES"]))
    if torch.cuda.is_available():
        print(f"Number of GPUs: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"Device {i}: {torch.cuda.get_device_properties(i)}")
    else:
        print("No GPUs available")
    print('#####')

gpu_actor = GPUActor.remote()
ray.get(gpu_actor.ping.remote())
# The actor uses the first GPU so the task uses the second one.
ray.get(gpu_task.remote())

To run:

ROCR_VISIBLE_DEVICES=0,1 python test_amd.py
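
In the meantime, a possible workaround sketch (my own assumption, not documented Ray or ROCm behavior: it relies on torch.cuda not having been initialized yet inside the worker, so the narrowed mask is still honored when the HIP runtime starts up) is to re-apply the mask from Ray's accelerator IDs inside the actor:

import os
import ray
import torch

@ray.remote(num_gpus=1)
class MaskedGPUActor:
    def ping(self):
        # Devices Ray assigned to this actor, e.g. ['0'].
        gpu_ids = ray.get_runtime_context().get_accelerator_ids()["GPU"]
        # Re-apply the mask manually in case Ray did not narrow it.
        # Assumption: the HIP runtime has not been initialized in this
        # worker yet, otherwise changing the env var has no effect.
        os.environ["ROCR_VISIBLE_DEVICES"] = ",".join(gpu_ids)
        print(f"Number of GPUs: {torch.cuda.device_count()}")

I have not confirmed that this avoids the crash on my setup; it only sketches the idea.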

Issue Severity

High: It blocks me from completing my task.

rynewang commented 1 month ago

@vickytsang can you take a look at this? Thanks

vickytsang commented 4 weeks ago

I am not able to reproduce the issue in the environment below. Please retest with the latest Dockerfile.rocm. I will also retest with an MI210, as specified by the reporter, and report back, though I do not expect the results to be any different.

Docker image: built from Dockerfile.rocm at https://github.com/vllm-project/vllm/commit/7ecee3432110bae563c8756a66b54e5f08dc777d
Python: 3.9 (conda env built from the image; torch is already installed, ray is not)
torch: 2.5.0.dev20240726+rocm6.1
ray: 2.34.0
ROCm: 6.1.2

2024-08-07 23:43:27,414 INFO worker.py:1781 -- Started a local Ray instance.
(GPUActor pid=15430) ##### actor
(GPUActor pid=15430) GPU IDs: ['0']
(GPUActor pid=15430) ROCR_VISIBLE_DEVICES: 0
(GPUActor pid=15430) Number of GPUs: 1
(GPUActor pid=15430) Device 0: _CudaDeviceProperties(name='AMD Instinct MI300X', major=9, minor=4, gcnArchName='gfx942:sramecc+:xnack-', total_memory=196592MB, multi_processor_count=304, uuid=c172637a402d167b)
(GPUActor pid=15430) #####
(gpu_task pid=15553) ##### task
(gpu_task pid=15553) GPU IDs: ['1']
(gpu_task pid=15553) ROCR_VISIBLE_DEVICES: 1
(gpu_task pid=15553) Number of GPUs: 1
(gpu_task pid=15553) Device 0: _CudaDeviceProperties(name='AMD Instinct MI300X', major=9, minor=4, gcnArchName='gfx942:sramecc+:xnack-', total_memory=196592MB, multi_processor_count=304, uuid=84199e04b7dffe69)
(gpu_task pid=15553) #####

vickytsang commented 3 weeks ago

Verified on MI210.

Docker image: built from Dockerfile.rocm at https://github.com/vllm-project/vllm/commit/7ecee3432110bae563c8756a66b54e5f08dc777d
Python: 3.9 (conda env built from the image; torch is already installed)
torch: 2.5.0.dev20240726+rocm6.1
ray: 2.34.0
ROCm: 6.1.2

root@smc-dh144-dc20-u06:/workspace# ROCR_VISIBLE_DEVICES=0,1 python test.py
2024-08-09 16:54:25,793 INFO worker.py:1781 -- Started a local Ray instance.
(GPUActor pid=45113) ##### actor
(GPUActor pid=45113) GPU IDs: ['0']
(GPUActor pid=45113) ROCR_VISIBLE_DEVICES: 0,1
(GPUActor pid=45113) Number of GPUs: 2
(GPUActor pid=45113) Device 0: _CudaDeviceProperties(name='AMD Instinct MI210', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
(GPUActor pid=45113) Device 1: _CudaDeviceProperties(name='AMD Instinct MI210', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
(GPUActor pid=45113) #####
(gpu_task pid=45318) ##### task
(gpu_task pid=45318) GPU IDs: ['1']
(gpu_task pid=45318) ROCR_VISIBLE_DEVICES: 0,1
(gpu_task pid=45318) Number of GPUs: 2
(gpu_task pid=45318) Device 1: _CudaDeviceProperties(name='AMD Instinct MI210', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104) [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(gpu_task pid=45318) #####