ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

Statistics of other types of gpu #2484

Open zjj2wry opened 3 weeks ago

zjj2wry commented 3 weeks ago

Description

https://github.com/ray-project/kuberay/blob/33ba38546ddda0fa9121df20b357a13d47bb90d9/ray-operator/controllers/ray/raycluster_controller.go#L1623

When using Aliyun Kubernetes GPU sharing, the GPU resource key is aliyun.com/gpu-mem:

    workerGroupSpecs:
            resources:
              limits:
                aliyun.com/gpu-mem: "1"
                cpu: "1"
                memory: 2Gi
              requests:
                aliyun.com/gpu-mem: "1"
                cpu: "1"
                memory: 2Gi

The autoscaler does not work when a task requests a GPU resource:

(autoscaler +3m13s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

Code to reproduce:

import ray

ray.init()

# Request one GPU so the autoscaler has to provision a GPU worker.
@ray.remote(num_gpus=1)
def gpu_task():
    import torch
    x = torch.rand(10000, 10000).cuda()  # allocate the tensor on the GPU
    y = torch.mm(x, x)
    return y.sum().item()

future = gpu_task.remote()
result = ray.get(future)

print("Result:", result)

ray.shutdown()

Use case

No response

Related issues

none

win5923 commented 3 weeks ago

Perhaps using strings.Contains could be a better way.
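
To make this suggestion concrete, here is a minimal sketch in Python (chosen to match the reproduction script; the operator itself is Go, where the equivalent call is strings.Contains). It assumes the current detection is a suffix check on "gpu", which is an assumption about the linked line and is not confirmed in this thread. Both function names are hypothetical.

```python
def matches_by_suffix(resource_key: str) -> bool:
    """Assumed current behavior: only keys ending in 'gpu' count as GPUs."""
    return resource_key.endswith("gpu")


def matches_by_substring(resource_key: str) -> bool:
    """Suggested behavior: any key containing 'gpu' counts as a GPU."""
    return "gpu" in resource_key


# Compare both checks on a few resource keys.
for key in ("nvidia.com/gpu", "aliyun.com/gpu-mem", "cpu"):
    print(key, matches_by_suffix(key), matches_by_substring(key))
```

Under that assumption, aliyun.com/gpu-mem fails the suffix check but passes the substring check. The substring check is broader, so it would also accept any other key that merely contains "gpu".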

zjj2wry commented 2 weeks ago

https://github.com/ray-project/ray/blob/ba41ae99097c30cac2dd62e263bbe0b7b9bffc95/python/ray/autoscaler/_private/kuberay/autoscaling_config.py#L346-L351

By setting num-gpus, I was able to fix the problem of GPU workers not scaling up automatically. desireGPU is only used for display purposes.
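
The workaround can be sketched roughly as follows. This is an illustrative approximation, not the code at the linked lines: the function name resolve_num_gpus, its signature, and the suffix-based fallback are all assumptions made for the example.

```python
from typing import Dict


def resolve_num_gpus(ray_start_params: Dict[str, str],
                     resource_limits: Dict[str, str]) -> int:
    """Hypothetical helper: decide how many GPUs a worker group advertises."""
    # An explicit num-gpus setting wins over anything detected from the pod spec.
    if "num-gpus" in ray_start_params:
        return int(ray_start_params["num-gpus"])
    # Assumed fallback: use the first limit whose key ends in "gpu".
    for key, quantity in resource_limits.items():
        if key.endswith("gpu"):
            return int(quantity)
    # Nothing recognized as a GPU.
    return 0


limits = {"aliyun.com/gpu-mem": "1", "cpu": "1", "memory": "2Gi"}
print(resolve_num_gpus({}, limits))                 # no explicit setting
print(resolve_num_gpus({"num-gpus": "1"}, limits))  # explicit setting
```

With an explicit num-gpus value the key name of the container resource no longer matters, which would explain why the autoscaler can then satisfy the {'GPU': 1.0} request.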

andrewsykim commented 2 weeks ago

I suggest adding these to the list of well-known accelerators instead of using regex to parse GPU counts: https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L41-L43
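
As a rough illustration of the allowlist idea, here is a Python sketch (the real list lives in the operator's Go code). The table contents, the name WELL_KNOWN_ACCELERATORS, and the mapping of aliyun.com/gpu-mem to the Ray resource "GPU" are hypothetical and only show the lookup pattern.

```python
from typing import Dict

# Hypothetical allowlist: container resource key -> Ray resource name.
WELL_KNOWN_ACCELERATORS: Dict[str, str] = {
    "nvidia.com/gpu": "GPU",
    "aliyun.com/gpu-mem": "GPU",
}


def to_ray_resources(container_limits: Dict[str, int]) -> Dict[str, int]:
    """Translate only explicitly listed accelerator keys; ignore the rest."""
    ray_resources: Dict[str, int] = {}
    for key, quantity in container_limits.items():
        ray_name = WELL_KNOWN_ACCELERATORS.get(key)
        if ray_name is not None:
            # Sum quantities that map to the same Ray resource name.
            ray_resources[ray_name] = ray_resources.get(ray_name, 0) + quantity
    return ray_resources


print(to_ray_resources({"aliyun.com/gpu-mem": 1, "cpu": 1}))
```

An exact-match table avoids pattern matching on key names entirely, at the cost of needing a new entry for each vendor-specific key.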