ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Testing ray-service.text-summarizer with the ray-ml:2.7.0 Docker image: KubeRay does not support Ascend NPU parameters in rayStartParams #1984

Open chenpengxiang2015 opened 4 months ago

chenpengxiang2015 commented 4 months ago

Search before asking

KubeRay Component

apiserver

What happened + What you expected to happen

I'm using the demo ray-service.text-summarizer.yaml for testing.

I edited the workerGroupSpecs section of the YAML as follows:

```yaml
workerGroupSpecs:
  # The pod replicas in this group typed worker
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    groupName: gpu-group
    rayStartParams:
      resources: '{"NPU": 1}'
    # Pod template
    template:
      spec:
        nodeName: npu-1
        containers:
          - name: ray-worker
            image: registry.paas/cmss/rayproject/ray-ml:2.7.0
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
              - mountPath: /mnt
                name: zip
            resources:
              limits:
                cpu: 4
                memory: "16G"
                huawei.com/Ascend910: 1
              requests:
                cpu: 3
                memory: "12G"
                huawei.com/Ascend910: 1
…
```

When I apply this file with kubectl apply, the worker pod goes into CrashLoopBackOff status with this error:

```
kubectl --namespace ray-system logs pod/text-summarizer-raycluster-mzs2d-worker-gpu-group-kk2sr
Defaulted container "ray-worker" out of: ray-worker, wait-gcs-ready (init)
Usage: ray start [OPTIONS]
Try 'ray start --help' for help.

Error: Got unexpected extra argument (1})
```
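The `(1})` in the error suggests the JSON value is being split at its space. Below is a minimal sketch of that, assuming (not confirmed in this issue) that KubeRay turns each rayStartParams entry into a `--key=value` flag and runs the resulting `ray start` command through a shell without extra quoting. Python's `shlex.split` stands in for the shell's word splitting, and `render_ray_start` and `parse_resources` are hypothetical helpers for illustration only. The second case uses the escaped-quote form `'"{\"NPU\": 1}"'`, the style KubeRay's configuration docs show for custom `resources`; treat it as a workaround to try rather than a confirmed fix.

```python
import json
import shlex


def render_ray_start(params: dict[str, str]) -> str:
    """Hypothetical helper: join rayStartParams into one `ray start` command
    line as `--key=value` flags with no extra shell quoting (an assumption
    about how the operator builds the container command)."""
    flags = " ".join(f"--{key}={value}" for key, value in params.items())
    return f"ray start {flags}"


def parse_resources(argv: list[str]) -> dict:
    """Hypothetical helper: find --resources in the split argv and decode its JSON."""
    for arg in argv:
        if arg.startswith("--resources="):
            return json.loads(arg.split("=", 1)[1])
    raise ValueError("no --resources flag found")


# Value as written in the issue: resources: '{"NPU": 1}'
# (the YAML single quotes are consumed by the YAML parser, not passed on).
broken = render_ray_start({"resources": '{"NPU": 1}'})
broken_argv = shlex.split(broken)  # POSIX-style word splitting, like a shell
print(broken_argv)  # → ['ray', 'start', '--resources={NPU:', '1}']
# The space after the colon splits the JSON into two words and the unquoted
# double quotes are stripped, so `ray start` receives a stray "1}" argument.

# Escaped-quote form: resources: '"{\"NPU\": 1}"'
fixed = render_ray_start({"resources": '"{\\"NPU\\": 1}"'})
fixed_argv = shlex.split(fixed)
print(fixed_argv)  # → ['ray', 'start', '--resources={"NPU": 1}']
print(parse_resources(fixed_argv))  # → {'NPU': 1}
```

With the value as written in the issue, the split argv ends in a stray `1}`, which matches the `Got unexpected extra argument (1})` error; wrapping the JSON in escaped double quotes keeps it in a single argument.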

Reproduction script

Apply the modified ray-service.text-summarizer.yaml (with the workerGroupSpecs section shown above) using kubectl apply.

Anything else

No response

Are you willing to submit a PR?