zoltan-fedor opened this issue 1 year ago
I was asked to post the KubeRay config and a reproduction script. To reproduce, you will need to create multiple worker groups with GPUs and at least one additional group with a custom accelerator type or custom resource, which is set to scale from zero.

If you are testing based on accelerator type, then you need multiple worker groups with GPUs, so that the group is not scaled based on `num_gpus` but based on `accelerator_type`.

Below is such a worker group setup, assuming you are using an AWS g5.xlarge instance, which has an NVIDIA Tesla A10G GPU (accelerator type: `A10G`):
```yaml
minReplicas: 0
maxReplicas: 5
groupName: gpu-g5-group
rayStartParams:
  node-ip-address: $MY_POD_IP
  block: 'true'
  metrics-export-port: '8080'
template:
  metadata:
    labels:
      key: value
    annotations:
      key: value
  spec:
    initContainers:
      - name: init-myservice
        image: busybox:1.28
        command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
    containers:
      - name: machine-learning
        image: ray:2.2.0-py310-gpu
        imagePullPolicy: IfNotPresent
        env:
          - name: RAY_DISABLE_DOCKER_CPU_WARNING
            value: "1"
          - name: TYPE
            value: "worker"
          - name: CPU_REQUEST
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: requests.cpu
          - name: CPU_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: limits.cpu
          - name: MEMORY_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: limits.memory
          - name: MEMORY_REQUESTS
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: requests.memory
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
        ports:
          - containerPort: 80
          - containerPort: 8080
            name: metrics
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "ray stop"]
        resources:
          limits:
            cpu: "4"          # g5.xlarge
            memory: "15000Mi" # g5.xlarge
            nvidia.com/gpu: "1"
          requests:
            cpu: "500m"
            memory: "512Mi"
            nvidia.com/gpu: "1"
    nodeSelector:
      nvidia.com/gpu: "true"
      lifecycle: "Ec2GPUOnDemandG5"
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
```
Then you need to start a Ray deployment marked to use that accelerator type (`A10G`) to observe the issue:

```python
import ray

@ray.remote(num_gpus=1, accelerator_type='A10G')
class Foo:
    def method(self):
        return 1
```
What you will observe is that the autoscaler will not be able to scale up the `gpu-g5-group` worker group above.

If you change the worker group's setup to scale from 1 instead of 0, then the autoscaler is able to scale that group based on requests for that accelerator type, even if you submit multiple Ray deployments asking for the `A10G` accelerator. The issue is specifically with scaling from zero.
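To make the failure easy to see from the driver side, here is a minimal sketch (assuming the driver connects to the existing cluster with `ray.init(address="auto")`); with `minReplicas: 0` and no matching custom resource in `rayStartParams`, the actor below stays pending:

```python
import ray

ray.init(address="auto")  # assumes the driver runs against the existing cluster

# While gpu-g5-group is at zero replicas, no node advertises the
# accelerator_type:A10G resource, so it does not appear here:
print(ray.cluster_resources())

@ray.remote(num_gpus=1, accelerator_type="A10G")
class Foo:
    def method(self):
        return 1

foo = Foo.remote()
# The autoscaler does not know that gpu-g5-group could provide
# accelerator_type:A10G, so the group is never scaled up and this call
# stays pending.
print(ray.get(foo.method.remote()))
```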
Ah, I see the issue -- KubeRay is not currently smart enough to detect the accelerator type ahead of time. Could you try this?
```yaml
rayStartParams:
  node-ip-address: $MY_POD_IP
  block: 'true'
  metrics-export-port: '8080'
  resources: '"{\"accelerator_type:A10G\": 1}"'
```
I think the docs don't adequately address how the `accelerator_type` argument is supposed to work and what its limitations are.
cc @wuisawesome @gvspraveen @kevin85421
@DmitriGekhtman, I have already tried that and it didn't work.
It is not surprising, as setting the ray start param should NOT make a difference for a worker group scaling from zero: while the group is at zero nodes, Ray is not actually running on it, so whatever Ray parameter is set should not matter. That is exactly what I found - setting the ray start param did not make a difference for the worker group scaling from zero.
Actually, what I tried was `resources: "'{\"A10G\": 1}'"`, and that didn't work, which seems logical.

I will also try `resources: "'{\"accelerator_type:A10G\": 1}'"` tomorrow, but again, logic dictates that it shouldn't work either for worker groups scaled from zero. But who knows, maybe I will be surprised. :-)
@DmitriGekhtman, I have to eat my words - you were right!!!
Scaling the worker group from zero does work when using the right `rayStartParams`, such as `resources: '"{\"accelerator_type:A10G\": 1}"'`.

I had assumed that this couldn't work, since when scaling from zero that Ray instance is not running until the worker group has scaled up, so it shouldn't matter what the `rayStartParams` are - but I have tested it, and in fact it works!
Not sure why, but it does work. :-)
So in summary, this ticket is really just a documentation ticket to ensure that this is mentioned in the docs.
To summarize for those who are just looking for a solution: if you want to scale a worker group with a special GPU (like the NVIDIA Tesla A10G in the example below) from zero, then define it as follows (note the `resources` entry in the `rayStartParams`):
```yaml
- replicas: 0
  minReplicas: 0
  maxReplicas: 5
  groupName: gpu-g5-group
  rayStartParams:
    node-ip-address: $MY_POD_IP
    block: 'true'
    metrics-export-port: '8080'
    resources: '"{\"accelerator_type:A10G\": 1}"'
  template:
    ...
```
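As a quick sanity check that the autoscaler can now see the zero-scaled group, you can also ask it for the resource directly instead of submitting an actor. This is just a sketch using `ray.autoscaler.sdk.request_resources`, assuming a driver connected via `ray.init(address="auto")`; the bundle contents mirror the resources declared above:

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # assumes the driver runs against the existing cluster

# Ask the autoscaler for a bundle that only the gpu-g5-group can satisfy.
# With the resources entry in rayStartParams, this should bring the group up
# from zero even though no task or actor is queued.
request_resources(bundles=[{"GPU": 1, "accelerator_type:A10G": 1}])
```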
Hi @zoltan-fedor, should `custom_resource` without `accelerator_type` also work? If not, could you also post your config file? cc: @DmitriGekhtman
Change your code to:

```python
@ray.remote(num_gpus=1, resources={"YourCustomResource": 1})
class Foo:
    def method(self):
        return 1
```
Using a custom resource without accelerator type would also work.
Under the hood, the accelerator type argument uses a custom resource with a name of the form described above. On the other hand, the autoscaler reads custom resource info from the RayCluster CR. This is why we're able to scale up in this case.
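To illustrate that mapping on the Python side, here is a rough sketch of the equivalence; the exact quantity Ray attaches to the `accelerator_type:<name>` resource is an internal detail, so the `0.001` below is an assumption for illustration, not a contract:

```python
import ray

# Requesting an accelerator type ...
@ray.remote(num_gpus=1, accelerator_type="A10G")
class Foo:
    def method(self):
        return 1

# ... is roughly equivalent to requesting a small amount of the custom
# resource "accelerator_type:A10G" that nodes with that GPU advertise.
# The exact amount (0.001 here) is assumed, not guaranteed by Ray.
@ray.remote(num_gpus=1, resources={"accelerator_type:A10G": 0.001})
class Bar:
    def method(self):
        return 1
```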
Description
Currently KubeRay's autoscaler can't scale worker groups from zero based on their `custom_resource` or `accelerator_type`.

This is because the autoscaler learns about the `custom_resource` or `accelerator_type` from Ray running in that worker group. But when the worker group is set to scale from zero, there is no Ray running, so the autoscaler can't learn about the `custom_resource` or `accelerator_type` of the worker group, and it doesn't know that this is the worker group it needs to scale up from zero to satisfy the requested `custom_resource` or `accelerator_type`.

We had a long discussion about this with Miha Jenko on Slack at https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1673194077107389
Use case
You can see use cases for this with any `custom_resource` or `accelerator_type` request that must be served by a zero-scaled worker group.

I have a use case where I occasionally need a specific accelerator type, so I have created a worker group (worker group B) for it with scaling set to zero (I don't need that GPU at all times, only occasionally).

At the same time, I have some other worker groups (worker group A) with a different GPU, so when I need that occasional GPU, I use the `accelerator_type` option in Ray to get it from worker group B. This works as long as worker group B is scaled from 1, because then the autoscaler knows what accelerator that group has. Once I set the minimum scaling of worker group B to zero, this no longer works, as the autoscaler doesn't know that worker group B has that given accelerator type.

Setting the accelerator type as a custom resource in the ray params doesn't help, because Ray is not running on worker group B until the group is scaled up - so it is a catch-22: only once the group gets scaled does the autoscaler know that this is the group to scale.
The only way to solve this would be to supply the accelerator type to the Ray head directly, instead of expecting the worker group to self-report; then the head would know that the worker group serves that accelerator type (or custom resource), even when the group is scaled to zero. Something like:

```
ray start --head --resources '{"workergroupname1:accelerator_type:T4": 1}'
```
Related issues
No response
Are you willing to submit a PR?