ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Feature] Autoscaler to support scaling from zero based on `custom_resource` or `accelerator_type` #863

Open zoltan-fedor opened 1 year ago

zoltan-fedor commented 1 year ago

Description

Currently KubeRay's autoscaler can't scale worker groups up from zero based on their custom_resource or accelerator_type.

This is because the autoscaler learns about a worker group's custom_resource or accelerator_type from the Ray process running in that group. When the group is scaled to zero, no Ray process is running there, so the autoscaler never learns which custom_resource or accelerator_type the group provides, and therefore doesn't know that it is the group it needs to scale up from zero to satisfy such a request.

We had a long discussion about this with Miha Jenko on Slack at https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1673194077107389

Use case

Any use case that requests a custom_resource or accelerator_type from a worker group scaled from zero runs into this.

I have a use case where I occasionally need a specific accelerator type, so I have created a worker group (worker group B) for it with minimum scaling set to zero (I don't need that GPU at all times, only occasionally).

At the same time I have other worker groups (worker group A) with a different GPU, so when I need the occasional GPU I set accelerator_type in the Ray options to get it from worker group B. This works as long as worker group B scales from 1, because then the autoscaler knows which accelerator that group has.

Once I set worker group B's minimum scaling to zero, this no longer works, because the autoscaler doesn't know that worker group B has that accelerator type.

Setting the accelerator type as a custom resource in the Ray params doesn't help either, because Ray isn't running in worker group B until the group has been scaled up. It is a catch-22: only once the group gets scaled does the autoscaler learn that it was the group to scale.

The only way to solve this would be to supply the accelerator type to the Ray head directly, instead of expecting the worker group to self-report. The head would then know that the worker group serves that accelerator type (or custom resource) even while the group is scaled to zero. Something like: ray start --head --resources '{"workergroupname1:accelerator_type:T4": 1}'

Related issues

No response

zoltan-fedor commented 1 year ago

I was asked to post the KubeRay config and a reproduction script. To reproduce, you will need to create multiple worker groups with GPUs, plus at least one additional group with a custom accelerator type or custom resource that is set to scale from zero.

If you are testing based on accelerator type, you need multiple worker groups with GPUs so that the group is scaled based on accelerator_type rather than num_gpus.

Below is such a worker group setup, assuming you are using an AWS g5.xlarge instance, which has an NVIDIA A10G GPU (accelerator type: A10G):

    minReplicas: 0
    maxReplicas: 5
    groupName: gpu-g5-group
    rayStartParams:
      node-ip-address: $MY_POD_IP
      block: 'true'
      metrics-export-port: '8080'
    template:
      metadata:
        labels:
          key: value
        annotations:
          key: value
      spec:
        initContainers:
        - name: init-myservice
          image: busybox:1.28
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
        containers:
        - name: machine-learning
          image: ray:2.2.0-py310-gpu
          imagePullPolicy: IfNotPresent
          env:
          - name:  RAY_DISABLE_DOCKER_CPU_WARNING
            value: "1"
          - name: TYPE
            value: "worker"
          - name: CPU_REQUEST
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: requests.cpu
          - name: CPU_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: limits.cpu
          - name: MEMORY_LIMITS
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: limits.memory
          - name: MEMORY_REQUESTS
            valueFrom:
              resourceFieldRef:
                containerName: machine-learning
                resource: requests.memory
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          ports:
          - containerPort: 80
          - containerPort: 8080
            name: metrics
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "4"  # g5.xlarge
              memory: "15000Mi"  # g5.xlarge
              nvidia.com/gpu: "1"
            requests:
              cpu: "500m"
              memory: "512Mi"
              nvidia.com/gpu: "1"
        nodeSelector:
          nvidia.com/gpu: "true"
          lifecycle: "Ec2GPUOnDemandG5"
        tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"

Then you need to start a Ray deployment that requests that accelerator type (A10G) to observe the issue:

@ray.remote(num_gpus=1, accelerator_type='A10G')
class Foo:
    def method(self):
        return 1

What you will observe is that the autoscaler will not be able to scale up the gpu-g5-group worker group above.

If you change the worker group's setup to scale from 1 instead of 0, then the autoscaler will be able to scale that group based on requests for that accelerator type even if you submit multiple ray deployments asking for the 'A10G' accelerator. The issue is with scaling from zero.
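For completeness, here is a minimal driver that creates the demand and reproduces the hang with minReplicas: 0 (a sketch only; it assumes you run it against this cluster, for example from the head pod or via the job submission API):

    import ray

    ray.init(address="auto")  # connect to the existing cluster

    @ray.remote(num_gpus=1, accelerator_type="A10G")
    class Foo:
        def method(self):
            return 1

    foo = Foo.remote()                   # creates a pending demand for accelerator_type:A10G
    print(ray.get(foo.method.remote()))  # with minReplicas: 0 this hangs, because the autoscaler
                                         # never maps the demand to the gpu-g5-group worker group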

DmitriGekhtman commented 1 year ago

Ah, I see the issue -- KubeRay is not currently smart enough to detect the accelerator type ahead of time. Could you try this?

    rayStartParams:
      node-ip-address: $MY_POD_IP
      block: 'true'
      metrics-export-port: '8080'
      resources: '"{\"accelerator_type:A10G\": 1}"'
DmitriGekhtman commented 1 year ago

I think the docs don't adequately address how the accelerator_type argument is supposed to work and what its limitations are.

cc @wuisawesome @gvspraveen @kevin85421

zoltan-fedor commented 1 year ago

@DmitriGekhtman, I have already tried that and it didn't work.

It is not surprising, as setting the Ray start param should NOT make a difference for a worker group scaling from zero: until the group scales above zero nodes, Ray isn't actually running in it, so whatever Ray param is set shouldn't matter. That is exactly what I found: setting the Ray start param made no difference for the worker group scaling from zero.

zoltan-fedor commented 1 year ago

Actually, what I had tried was resources: "'{\"A10G\": 1}'", and that didn't work, which seems logical. I will also try resources: "'{\"accelerator_type:A10G\": 1}'" tomorrow, but again, logic dictates that it shouldn't work either for worker groups scaled from zero. But who knows, maybe I will be surprised. :-)

zoltan-fedor commented 1 year ago

@DmitriGekhtman, I have to eat my words - you were right!!!

Scaling the worker group from zero does work when using the right rayStartParams, such as resources: '"{\"accelerator_type:A10G\": 1}"'

I had assumed that it couldn't work, since when scaling from zero the Ray instance isn't running until the worker group has scaled up, so the rayStartParams shouldn't matter. But I tested it and it does in fact work! Not sure why, but it works. :-)

So in summary, this ticket is really just a documentation ticket: this behavior should be called out in the docs.

To summarize for those who are just looking for a solution: if you want to scale a worker group with a special GPU (like the NVIDIA A10G in the example below) from zero, define it as follows (note the resources entry in rayStartParams):

  - replicas: 0
    minReplicas: 0
    maxReplicas: 5
    groupName: gpu-g5-group
    rayStartParams:
      node-ip-address: $MY_POD_IP
      block: 'true'
      metrics-export-port: '8080'
      resources: '"{\"accelerator_type:A10G\": 1}"'
    template:
      ...
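(As a quick sanity check, a sketch assuming you run it from inside the cluster after submitting the A10G-constrained workload from earlier: once the autoscaler has brought up a gpu-g5-group worker, the declared custom resource should appear in the cluster's resource view.)

    import ray

    ray.init(address="auto")

    # After a gpu-g5-group worker has joined to satisfy the pending A10G request,
    # the custom resource declared in rayStartParams is part of the cluster view:
    print(ray.cluster_resources().get("accelerator_type:A10G"))  # expected: 1.0
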
sihanwang41 commented 1 year ago

Hi @zoltan-fedor, should a custom_resource without accelerator_type also work? If not, could you also post your config file? cc: @DmitriGekhtman

Change your code to:

@ray.remote(num_gpus=1, resources={"YourCustomResource": 1})
class Foo:
    def method(self):
        return 1
DmitriGekhtman commented 1 year ago

Using a custom resource without accelerator type would also work.

Under the hood, the accelerator_type argument uses a custom resource with a name of the form described above. The autoscaler, for its part, reads custom resource info from the RayCluster CR, which is why it is able to scale up in this case.
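For reference, a hedged sketch of that equivalence as I understand it in Ray 2.x: accelerator_type is translated into a request for a small fractional amount of a custom resource named accelerator_type:<TYPE>, so a plain custom resource declared in rayStartParams gives the autoscaler the same information.

    import ray

    # Requesting the accelerator type explicitly...
    @ray.remote(num_gpus=1, accelerator_type="A10G")
    class Foo:
        def method(self):
            return 1

    # ...is roughly equivalent to requesting the underlying custom resource that
    # Ray derives from it (the exact fractional amount is an implementation detail):
    @ray.remote(num_gpus=1, resources={"accelerator_type:A10G": 0.001})
    class Bar:
        def method(self):
            return 1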