vara-bonthu opened this issue 6 months ago
@vara-bonthu Can you scope this down to just what's missing/wrong in Ray Serve? Is this really an issue that requires changing Ray code? Or, if you already know the fix, feel free to contribute it to the codebase.
Does Ray for Neuron support autoscaling based on Neuron devices, represented by the device plugin as aws.amazon.com/neuron?
@vara-bonthu those are great findings! If you are developing on a Mac, you can follow these instructions to set it up locally: https://docs.ray.io/en/master/ray-contribute/development.html#building-ray-on-linux-macos-full
If this has to go onto a cluster, I think you can raise a draft PR and one of the CI steps will generate a wheel that you can use to build the Docker image for testing. This is an example of such a build that generates the wheel.
Any updates? I have the same problem as @vara-bonthu.
I am using rayVersion 2.20.0 + Python 3.10 running on an AWS inf2.8xlarge, not using Karpenter.
My deployment file:
```yaml
kind: RayService
metadata:
  name: llm
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: neuron-deployment
        route_prefix: /
        import_path: vllm_deploy:vllm_app
        runtime_env:
          env_vars:
            NEURON_CC_FLAGS: "-O1"
        deployments:
          - name: neuron_model
            max_ongoing_requests: 100
            max_queued_requests: -1
            autoscaling_config:
              min_replicas: 1
              initial_replicas: 2
              max_replicas: 6
              target_num_ongoing_requests_per_replica: 2.0
              target_ongoing_requests: 1.0
              metrics_interval_s: 0.2
              look_back_period_s: 2
              smoothing_factor: 1.0
              downscale_delay_s: 80.0
              upscale_delay_s: 2
            ray_actor_options:
              num_cpus: 5
              resources: {"neuron_cores": 2}
  rayClusterConfig:
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      idleTimeoutSeconds: 60
      imagePullPolicy: IfNotPresent
    rayVersion: 2.20.0
    headGroupSpec:
      serviceType: ClusterIP
      headService:
        metadata:
          name: llm-raycluster-head-svc
      rayStartParams:
        dashboard-host: "0.0.0.0"
        num-cpus: "0"
        num-gpus: "0"
      template:
        spec:
          serviceAccountName: llmray
          containers:
            - name: ray-head
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              image: url/llm:ray2.20.0-py310-inf2-latest
              imagePullPolicy: "Always"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /home/ray/netrcvolume/
                  name: netrc-kuberay
                  readOnly: true
                - mountPath: /tmp/ray
                  name: ray-logs
                - mountPath: /home/ray/samples
                  name: raycluster-autoscaler
              resources:
                limits:
                  cpu: 2
                  memory: 4G
                requests:
                  cpu: 2
                  memory: 4G
              env:
                - name: NETRC
                  value: "/home/ray/netrcvolume/.netrc"
          restartPolicy: "Always"
          volumes:
            - name: netrc-kuberay
              secret:
                secretName: terraform-llmrayservice-netrc-secret
            - configMap:
                defaultMode: 511
                items:
                  - key: detached_actor.py
                    path: detached_actor.py
                  - key: terminate_detached_actor.py
                    path: terminate_detached_actor.py
                name: raycluster-autoscaler
              name: raycluster-autoscaler
            - emptyDir: {}
              name: ray-logs
    workerGroupSpecs:
      - groupName: inf2-worker
        minReplicas: 1
        maxReplicas: 8
        replicas:
        rayStartParams: {}
        template:
          spec:
            serviceAccountName: llmray
            containers:
              - name: ray-worker
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                image: url/llm:ray2.20.0-py310-inf2-latest
                imagePullPolicy: "Always"
                volumeMounts:
                  - mountPath: /home/ray/netrcvolume/
                    name: netrc-kuberay
                    readOnly: true
                resources:
                  limits:
                    aws.amazon.com/neuron: "1"
                    cpu: "25"
                    memory: 110G
                  requests:
                    aws.amazon.com/neuron: "1"
                    cpu: "25"
                    memory: 110G
                env:
                  - name: NETRC
                    value: "/home/ray/netrcvolume/.netrc"
            restartPolicy: "Always" # Set "Never" if use AutoscalerV2: Prevent container restart to maintain Ray health.
            volumes:
              - name: netrc-kuberay
                secret:
                  secretName: terraform-llmrayservice-netrc-secret
            nodeSelector:
              Dedicated: shared-dev-llm
            tolerations:
              - effect: NoSchedule
                key: dedicated
                operator: Equal
                value: shared-dev-llm
```
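For context on how the serveConfigV2 above maps to code: import_path: vllm_deploy:vllm_app points at a Serve application object that is not included in this thread. Below is a minimal, hypothetical sketch of what such a module might look like; the model loading and inference logic are placeholders, not the actual deployment.

```python
# Hypothetical sketch of the module referenced by `import_path: vllm_deploy:vllm_app`
# above; the real module is not shown in this thread, and model loading /
# inference are placeholders.
from ray import serve
from starlette.requests import Request


@serve.deployment(
    name="neuron_model",
    ray_actor_options={
        "num_cpus": 5,
        # The custom resource each replica asks for; the Serve autoscaler turns
        # this into resource demands that the Ray autoscaler must satisfy.
        "resources": {"neuron_cores": 2},
    },
)
class NeuronModel:
    def __init__(self):
        # Placeholder: compile/load the model onto the Neuron cores here.
        self.model = None

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # Placeholder: run inference on the Neuron device.
        return {"prompt": payload.get("prompt"), "output": "..."}


# The bound application object that Serve imports.
vllm_app = NeuronModel.bind()
```

The resources: {"neuron_cores": 2} in ray_actor_options is what each new replica requests from the Ray autoscaler, which is where the scaling problem discussed below comes in.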
I think the issue may be with the particular combo of autoscaling, AWS accelerators, and KubeRay. @kevin85421 @rickyyx might be well-equipped to help out here. @anyscalesam this is turning out to be a bit of a thorn for Ray Serve users on AWS.
We need to generalize this bit of KubeRay code https://github.com/ray-project/kuberay/blob/cf41e24d449969632d231e29f394d29f8548bb89/ray-operator/controllers/ray/common/pod.go#L746-L755
I believe the relevant accelerator interfaces have been designed and implemented for Ray OSS EC2 support? Looks like there's a gap for Kubernetes.
@DmitriGekhtman wanna take a crack at it? No need for a formal REP for this, I think; just slap a Google doc together and we can take a look.
No guarantees, but will take a closer look if/when some time presents itself.
Discussed today during the biweekly Ray contributor sync-up: @DmitriGekhtman will pick this up. cc @GeneDer
Ok, I understand the issue in a bit more detail -- there is logic in the codebase for inferring neuron counts from the file system (https://github.com/ray-project/ray/blob/master/python/ray/_private/accelerators/neuron.py#L57) and from the EC2 autoscaling config (https://github.com/ray-project/ray/blob/master/python/ray/_private/accelerators/neuron.py#L24).
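For illustration only (this is not the actual neuron.py logic): the filesystem-based path amounts to counting the Neuron devices exposed under /dev and multiplying by the cores per device, roughly:

```python
import glob

# Simplified illustration of filesystem-based detection; this is NOT the exact
# logic in ray/_private/accelerators/neuron.py. Assumes each Neuron device is
# exposed as /dev/neuron<N> and that a device carries 2 NeuronCores, which is
# the case for inf2/trn1 hardware.
NEURON_CORES_PER_DEVICE = 2


def detect_neuron_cores() -> int:
    devices = glob.glob("/dev/neuron*")
    return len(devices) * NEURON_CORES_PER_DEVICE


if __name__ == "__main__":
    print(f"Detected neuron_cores: {detect_neuron_cores()}")
```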
Unfortunately, Ray's support for custom accelerators does not currently extend to detecting the presence of these accelerators from Kubernetes extended resources in KubeRay pod configs, as in aws.amazon.com/neuron: "1".
As I mentioned before, this would require an extension of the KubeRay logic that handles gpu extended resources here https://github.com/ray-project/kuberay/blob/cf41e24d449969632d231e29f394d29f8548bb89/ray-operator/controllers/ray/common/pod.go#L746-L755.
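Conceptually, the generalization would replace the GPU special case with a lookup from Kubernetes extended resource names to Ray resource names. The real change would live in the KubeRay Go operator linked above; the Python below is only an illustration of the idea, and the aws.amazon.com/neuron to neuron_cores mapping shown is an assumption.

```python
# Illustration only: the real logic lives in the KubeRay operator (Go), at the
# pod.go link above. This sketches what "generalize the GPU extended-resource
# handling" means: map Kubernetes extended resource names found in a
# container's limits onto Ray resource names.
KNOWN_ACCELERATORS = {
    "nvidia.com/gpu": "GPU",
    # Assumed mapping for this issue. Note that one Neuron device exposes
    # multiple NeuronCores, so a real implementation may need a per-device
    # core multiplier rather than a 1:1 mapping.
    "aws.amazon.com/neuron": "neuron_cores",
}


def ray_resources_from_limits(limits: dict) -> dict:
    """Convert a container's resource limits into Ray custom resources."""
    resources = {}
    for k8s_name, ray_name in KNOWN_ACCELERATORS.items():
        if k8s_name in limits:
            resources[ray_name] = int(limits[k8s_name])
    return resources


print(ray_resources_from_limits({"aws.amazon.com/neuron": "1", "cpu": "25"}))
# {'neuron_cores': 1}
```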
This would be possible to achieve with some work. However, there is a workaround that can be used to unblock this use-case now:
You can specify that each worker in the group has {"neuron_cores": 1} by specifying this data in the rayStartParams for the neuron core worker group, as detailed in the docs here: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#id1
The configuration looks like this:
```yaml
rayStartParams:
  resources: '"{\"neuron_cores\": 1}"'
```
That will signal the availability of neuron_cores to the Ray autoscaler.
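Once the workaround is applied, a quick way to confirm the custom resource is actually advertised is to inspect the cluster resources from a driver or the head pod, for example:

```python
# Quick sanity check from a driver or the head pod: the custom "neuron_cores"
# resource advertised via rayStartParams should show up in the cluster's
# resource totals, which is what the autoscaler matches Serve's requests against.
import ray

ray.init(address="auto")
print(ray.cluster_resources())    # should contain 'neuron_cores'
print(ray.available_resources())  # currently unclaimed portion
```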
The UX here is regrettably hideous, but it should work! @vara-bonthu Let us know if this solves the issue.
For better UX with Kubernetes extended resources other than GPUs, feel free to open a Ray or KubeRay issue.
@DmitriGekhtman Thanks for investigating this issue. I can confirm that your solution worked, and with this change, I can see the new nodes starting correctly 🥳
For anyone looking for the solution, you can refer to the Data on EKS pattern file in this PR. Please note that this PR is still pending merge, as we are running additional tests to finalize the changes.
I'll close this issue once all tests are complete. In the meantime, the fix involves adding rayStartParams to workerGroupSpecs as shown below:
```yaml
workerGroupSpecs:
  - groupName: inf2-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 3
    rayStartParams:
      # This setting is critical for inf2/Trn1 node autoscaling with RayServe.
      resources: '"{\"neuron_cores\": 2}"'
```
Glad it's working.
Still, it would be nice if the KubeRay code were modified to autodetect from the Kubernetes resources. cc @jjyao who did similar work for the EC2 cluster launcher
@DmitriGekhtman I have extended the code you pointed out to support neuron cores. Can you take a look?
Added some comments on the review.
@DmitriGekhtman addressed the comments. Please review whenever you have time.
What happened + What you expected to happen
What Happened:
When deploying models using RayServe with autoscaling enabled on Amazon EKS, specifically across multiple inf2 nodes, the system scales correctly within a single inf2.24xlarge instance up to 6 replicas. However, beyond 6 replicas, RayServe fails to request new worker nodes, which should trigger Karpenter to provision additional nodes for placing new pods. This issue occurs despite auto-scaling being seemingly configured correctly to handle such scaling.
Logs Indicating the Issue:
The logs suggest that Ray's autoscaler could not find a suitable node type to satisfy the resource demands, despite the available Neuron device resources.
Possible Contributing Factors: The rayStartParams entry specifies resources in a way (resources: {"neuron_cores": 2}) that might not align perfectly with the resource tags added by the Neuron device plugin to the nodes managed by Karpenter.
What You Expected to Happen:
Expected Behavior: RayServe's autoscaling should seamlessly request new worker nodes when the demand exceeds the capacity of the current nodes, especially in scenarios where more than 6 replicas are needed. Karpenter should then be able to provision new nodes based on the resource requests from RayServe, allowing for the continuous scaling of model deployments without manual intervention.
Seamless Integration and Scaling: Given the configuration and resources available, especially with Neuron devices on Inf2 instances, I expected a smooth scaling experience that leverages the Neuron core resources effectively across multiple nodes, allowing for a greater number of model replicas to be deployed and managed dynamically based on load.
Additional Information:
Deployment Configuration: The issue arises with a specific RayServe configuration designed for deploying models on Amazon EKS with Inf2 instances. The configuration details can be found at RayServe Configuration for Stable Diffusion on EKS.
Potential Misalignment with Neuron Device Plugin and Karpenter: The issue might stem from how Neuron device resources are tagged and utilized by Karpenter in response to RayServe's resource requests, suggesting a potential area for troubleshooting and adjustment.
Versions / Dependencies
rayproject/ray:2.9.0-py310. Check the Dockerfile here: https://github.com/awslabs/data-on-eks/blob/main/ai-ml/trainium-inferentia/examples/inference/ray-serve/stable-diffusion-inf2/Dockerfile
Reproduction script
Steps to Reproduce:
Deploy Infrastructure and RayServe Model Inference:
Follow the instructions provided in the blueprint for deploying the infrastructure and RayServe model inference for Stable Diffusion. This comprehensive guide is available at the following URL: Deploying StableDiffusion Model Inference on EKS. This guide outlines the necessary steps to set up Amazon EKS, configure Karpenter, deploy RayServe, and prepare the model for inference.
Generate concurrent requests:
Utilize Postman to simulate multiple concurrent requests to the deployed RayServe model endpoint. The objective is to create a workload that triggers the auto-scaling behavior of RayServe, necessitating the scaling of replicas and, consequently, the provisioning of additional nodes by Karpenter.
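If Postman is not handy, a small script along these lines can generate the concurrent load instead; the endpoint URL and payload here are placeholders based on the Stable Diffusion example, not the exact API.

```python
# Hypothetical load generator as an alternative to Postman: fire many
# concurrent requests at the Serve endpoint. The URL and payload are
# assumptions and will differ per deployment.
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://<ingress-or-service>:8000/imagine"  # placeholder URL
PROMPT = {"prompt": "a photo of an astronaut riding a horse"}


def send_request(i: int) -> int:
    resp = requests.post(ENDPOINT, json=PROMPT, timeout=300)
    return resp.status_code


with ThreadPoolExecutor(max_workers=50) as pool:
    # A few hundred overlapping requests should push ongoing requests per
    # replica above the autoscaling target and force Serve to add replicas.
    for status in pool.map(send_request, range(200)):
        print(status)
```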
Monitor Logs for Scaling Activity:
Keep an eye on the Ray dashboard and Karpenter logs to observe the scaling behavior. The expectation is for the number of replicas to increase in response to the simulated demand, leading to Karpenter being prompted to provision new nodes to accommodate the additional replicas.
Identify autoscaling limitations:
The critical point of observation is when the number of replicas reaches 6. Beyond this point, note whether RayServe attempts to scale beyond the existing node capacity and if Karpenter responds by provisioning additional nodes. The failure to do so underlines the issue being reported.
Expected Outcome: The infrastructure and RayServe deployment should scale seamlessly in response to increased demand, with Karpenter provisioning new nodes as required to host the additional replicas.
Issue Severity
High: It blocks me from completing my task.