vara-bonthu opened this issue 6 months ago
@vara-bonthu Can you scope this down to just what's missing/wrong in Ray Serve? Is this really an issue that requires changing Ray code? Or, if you already know the fix, feel free to contribute it to the codebase.
Does Ray for Neuron support autoscaling based on Neuron devices, represented by the device plugin as aws.amazon.com/neuron?
@vara-bonthu those are great findings! If you are developing on a Mac, you can follow these instructions to set it up locally: https://docs.ray.io/en/master/ray-contribute/development.html#building-ray-on-linux-macos-full
If this has to go onto a cluster, I think you can raise a draft PR and one of the CI steps will generate a wheel that you can use to build the Docker image for testing. This is an example of such a build that generates the wheel.
Any updates? I have the same problem as @vara-bonthu.
I am using rayVersion 2.20.0 + Python 3.10 running on an AWS inf2.8xlarge, not using Karpenter.
My deployment file:
```yaml
kind: RayService
metadata:
  name: llm
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: neuron-deployment
        route_prefix: /
        import_path: vllm_deploy:vllm_app
        runtime_env:
          env_vars:
            NEURON_CC_FLAGS: "-O1"
        deployments:
          - name: neuron_model
            max_ongoing_requests: 100
            max_queued_requests: -1
            autoscaling_config:
              min_replicas: 1
              initial_replicas: 2
              max_replicas: 6
              target_num_ongoing_requests_per_replica: 2.0
              target_ongoing_requests: 1.0
              metrics_interval_s: 0.2
              look_back_period_s: 2
              smoothing_factor: 1.0
              downscale_delay_s: 80.0
              upscale_delay_s: 2
            ray_actor_options:
              num_cpus: 5
              resources: {"neuron_cores": 2}
  rayClusterConfig:
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      idleTimeoutSeconds: 60
      imagePullPolicy: IfNotPresent
    rayVersion: 2.20.0
    headGroupSpec:
      serviceType: ClusterIP
      headService:
        metadata:
          name: llm-raycluster-head-svc
      rayStartParams:
        dashboard-host: "0.0.0.0"
        num-cpus: "0"
        num-gpus: "0"
      template:
        spec:
          serviceAccountName: llmray
          containers:
            - name: ray-head
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              image: url/llm:ray2.20.0-py310-inf2-latest
              imagePullPolicy: "Always"
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /home/ray/netrcvolume/
                  name: netrc-kuberay
                  readOnly: true
                - mountPath: /tmp/ray
                  name: ray-logs
                - mountPath: /home/ray/samples
                  name: raycluster-autoscaler
              resources:
                limits:
                  cpu: 2
                  memory: 4G
                requests:
                  cpu: 2
                  memory: 4G
              env:
                - name: NETRC
                  value: "/home/ray/netrcvolume/.netrc"
          restartPolicy: "Always"
          volumes:
            - name: netrc-kuberay
              secret:
                secretName: terraform-llmrayservice-netrc-secret
            - configMap:
                defaultMode: 511
                items:
                  - key: detached_actor.py
                    path: detached_actor.py
                  - key: terminate_detached_actor.py
                    path: terminate_detached_actor.py
                name: raycluster-autoscaler
              name: raycluster-autoscaler
            - emptyDir: {}
              name: ray-logs
    workerGroupSpecs:
      - groupName: inf2-worker
        minReplicas: 1
        maxReplicas: 8
        replicas:
        rayStartParams: {}
        template:
          spec:
            serviceAccountName: llmray
            containers:
              - name: ray-worker
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                image: url/llm:ray2.20.0-py310-inf2-latest
                imagePullPolicy: "Always"
                volumeMounts:
                  - mountPath: /home/ray/netrcvolume/
                    name: netrc-kuberay
                    readOnly: true
                resources:
                  limits:
                    aws.amazon.com/neuron: "1"
                    cpu: "25"
                    memory: 110G
                  requests:
                    aws.amazon.com/neuron: "1"
                    cpu: "25"
                    memory: 110G
                env:
                  - name: NETRC
                    value: "/home/ray/netrcvolume/.netrc"
            restartPolicy: "Always" # Set "Never" if use AutoscalerV2: Prevent container restart to maintain Ray health.
            volumes:
              - name: netrc-kuberay
                secret:
                  secretName: terraform-llmrayservice-netrc-secret
            nodeSelector:
              Dedicated: shared-dev-llm
            tolerations:
              - effect: NoSchedule
                key: dedicated
                operator: Equal
                value: shared-dev-llm
```
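For context on how the serveConfigV2 above maps to code: import_path: vllm_deploy:vllm_app points at a Serve application object that is not included in this thread. Below is a minimal, hypothetical sketch of what such a module might look like; the model loading and inference logic are placeholders, not the actual deployment.

```python
# Hypothetical sketch of the module referenced by `import_path: vllm_deploy:vllm_app`
# above; the real module is not shown in this thread, and model loading /
# inference are placeholders.
from ray import serve
from starlette.requests import Request


@serve.deployment(
    name="neuron_model",
    ray_actor_options={
        "num_cpus": 5,
        # The custom resource each replica asks for; the Serve autoscaler turns
        # this into resource demands that the Ray autoscaler must satisfy.
        "resources": {"neuron_cores": 2},
    },
)
class NeuronModel:
    def __init__(self):
        # Placeholder: compile/load the model onto the Neuron cores here.
        self.model = None

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # Placeholder: run inference on the Neuron device.
        return {"prompt": payload.get("prompt"), "output": "..."}


# The bound application object that Serve imports.
vllm_app = NeuronModel.bind()
```

The resources: {"neuron_cores": 2} in ray_actor_options is what each new replica requests from the Ray autoscaler, which is where the scaling problem discussed below comes in.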
I think the issue may be with the particular combo of autoscaling, AWS accelerators, and KubeRay. @kevin85421 @rickyyx might be well-equipped to help out here. @anyscalesam this is turning out to be a bit of a thorn for Ray Serve users on AWS.
We need to generalize this bit of KubeRay code https://github.com/ray-project/kuberay/blob/cf41e24d449969632d231e29f394d29f8548bb89/ray-operator/controllers/ray/common/pod.go#L746-L755
I believe the relevant accelerator interfaces have been designed and implemented for Ray OSS EC2 support? Looks like there's a gap for Kubernetes.
@DmitriGekhtman wanna take a crack at it? No need for a formal REP for this, I think; just slap a Google doc together and we can take a look.
No guarantees, but will take a closer look if/when some time presents itself.
Discussed today during the biweekly Ray contributor sync-up: @DmitriGekhtman will pick this up. cc @GeneDer
Ok, I understand the issue in a bit more detail -- there is logic in the codebase for inferring neuron counts from the file system (https://github.com/ray-project/ray/blob/master/python/ray/_private/accelerators/neuron.py#L57) and from the EC2 autoscaling config (https://github.com/ray-project/ray/blob/master/python/ray/_private/accelerators/neuron.py#L24).
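For illustration only (this is not the actual neuron.py logic): the filesystem-based path amounts to counting the Neuron devices exposed under /dev and multiplying by the cores per device, roughly:

```python
import glob

# Simplified illustration of filesystem-based detection; this is NOT the exact
# logic in ray/_private/accelerators/neuron.py. Assumes each Neuron device is
# exposed as /dev/neuron<N> and that a device carries 2 NeuronCores, which is
# the case for inf2/trn1 hardware.
NEURON_CORES_PER_DEVICE = 2


def detect_neuron_cores() -> int:
    devices = glob.glob("/dev/neuron*")
    return len(devices) * NEURON_CORES_PER_DEVICE


if __name__ == "__main__":
    print(f"Detected neuron_cores: {detect_neuron_cores()}")
```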
Unfortunately, Ray's support for custom accelerators does not currently extend to detecting the presence of these accelerators from Kubernetes extended resources in KubeRay pod configs, as in aws.amazon.com/neuron: "1".
As I mentioned before, this would require an extension of the KubeRay logic that handles gpu extended resources here https://github.com/ray-project/kuberay/blob/cf41e24d449969632d231e29f394d29f8548bb89/ray-operator/controllers/ray/common/pod.go#L746-L755.
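Conceptually, the generalization would replace the GPU special case with a lookup from Kubernetes extended resource names to Ray resource names. The real change would live in the KubeRay Go operator linked above; the Python below is only an illustration of the idea, and the aws.amazon.com/neuron to neuron_cores mapping shown is an assumption.

```python
# Illustration only: the real logic lives in the KubeRay operator (Go), at the
# pod.go link above. This sketches what "generalize the GPU extended-resource
# handling" means: map Kubernetes extended resource names found in a
# container's limits onto Ray resource names.
KNOWN_ACCELERATORS = {
    "nvidia.com/gpu": "GPU",
    # Assumed mapping for this issue. Note that one Neuron device exposes
    # multiple NeuronCores, so a real implementation may need a per-device
    # core multiplier rather than a 1:1 mapping.
    "aws.amazon.com/neuron": "neuron_cores",
}


def ray_resources_from_limits(limits: dict) -> dict:
    """Convert a container's resource limits into Ray custom resources."""
    resources = {}
    for k8s_name, ray_name in KNOWN_ACCELERATORS.items():
        if k8s_name in limits:
            resources[ray_name] = int(limits[k8s_name])
    return resources


print(ray_resources_from_limits({"aws.amazon.com/neuron": "1", "cpu": "25"}))
# {'neuron_cores': 1}
```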
This would be possible to achieve with some work. However, there is a workaround that can be used to unblock this use-case now:
You can specify that each worker in the group has {"neuron_cores": 1} by specifying this data in the rayStartParams for the neuron core worker group, as detailed in the docs here: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#id1
The configuration looks like this:
```yaml
rayStartParams:
  resources: '"{\"neuron_cores\": 1}"'
```
That will signal the availability of neuron_cores to the Ray autoscaler.
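Once the workaround is applied, a quick way to confirm the custom resource is actually advertised is to inspect the cluster resources from a driver or the head pod, for example:

```python
# Quick sanity check from a driver or the head pod: the custom "neuron_cores"
# resource advertised via rayStartParams should show up in the cluster's
# resource totals, which is what the autoscaler matches Serve's requests against.
import ray

ray.init(address="auto")
print(ray.cluster_resources())    # should contain 'neuron_cores'
print(ray.available_resources())  # currently unclaimed portion
```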
The UX here is regrettably hideous, but it should work! @vara-bonthu Let us know if this solves the issue.
For better UX with Kubernetes extended resources other than GPUs, feel free to open a Ray or KubeRay issue.
@DmitriGekhtman Thanks for investigating this issue. I can confirm that your solution worked, and with this change, I can see the new nodes starting correctly 🥳
For anyone looking for the solution, you can refer to the Data on EKS pattern file in this PR. Please note that this PR is still pending merge, as we are running additional tests to finalize the changes.
I'll close this issue once all tests are complete. In the meantime, the fix involves adding rayStartParams to workerGroupSpecs as shown below:
```yaml
workerGroupSpecs:
  - groupName: inf2-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 3
    rayStartParams:
      # This setting is critical for inf2/Trn1 node autoscaling with RayServe.
      resources: '"{\"neuron_cores\": 2}"'
```
Glad it's working.
Still, it would be nice if the KubeRay code were modified to autodetect from the Kubernetes resources. cc @jjyao who did similar work for the EC2 cluster launcher
@DmitriGekhtman I have extended the code you pointed out to support neuron cores. Can you take a look?
Added some comments on the review.
@DmitriGekhtman addressed the comments. Please review whenever you have time.
What happened + What you expected to happen
What Happened:
When deploying models using RayServe with autoscaling enabled on Amazon EKS, specifically across multiple inf2 nodes, the system scales correctly within a single inf2.24xlarge instance up to 6 replicas. However, beyond 6 replicas, RayServe fails to request new worker nodes, which should trigger Karpenter to provision additional nodes for placing new pods. This issue occurs despite auto-scaling being seemingly configured correctly to handle such scaling.
Logs Indicating the Issue:
The logs suggest that Ray's autoscaler could not find a suitable node type to satisfy the resource demands, despite the available Neuron device resources.
Possible Contributing Factors: The rayStartParams entry specifies resources in a way (resources: {"neuron_cores": 2}) that might not align perfectly with the resource tags added by the Neuron device plugin to the nodes managed by Karpenter.
What You Expected to Happen:
Expected Behavior: RayServe's autoscaling should seamlessly request new worker nodes when the demand exceeds the capacity of the current nodes, especially in scenarios where more than 6 replicas are needed. Karpenter should then be able to provision new nodes based on the resource requests from RayServe, allowing for the continuous scaling of model deployments without manual intervention.
Seamless Integration and Scaling: Given the configuration and resources available, especially with Neuron devices on Inf2 instances, I expected a smooth scaling experience that leverages the Neuron core resources effectively across multiple nodes, allowing for a greater number of model replicas to be deployed and managed dynamically based on load.
Additional Information:
Deployment Configuration: The issue arises with a specific RayServe configuration designed for deploying models on Amazon EKS with Inf2 instances. The configuration details can be found at RayServe Configuration for Stable Diffusion on EKS.
Potential Misalignment with Neuron Device Plugin and Karpenter: The issue might stem from how Neuron device resources are tagged and utilized by Karpenter in response to RayServe's resource requests, suggesting a potential area for troubleshooting and adjustment.
Versions / Dependencies
rayproject/ray:2.9.0-py310. Check the Dockerfile here: https://github.com/awslabs/data-on-eks/blob/main/ai-ml/trainium-inferentia/examples/inference/ray-serve/stable-diffusion-inf2/Dockerfile
Reproduction script
Steps to Reproduce:
Deploy Infrastructure and RayServe Model Inference:
Follow the instructions provided in the blueprint for deploying the infrastructure and RayServe model inference for Stable Diffusion. This comprehensive guide is available at the following URL: Deploying StableDiffusion Model Inference on EKS. This guide outlines the necessary steps to set up Amazon EKS, configure Karpenter, deploy RayServe, and prepare the model for inference.
Generate concurrent requests:
Utilize Postman to simulate multiple concurrent requests to the deployed RayServe model endpoint. The objective is to create a workload that triggers the auto-scaling behavior of RayServe, necessitating the scaling of replicas and, consequently, the provisioning of additional nodes by Karpenter.
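If Postman is not handy, a small script along these lines can generate the concurrent load instead; the endpoint URL and payload here are placeholders based on the Stable Diffusion example, not the exact API.

```python
# Hypothetical load generator as an alternative to Postman: fire many
# concurrent requests at the Serve endpoint. The URL and payload are
# assumptions and will differ per deployment.
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://<ingress-or-service>:8000/imagine"  # placeholder URL
PROMPT = {"prompt": "a photo of an astronaut riding a horse"}


def send_request(i: int) -> int:
    resp = requests.post(ENDPOINT, json=PROMPT, timeout=300)
    return resp.status_code


with ThreadPoolExecutor(max_workers=50) as pool:
    # A few hundred overlapping requests should push ongoing requests per
    # replica above the autoscaling target and force Serve to add replicas.
    for status in pool.map(send_request, range(200)):
        print(status)
```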
Monitor Logs for Scaling Activity:
Keep an eye on the Ray dashboard and Karpenter logs to observe the scaling behavior. The expectation is for the number of replicas to increase in response to the simulated demand, leading to Karpenter being prompted to provision new nodes to accommodate the additional replicas.
Identify autoscaling limitations:
The critical point of observation is when the number of replicas reaches 6. Beyond this point, note whether RayServe attempts to scale beyond the existing node capacity and if Karpenter responds by provisioning additional nodes. The failure to do so underlines the issue being reported.
Expected Outcome: The infrastructure and RayServe deployment should scale seamlessly in response to increased demand, with Karpenter provisioning new nodes as required to host the additional replicas.
Issue Severity
High: It blocks me from completing my task.