Open HCharlie opened 1 month ago
I noticed that the RayCluster head pod and worker pods are prepared sequentially. A node is provisioned for the head pod and its image is pulled; only once the head pod is ready and in Running status do the worker pods start to find a node and pull their image.
I don't think this is the current behavior. KubeRay creates the pods sequentially, but it doesn't wait for the head pod to become ready before creating the worker pods.
Here's a simple test I just ran:
$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Thanks for using kind! 😊
$ helm install kuberay-operator kuberay/kuberay-operator --version 1.2.1
NAME: kuberay-operator
LAST DEPLOYED: Tue Sep 24 14:13:39 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.complete.yaml
raycluster.ray.io/raycluster-complete created
$ kubectl get po
NAME                                           READY   STATUS              RESTARTS   AGE
kuberay-operator-84fb78dcfd-66bzx              1/1     Running             0          20s
raycluster-complete-head-jl8xp                 0/1     ContainerCreating   0          2s
raycluster-complete-small-group-worker-xnhlr   0/1     Init:0/1            0          1s
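One way to confirm that the operator does not wait for the head pod is to compare the pods' creation timestamps. The sketch below assumes GNU `date` (for `-d`); `creation_gap_seconds` is a hypothetical helper name, and the timestamps passed to it are illustrative values, not taken from the run above. The `kubectl` command in the comment relies on the `ray.io/cluster` label that KubeRay puts on the pods it creates.

```shell
# Sketch: seconds between two RFC 3339 timestamps (assumes GNU date for `-d`).
# creation_gap_seconds is a hypothetical helper, not part of kubectl or KubeRay.
creation_gap_seconds() {
  head_ts=$1
  worker_ts=$2
  echo $(( $(date -u -d "$worker_ts" +%s) - $(date -u -d "$head_ts" +%s) ))
}

# Against a live cluster, the timestamps could be listed with (not run here):
#   kubectl get pods -l ray.io/cluster=raycluster-complete \
#     -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp

# Illustrative values only: a worker pod created one second after the head pod.
creation_gap_seconds 2024-09-24T14:13:58Z 2024-09-24T14:13:59Z
```

A gap of a second or two would mean the worker pod was created while the head pod was still starting, which is what the AGE column above suggests.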
What version of KubeRay are you using?
Hi @andrewsykim, thanks for the reply. The kuberay-operator I am using should be v1.1.0; the image tag is 40a946a. Maybe my wording was not precise. What I notice is that only once the head pod is in Running status do the worker pods start to find instances and pull images; before that, the worker pods just stay in Pending status. Is there a way to parallelize this for the head pod and worker pods? Or is this not the case and I am observing something wrong, or is some configuration needed?
My setup has several EC2 instances provisioned by Karpenter; on each of them, a single worker pod takes up almost all the resources (CPU, GPU, memory).
Weird. I managed to run the example you shared locally on my MacBook, and it seems pretty fast to spin up both the head pod and the worker pod.
I think you are right. I checked again and noticed that the instances created for the head and worker pods are initialized together in the AWS console. Maybe the difference between instance types when pulling the image gave me the wrong impression that things were done sequentially. Thanks again.
Your worker nodes are likely using GPUs and larger instance types that might take longer to scale up and initialize. That could explain the later start-up time for your worker pods compared to the head pod. The head pod is usually CPU-only and can run on standard instance types.
That's exactly the case.
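To see where a slow worker pod actually spends its time, it may help to compare when each pod condition last changed relative to the pod's creation. This is a rough sketch assuming GNU `date`; `condition_offsets` is a hypothetical helper name, and the sample timestamps are made up for illustration. The input lines are expected in the form `TYPE<TAB>TIMESTAMP`.

```shell
# Sketch: report how many seconds after pod creation each condition last changed.
# $1 is the pod's creationTimestamp; stdin carries "TYPE<TAB>TIMESTAMP" lines.
# Assumes GNU date for `-d`; condition_offsets is a hypothetical helper name.
condition_offsets() {
  created_s=$(date -u -d "$1" +%s)
  tab=$(printf '\t')
  while IFS="$tab" read -r cond ts; do
    [ -n "$ts" ] || continue
    echo "$cond $(( $(date -u -d "$ts" +%s) - created_s ))"
  done
}

# Against a live cluster, the input lines could come from (not run here):
#   kubectl get pod <pod-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.lastTransitionTime}{"\n"}{end}'

# Illustrative values only: scheduled 60s after creation, ready after 210s.
printf 'PodScheduled\t2024-09-24T14:14:00Z\nReady\t2024-09-24T14:16:30Z\n' |
  condition_offsets 2024-09-24T14:13:00Z
```

A long delay before `PodScheduled` would point at node provisioning, while a long delay between `PodScheduled` and `Ready` would more likely reflect image pulls and container start-up on the new node.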
Search before asking
Description
Hi team,
I noticed that the RayCluster head pod and worker pods are prepared sequentially. A node is provisioned for the head pod and its image is pulled; only once the head pod is ready and in Running status do the worker pods start to find a node and pull their image. This sequential behavior doubles the time users have to wait. Is there a way to make it parallel for a better UX?
When doing this in the cloud, for example on AWS, a fresh start might take more than 20 minutes, depending on the instance type and image used.
Use case
Make the head pod and worker pods provision nodes and pull images in parallel.
Related issues
No response
Are you willing to submit a PR?