pinduzera opened this issue 5 months ago (status: Open)
As an additional note: it seems that if you try to launch new pods while the previous ones are in the Terminating state, the newer ones will hang as well and never spawn.
@pinduzera,
Could you try to remove
env:
- name: RAY_enable_autoscaler_v2 # Pass env var for the autoscaler v2.
value: "1"
This enables the experimental autoscaler v2, which might still have some bugs. Removing it will fall back to the default autoscaler v1.
That worked, thanks. I had copied the example from the repo and missed that it was using an alpha autoscaler. I thought v2 was the standard one since it was the example.
Thanks :)
What happened + What you expected to happen
I'm not sure whether this belongs here or under the KubeRay project.
What should happen: when using the autoscaler on Azure Kubernetes Service (AKS), it should create new worker pods in the Pending state (waiting for the nodes to be provisioned). I've documented this in detail in this Ray post.
What happened: usually 1 or 2 worker pods don't spawn.
After all the nodes finish running (cancelled/killed),
ray status
gives me the following output; you will notice that it is still pending 1 worker launch even after the job finished. The request seems to have never reached the Kubernetes cluster, or Ray doesn't know that it never reached it.
kubectl get pods
doesn't show any worker pod in the Pending state, which is expected after the job finished.
Versions / Dependencies
Ray 2.21.0, Python 3.10.14 (the versions used in the example)
Ray 2.23.0, Python 3.11.7 (also tested with these versions)
This is my YAML file, which is a variation of the one found in the KubeRay project, with some anti-affinity added to guarantee that workers never share the same node.
(You probably don't need the volume claim and mount, but I'll leave them in since that's how I was using it.)
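For reference, the anti-affinity portion of the worker group looks roughly like the sketch below. The group name and the app: ray-worker label are illustrative placeholders rather than the exact values from my manifest; the idea is a required podAntiAffinity rule keyed on the node hostname so that two Ray worker pods can never be scheduled onto the same node.

workerGroupSpecs:
  - groupName: workergroup          # illustrative name
    template:
      metadata:
        labels:
          app: ray-worker           # hypothetical label the rule below matches on
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: ray-worker                  # must match the worker pod label
                topologyKey: kubernetes.io/hostname  # at most one worker pod per node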
Reproduction script
The goal was to spawn some actors (or placement groups) and "hot start" all the nodes before doing any computation. But if I start the actors too quickly, one or two workers will never spawn (and can't be seen under
kubectl get pods
). So I've added a workaround: wait 3 seconds before creating the next one, which seems to keep the Ray autoscaler from failing.
Each node has 8 CPUs, so asking for 8 CPUs fills a single worker (and a single node).
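The script itself isn't included above, but the pattern was roughly the sketch below (the actor name, the worker count, and the connection setup are illustrative, not my actual code): each actor asks for 8 CPUs so it fills one worker/node, and the time.sleep(3) between creations is the workaround without which one or two workers never spawn.

import time
import ray

ray.init()  # connect to the running cluster (e.g. from the head pod or via the job API)

@ray.remote(num_cpus=8)  # 8 CPUs = one whole worker node in this setup
class Warmup:
    def ping(self):
        return "ok"

NUM_WORKERS = 10  # illustrative: one actor per expected worker node

actors = []
for _ in range(NUM_WORKERS):
    actors.append(Warmup.remote())  # each new actor should trigger a scale-up request
    time.sleep(3)                   # workaround: without this pause, 1-2 workers never appear

# Block until every actor is scheduled, i.e. all nodes are "hot" before the real work starts.
ray.get([a.ping.remote() for a in actors])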
Issue Severity
Medium: It is a significant difficulty but I can work around it.