Closed: MKLepium closed this issue 1 month ago
After looking into it: Commit e38e576 introduced the evictionPolicy. However, as the log output shows, this property can only be set when spot instances are used. So in situations where someone is not using spot instances, Azure returns an error and the worker node fails to start.
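For illustration only, here is an ARM-style sketch of the relevant VM properties (this is not the actual template Ray ships; the field names follow the Azure Microsoft.Compute/virtualMachines schema). Azure only accepts evictionPolicy together with Spot priority:

```yaml
# Illustrative Azure VM properties (ARM schema), not Ray's actual template.
# Valid: a Spot VM may carry an eviction policy and a billing profile.
properties:
  priority: Spot
  evictionPolicy: Deallocate      # only allowed on Spot VMs
  billingProfile:
    maxPrice: -1                  # -1 = pay up to the on-demand price

# Invalid: a Regular (non-spot) VM must omit evictionPolicy entirely;
# sending it anyway is what makes the deployment, and therefore the
# worker node start, fail.
# properties:
#   priority: Regular
#   evictionPolicy: Deallocate    # -> Azure rejects the deployment
```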
What happened + What you expected to happen
When the autoscaler spins up workers in a Ray cluster, VM creation fails. The autoscaler keeps trying to start new virtual machines and, in the process, keeps creating public IPs and network interfaces until the limit is reached.
With the previous version, the following example, which sets neither the priority nor the billingProfile option, works as intended.
When running with the new latest version, I run into the following issue (this is output from monitor.log on the head node):
To clarify: This happens when the head node tries to spin up worker nodes.
My current workaround is to use the older version for the head image.
Instead of: head_image: "rayproject/ray-ml:latest-cpu"
Use: head_image: "rayproject/ray-ml:2.24.0-cpu"
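For context, a minimal sketch of where this change sits in the cluster YAML, assuming the usual docker section of the Azure example config (the key names are standard cluster-config keys; the image tags are the ones mentioned above):

```yaml
# Hypothetical docker section of the cluster config: only the head image
# is pinned back to 2.24.0; workers keep whatever image the config already uses.
docker:
  image: "rayproject/ray-ml:latest-cpu"
  head_image: "rayproject/ray-ml:2.24.0-cpu"   # workaround: pin the head node image
  container_name: "ray_container"
  pull_before_run: True
```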
Versions / Dependencies
The new version of the rayproject/ray-ml image, latest-cpu. I assume it was built with the new Ray version 2.30.
Reproduction script
I used the example azure-full.yaml script but disabled the priority option so that spot instances are NOT requested; the relevant change is sketched after the start command below. (I have not tested whether it still crashes when spot instances are requested.) Simply run the start command:
ray up -y azure-full.yaml
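For reference, a sketch of the relevant part of the config with the priority option disabled, following the layout of the public azure-full.yaml example (the exact VM size and image fields are illustrative):

```yaml
# Hypothetical worker node type from azure-full.yaml with the spot option
# commented out, so a Regular (non-spot) VM is requested.
available_node_types:
  ray.worker.default:
    min_workers: 0
    max_workers: 2
    resources: {"CPU": 2}
    node_config:
      azure_arm_parameters:
        vmSize: Standard_D2s_v3
        imagePublisher: microsoft-dsvm
        imageOffer: ubuntu-2004
        imageSku: 2004-gen2
        imageVersion: latest
        # priority: Spot    # disabled: do NOT request spot instances
```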
And wait for the autoscaler to attempt to start a VM.

Issue Severity
Medium: It is a significant difficulty but I can work around it.