ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.71k stars 5.73k forks

[Ray-Clusters] Azure Cluster Autoscaler failing to start worker nodes when using non-spot instances. #46198

Closed MKLepium closed 1 month ago

MKLepium commented 4 months ago

What happened + What you expected to happen

When spinning up workers in a Ray cluster using the autoscaler, node creation fails, and the autoscaler keeps trying to launch new virtual machines, creating public IPs and network interfaces in the process until the quota limit is reached.

In the previous version, the following example config, with neither the priority nor the billing profile option set, works as intended.

# optionally set priority to use Spot instances
#priority: Spot
# set a maximum price for spot instances if desired
#billingProfile:
#    maxPrice: -1

When running with the latest version, I run into the following issue (this is a log of monitor.log from the head node):

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/node_launcher.py", line 113, in _launch_node
    created_nodes = self.provider.create_node_with_resources_and_labels(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/node_provider.py", line 152, in create_node_with_resources_and_labels
    return self.create_node(node_config, tags, count)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 247, in create_node
    self._create_node(node_config, tags, count)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 300, in _create_node
    create_or_update(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/azure/core/tracing/decorator.py", line 94, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/azure/core/polling/_poller.py", line 270, in wait
    raise self._exception  # type: ignore
  File "/home/ray/anaconda3/lib/python3.9/site-packages/azure/core/polling/_poller.py", line 185, in _start
    self._polling_method.run()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/azure/core/polling/base_polling.py", line 772, in run
    raise HttpResponseError(response=self._pipeline_response.http_response, error=err) from err
azure.core.exceptions.HttpResponseError: (DeploymentFailed) At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.
Code: DeploymentFailed
Message: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.
Exception Details:  (BadRequest) {
      "error": {
        "code": "InvalidParameter",
        "message": "Eviction policy can be set only on Azure Spot Virtual Machines. For more information, see http://aka.ms/AzureSpot/errormessages.",
        "target": "billingProfile"
      }
    }

To clarify: This happens when the head node tries to spin up worker nodes.

The current workaround for me is to pin the head image to an older version: instead of head_image: "rayproject/ray-ml:latest-cpu", use head_image: "rayproject/ray-ml:2.24.0-cpu".
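Concretely, the workaround is a one-line change in the cluster YAML (the head_image field as used in Ray's Azure example config):

```yaml
# Pin the head node to the last known-good image instead of latest.
# head_image: "rayproject/ray-ml:latest-cpu"
head_image: "rayproject/ray-ml:2.24.0-cpu"
```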

Versions / Dependencies

The latest version of rayproject/ray-ml (latest-cpu). I assume it was built with the new Ray 2.30 release.

Reproduction script

I used the example azure-full script but disabled the priority option so that it does NOT request spot instances. (I have not tested whether it still crashes when spot instances are requested.) Simply run the start command ray up -y azure-full.yaml and wait for the autoscaler to attempt to start a VM.
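For reference, the relevant worker section of the cluster YAML looks roughly like this (field names follow Ray's Azure example config; exact values are illustrative, not from my setup):

```yaml
available_node_types:
  ray.worker.default:
    min_workers: 0
    max_workers: 2
    node_config:
      azure_arm_parameters:
        vmSize: Standard_D2s_v3
        # priority: Spot        # left commented out, i.e. on-demand VMs
```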

Issue Severity

Medium: It is a significant difficulty but I can work around it.

MKLepium commented 4 months ago

After looking into it: commit e38e576 introduced the evictionPolicy field. However, as the log output shows, this field can only be set on Azure Spot Virtual Machines. So when someone is not using spot instances, Azure returns an error and the worker node fails to start.
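A minimal sketch of what a fix could look like: only include the Spot-only fields when the node is actually requested as a Spot instance. The function and variable names here are hypothetical and follow the Azure ARM VM schema; this is not the actual Ray patch.

```python
def sanitize_vm_parameters(vm_parameters: dict) -> dict:
    """Drop Spot-only fields from non-Spot VM parameters.

    Azure rejects evictionPolicy/billingProfile on on-demand VMs
    with an InvalidParameter error, which is what this issue hits.
    """
    params = dict(vm_parameters)  # don't mutate the caller's config
    if params.get("priority") != "Spot":
        params.pop("evictionPolicy", None)
        params.pop("billingProfile", None)
    return params
```

With this guard, an on-demand config such as {"vmSize": "Standard_D2s_v3", "evictionPolicy": "Deallocate"} would be sent to Azure without the offending field, while a Spot config keeps its eviction policy and billing profile intact.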