
[autoscaler] request_resources with partial instance availability leads to workers never shutting down #11367

Open · mattearllongshot opened this issue 4 years ago

mattearllongshot commented 4 years ago

What is the problem?

Ray: 1.0.0
Python: 3.8.1
OS: Ubuntu 18.04

Requesting more resources than are available from the node provider results in idle nodes never being terminated. The issue relates to these lines in StandardAutoscaler._update:

        target_workers = self.target_num_workers()     # 1

        if len(nodes) >= target_workers:       # 2
            if "CPU" in self.resource_requests:
                del self.resource_requests["CPU"]
...
        nodes_to_terminate = []
        for node_id in nodes:
            node_ip = self.provider.internal_ip(node_id)
            if (node_ip in last_used and last_used[node_ip] < horizon) and \
                    (len(nodes) - len(nodes_to_terminate)   # 3
                     > target_workers):
                logger.info("StandardAutoscaler: "
                            "{}: Terminating idle node".format(node_id))
                nodes_to_terminate.append(node_id)

The problem is probably best illustrated with an example. Imagine request_resources(100) is called, our instances have 10 CPU each, but the node provider only has 5 instances available (i.e. 50 CPU total). target_num_workers() then returns 10, since that is how many nodes are required to meet the request. However, because the node provider only has capacity for 5 workers, the condition on the line commented with # 2 is never true. This means the resource request is never satisfied (and so never cleared), so target_workers stays at 10 while len(nodes) never exceeds 5. As a result, the condition at # 3 is never true and idle workers are never terminated.
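To make the arithmetic concrete, here is a small stand-alone sketch of the two checks with the numbers from the example above (the values and variable names are illustrative, not the real autoscaler code):

# Illustrative numbers only: 10-CPU instances, provider capped at 5 nodes.
requested_cpus = 100
cpus_per_node = 10
target_workers = requested_cpus // cpus_per_node    # target_num_workers() -> 10

nodes = list(range(5))          # the provider can only ever give us 5 nodes
nodes_to_terminate = []

# Check # 2: the request is only cleared once len(nodes) >= target_workers,
# which can never happen here (5 >= 10 is False), so target_workers stays at 10.
print(len(nodes) >= target_workers)                              # False

# Check # 3: idle nodes are only terminated while the node count exceeds the
# target, which is also never true (5 - 0 > 10 is False), so nothing is freed.
print(len(nodes) - len(nodes_to_terminate) > target_workers)     # False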

What is needed is some way (automated or manual) to cancel the resource request in this scenario. After discussing this within our team, we came up with the following three options:

Reproduction (REQUIRED)

On a node provider whose instances have 10 CPUs each, but with only 5 instances available:

import time
import ray
from ray.autoscaler.sdk import request_resources  # import path may differ on older Ray versions

ray.init(address="auto")  # connect to the running cluster

@ray.remote(num_cpus=1)
def task_func():
    time.sleep(60)

request_resources(100)   # Attempt to launch 10 worker nodes, i.e. 1 CPU per task
ray.get([task_func.remote() for _ in range(100)])  # Run the tasks

The autoscaler acquires 5 workers, but never shuts them down, even after all the tasks have run and the idle timeout has elapsed.
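A possible manual stopgap is to override the stale request with a smaller one once the work is done. The sketch below assumes a later request_resources() call replaces the earlier one (this is the documented behaviour of the current autoscaler SDK, but it has not been verified on 1.0.0, and the import path may differ there):

import ray
from ray.autoscaler.sdk import request_resources  # import path may differ on older Ray versions

ray.init(address="auto")
# Replace the outstanding 100-CPU request with a zero-CPU one so the
# autoscaler can clear it and the idle-node termination check can fire.
request_resources(num_cpus=0)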

If we cannot run your script, we cannot fix your issue.

rkooo567 commented 4 years ago

cc @wuisawesome

mattearllongshot commented 4 years ago

Sorry for the bump, but any thoughts on this?

wuisawesome commented 4 years ago

Thanks for the bump, I will try to take a closer look at this and get back to you.