
[autoscaler] request_resources with partial instance availability leads to workers never shutting down #11367

Open · mattearllongshot opened this issue 4 years ago

mattearllongshot commented 4 years ago

What is the problem?

Ray: 1.0.0
Python: 3.8.1
OS: Ubuntu 18.04

Requesting more resources than are available from the node provider results in idle nodes never being terminated. The issue relates to these lines in StandardAutoscaler._update:

        target_workers = self.target_num_workers()     # 1

        if len(nodes) >= target_workers:       # 2
            if "CPU" in self.resource_requests:
                del self.resource_requests["CPU"]
...
        nodes_to_terminate = []
        for node_id in nodes:
            node_ip = self.provider.internal_ip(node_id)
            if (node_ip in last_used and last_used[node_ip] < horizon) and \
                    (len(nodes) - len(nodes_to_terminate)   # 3
                     > target_workers):
                logger.info("StandardAutoscaler: "
                            "{}: Terminating idle node".format(node_id))
                nodes_to_terminate.append(node_id)

The problem is probably best illustrated with an example. Imagine request_resources(100) is called, our instances have 10 CPU each, but the node provider only has 5 instances available (i.e. 50 CPU total). target_num_workers() then returns 10, since that is how many nodes are required to meet the request. However, because the node provider only has capacity for 5 workers, the condition on the line commented with # 2 is never true. This means the resource request is never satisfied (and so never cleared), so target_workers stays at 10 while len(nodes) never exceeds 5. As a result, the condition at # 3 is never true and idle workers are never terminated.
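To make the arithmetic concrete, here is a small stand-alone sketch of the two checks with the numbers from the example above (the values and variable names are illustrative, not the real autoscaler code):

# Illustrative numbers only: 10-CPU instances, provider capped at 5 nodes.
requested_cpus = 100
cpus_per_node = 10
target_workers = requested_cpus // cpus_per_node    # target_num_workers() -> 10

nodes = list(range(5))          # the provider can only ever give us 5 nodes
nodes_to_terminate = []

# Check # 2: the request is only cleared once len(nodes) >= target_workers,
# which can never happen here (5 >= 10 is False), so target_workers stays at 10.
print(len(nodes) >= target_workers)                              # False

# Check # 3: idle nodes are only terminated while the node count exceeds the
# target, which is also never true (5 - 0 > 10 is False), so nothing is freed.
print(len(nodes) - len(nodes_to_terminate) > target_workers)     # False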

What is needed is some way (automated or manual) to cancel the resource request in this scenario. After discussing this within our team, we came up with the following three options:

Reproduction (REQUIRED)

On a node provider whose instances have 10 CPUs each, but with only 5 instances available:

import time
import ray
from ray.autoscaler.sdk import request_resources  # import path may differ on older Ray versions

ray.init(address="auto")  # connect to the running cluster

@ray.remote(num_cpus=1)
def task_func():
    time.sleep(60)

request_resources(100)   # Attempt to launch 10 worker nodes, i.e. 1 CPU per task
ray.get([task_func.remote() for _ in range(100)])  # Run the tasks

The autoscaler acquires 5 workers, but never shuts them down, even after all the tasks have run and the idle timeout has elapsed.
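A possible manual stopgap is to override the stale request with a smaller one once the work is done. The sketch below assumes a later request_resources() call replaces the earlier one (this is the documented behaviour of the current autoscaler SDK, but it has not been verified on 1.0.0, and the import path may differ there):

import ray
from ray.autoscaler.sdk import request_resources  # import path may differ on older Ray versions

ray.init(address="auto")
# Replace the outstanding 100-CPU request with a zero-CPU one so the
# autoscaler can clear it and the idle-node termination check can fire.
request_resources(num_cpus=0)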

If we cannot run your script, we cannot fix your issue.

rkooo567 commented 4 years ago

cc @wuisawesome

mattearllongshot commented 4 years ago

Sorry for the bump, but any thoughts on this?

wuisawesome commented 4 years ago

Thanks for the bump, I will try to take a closer look at this and get back to you.