[Ray Autoscaler] Autoscaler kills working nodes unexpectedly

yx367563 commented 2 weeks ago

What happened + What you expected to happen

I have found that Ray autoscaler sometimes mistakenly kills some nodes that are working. My scenario is that 400 Ray Tasks are submitted at the same time, but not all of them can be allocated resources at the beginning, and I find that almost every time I start scaler down at the final stage, I get an error report ray.exceptions.NodeDiedError: Task failed due to the node dying. The reason I have troubleshooted so far is that the kill of a working node causes an exception, and then the whole job fails. I am currently using Ray 2.23.0, Kuberay 1.0.0, without autoscaler v2 enabled. Since the development of autoscaler v2 has been temporarily suspended, I'd like to ask if there is any solution, or if I can find out what is causing this and what the code logic is?

Versions / Dependencies

Ray 2.23.0 Kuberay 1.0.0

Reproduction script

You can reproduce my problem with the following code, be careful not to make all tasks executable at the same time, there needs to be a task in pending for the bug to be triggered. looking forward to your replies!

import ray
import time
import os
import random

@ray.remote(max_retries=5, num_cpus=8)
def inside_ray_task():
    sleep_time = random.randint(120, 300) # The node idleTimeoutSeconds is 60.

    start_time = time.perf_counter()
    while True:
        if(time.perf_counter() - start_time < sleep_time):
            time.sleep(0.001)
        else:
            break

@ray.remote(max_retries=0)
def outside_ray_task():
    future_list = []
    for i in range(100):
        future_list.append(inside_ray_task.remote())
    ray.get(future_list)

if __name__ == '__main__':
    ray.init()
    ray.get(outside_ray_task.remote())

Issue Severity

High: It blocks me from completing my task.

yx367563 commented 2 weeks ago

I have tried autoscaler v2 and the bug did go away, but it introduced other new problems, based on the fact that development of autoscaler v2 is currently on hold, so I'm wondering if there's a better solution?

anyscalesam commented 2 weeks ago

Thanks for reproing on ASv2 ... we'll take a look at this @yx367563

yx367563 commented 1 week ago

@anyscalesam Or I would also be happy to fix this issue in autoscaler v1. Where can I get the documentation for autoscaler v1? I feel it is difficult to find the problem just by reading the source code. Or you can share with me the possible reasons you can think of.

rynewang commented 1 week ago

@yx367563 can you provide a more detailed repro setup script including the scaling down part? We plan to repro it and try to fix.

Expected behavior: when a task is assigned to a node, in state "waiting for resources", and the node dead. The task should be reassigned to other nodes, and here it looks like they are marked dead.

rynewang commented 1 week ago

@kevin85421 can you try to repro this with @yx367563 ?

yx367563 commented 1 week ago

Of course, I set up a single worker with minWorkerNum of 2, maxWorkerNum of 1000, each worker requires 8 cpu, and idleTimeoutSeconds of 60. The cluster can only provide 24 workers at most, so if the above code is executed, at most 24 tasks can be executed at the same time, and the rest of the tasks will be in pending state. This bug is triggered almost every time the last tasks are scheduled and scale-down occurs. cc @rynewang @kevin85421

anyscalesam commented 5 days ago

@rynewang is P1 right for this; (OSS AS killing nodes unexpectedly feels like an important issue).