ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Can't scale up when using autoscaler v2 #2223

Open yx367563 opened 3 months ago

yx367563 commented 3 months ago

KubeRay Component

ray-operator

What happened + What you expected to happen

In the same environment, I changed only the autoscaler version (v1 or v2) and submitted 8000 tasks at once. v1 works normally, but v2 always gets stuck and cannot scale up.

Versions: Ray 2.23.0, KubeRay 1.1.1

Reproduction script

import random
import time

import ray


# Each inner task reserves 8 CPUs and stays busy for 2 to 10 minutes.
@ray.remote(max_retries=5, num_cpus=8)
def inside_ray_task():
    sleep_time = random.randint(120, 600)

    start_time = time.perf_counter()
    # Poll in 1 ms steps until sleep_time seconds have elapsed.
    while time.perf_counter() - start_time < sleep_time:
        time.sleep(0.001)


# The outer task submits all 8000 inner tasks at once and waits for them.
@ray.remote(max_retries=0)
def outside_ray_task():
    future_list = [inside_ray_task.remote() for _ in range(8000)]
    ray.get(future_list)


if __name__ == '__main__':
    # Connect to the cluster through Ray Client.
    ray.init("ray://localhost:10001")
    ray.get(outside_ray_task.remote())
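
For a sense of scale, here is a rough sketch of how much CPU the inner tasks in the reproduction script above request in total, and how many worker pods would be needed to run them all at once. The helper names and the 16-CPU worker size are hypothetical, chosen only for illustration; the issue does not state the actual worker group configuration.

```python
import math


def total_cpu_demand(num_tasks: int, cpus_per_task: int) -> int:
    """Total CPUs requested if every task were scheduled simultaneously."""
    return num_tasks * cpus_per_task


def workers_needed(num_tasks: int, cpus_per_task: int, cpus_per_worker: int) -> int:
    """Workers needed to run all tasks at once (hypothetical helper).

    A task cannot span workers, so each worker fits only
    floor(cpus_per_worker / cpus_per_task) tasks.
    """
    if cpus_per_worker < cpus_per_task:
        raise ValueError("a single task does not fit on one worker")
    tasks_per_worker = cpus_per_worker // cpus_per_task
    return math.ceil(num_tasks / tasks_per_worker)


if __name__ == "__main__":
    # Numbers from the reproduction script: 8000 tasks, num_cpus=8 each.
    print(total_cpu_demand(8000, 8))
    # ASSUMPTION: 16-CPU workers; the real worker size is not given in the issue.
    print(workers_needed(8000, 8, 16))
```

Under these assumptions the script asks for 64000 CPUs at once, so in practice the cluster is expected to scale to the worker group's maximum replica count and work through the queue from there.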

Anything else

Has there been any recent progress on autoscaler v2? It seems that it has not been updated for a long time.

jjyao commented 3 months ago

Hi @yx367563, for now please use autoscaler v1. v2 development is paused right now due to limited resources.

yx367563 commented 3 months ago

@jjyao In fact, I only want to use autoscaler v2 because v1 has a problem with killing working nodes (https://github.com/ray-project/ray/issues/46492). I was recommended to try v2, and the bug was indeed eliminated there. Is there any solution for v1?

rickyyx commented 3 months ago

Thanks for reporting, @yx367563. Would it be easy for you to share some head node logs (particularly the monitor logs) with v2?
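
For anyone gathering these logs, a minimal sketch of one way to collect the tail of each monitor log file on the head node might look like the following. It assumes Ray's default log directory, /tmp/ray/session_latest/logs, and that the autoscaler writes files whose names start with "monitor"; both may differ in a given deployment, and the function name is hypothetical.

```python
from collections import deque
from pathlib import Path
from typing import Dict, List

# ASSUMPTION: Ray's default log directory on the head node.
DEFAULT_LOG_DIR = Path("/tmp/ray/session_latest/logs")


def tail_monitor_logs(
    log_dir: Path = DEFAULT_LOG_DIR,
    pattern: str = "monitor*",  # ASSUMPTION: monitor log file naming
    max_lines: int = 200,
) -> Dict[str, List[str]]:
    """Return the last max_lines lines of each matching log file."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return {}
    tails: Dict[str, List[str]] = {}
    for path in sorted(log_dir.glob(pattern)):
        if not path.is_file():
            continue
        with path.open("r", errors="replace") as f:
            # deque with maxlen keeps only the most recent lines in memory.
            tails[path.name] = list(
                deque((line.rstrip("\n") for line in f), maxlen=max_lines)
            )
    return tails


if __name__ == "__main__":
    for name, lines in tail_monitor_logs().items():
        print(f"===== {name} =====")
        print("\n".join(lines))
```

Run on the head node (or in whichever container mounts the Ray log directory), this prints a short excerpt per file that can be pasted into the issue.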

yx367563 commented 3 months ago

Sorry, I have stopped using autoscaler v2. I hope this bug can be fixed in v1 (https://github.com/ray-project/ray/issues/46492).

rickyyx commented 3 months ago

Sure, I will see if I have time to repro this on my end. Thanks!

yx367563 commented 3 months ago

@rickyyx Thank you! Looking forward to your feedback!