ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.13k stars 5.8k forks source link

[Core/Autoscaler] High upscaling causes raylet process crash #25760

Open kpavel opened 2 years ago

kpavel commented 2 years ago

What happened + What you expected to happen

When autoscaling fast, with upscaling_speed: 99999, from 0 to 40 worker nodes, the head node raylet process crashes without logging the failure anywhere. The head node is not running out of resources at any point: 32 cpu + 128GB ram

Versions / Dependencies

Happened on previous version and continues to occur with latest ray==1.13.0

Reproduction script

  1. In cluster config file set upscaling_speed: 999999 max_workers: 40 min_workers: 0

  2. Spawn cluster and start load to make autoscaler start all 40 workers in one batch

    
    import ray
    import sys
    import time
    import socket

ray.init(address='ray://141.125.161.44:10001')

@ray.remote(num_cpus=0.2) def exec_cmd(cmd_line):

cmd_line_split = cmd_line.strip().split(' ')
begin = time.time()
try:
    print("cmmm:" + str(cmd_line_split))
    end=time.time()
    duration=end-begin

print(stderr)

    time.sleep(20)
    return("abc" + str(time.time()))

except socket.timeout:
    print('done')

result_ptr_list = []

for i in range(1000): result_pr = exec_cmd.remote(str(i)) result_ptr_list.append(result_pr) results = ray.get(result_ptr_list) print(results)



Logs:
[TMP_RAY_FOLDER.tar.gz](https://github.com/ray-project/ray/files/8899391/TMP_RAY_FOLDER.tar.gz)

### Issue Severity

High: It blocks me from completing my task.
richardliaw commented 2 years ago

@cadedaniel could you do a bit of investigation on this one?

cadedaniel commented 2 years ago

Hi @kpavel.

I tried your script on a GCP cluster but could not reproduce the issue -- the autoscaler correctly started a bunch of nodes, even though it was all at once.

Could you provide more information so I can reproduce the issue? What cloud/environment are you running in? Does "slow" autoscaling work fine? Does this crash occur every time you run the script?

kpavel commented 2 years ago

Hi @kpavel.

I tried your script on a GCP cluster but could not reproduce the issue -- the autoscaler correctly started a bunch of nodes, even though it was all at once.

Could you provide more information so I can reproduce the issue? What cloud/environment are you running in? Does "slow" autoscaling work fine? Does this crash occur every time you run the script?

Hi @cadedaniel. Did you try with same, 40, number of nodes? It works fine for me scaling up from 0 to 20 nodes and sometimes even for 30 nodes all at once, but when trying with higher numbers, like 40 nodes all at once, it crashes.

hora-anyscale commented 2 years ago

Per Triage Sync: @cadedaniel can you confirm you tested with 40 nodes?

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.