kpavel commented 2 years ago

What happened + What you expected to happen

When autoscaling fast (`upscaling_speed: 99999`) from 0 to 40 worker nodes, the raylet process on the head node crashes without logging the failure anywhere. The head node (32 CPUs, 128 GB RAM) does not run out of resources at any point.
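For context on why such a large `upscaling_speed` leads to a single 40-node batch, here is a toy model of the batching. It assumes the rule described in the Ray cluster configuration docs: pending launches are capped at `upscaling_speed` times the current node count, with a floor of 5. The function name `simulate_upscaling`, the `min_batch` parameter, and counting the head node as one node are illustrative assumptions, not the autoscaler's actual code.

```python
import math


def simulate_upscaling(target_workers, upscaling_speed, max_workers, min_batch=5):
    """Toy model: return the number of workers launched in each autoscaler round."""
    running = 0
    batches = []
    target = min(target_workers, max_workers)
    while running < target:
        # Assumed rule: the launch cap is a multiple of the current node count
        # (head node counted as 1), never lower than min_batch.
        allowed = max(min_batch, math.floor(upscaling_speed * (running + 1)))
        batch = min(allowed, target - running)
        batches.append(batch)
        # Simplification: every launched worker is up before the next round.
        running += batch
    return batches


if __name__ == "__main__":
    print(simulate_upscaling(40, 999999, 40))  # everything in one batch
    print(simulate_upscaling(40, 1.0, 40))     # gradual ramp-up
```

Under these assumptions, the setting from this report requests all 40 workers in the first round, while a value such as 1.0 spreads the launches over several rounds. That would fit the question of whether "slow" autoscaling avoids the crash.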

Versions / Dependencies

Happened on a previous version and still occurs with the latest release, ray==1.13.0.

Reproduction script

1. In the cluster config file, set `upscaling_speed: 999999`, `max_workers: 40`, and `min_workers: 0`.
2. Spawn the cluster and start a load that makes the autoscaler start all 40 workers in one batch:
```
import ray
import sys
import time
import socket

# Connect to the head node through Ray Client.
ray.init(address='ray://141.125.161.44:10001')

# Each task asks for 0.2 CPU, so 1000 tasks create enough demand to scale up.
@ray.remote(num_cpus=0.2)
def exec_cmd(cmd_line):
    cmd_line_split = cmd_line.strip().split(' ')
    begin = time.time()
    try:
        print("cmmm:" + str(cmd_line_split))
        end = time.time()
        duration = end - begin

        # print(stderr)  # `stderr` is not defined anywhere in the script as posted

        time.sleep(20)
        return "abc" + str(time.time())
    except socket.timeout:
        print('done')

result_ptr_list = []

# Submit all tasks first, then block until every result is available.
for i in range(1000):
    result_pr = exec_cmd.remote(str(i))
    result_ptr_list.append(result_pr)

results = ray.get(result_ptr_list)
print(results)
```



Logs:
[TMP_RAY_FOLDER.tar.gz](https://github.com/ray-project/ray/files/8899391/TMP_RAY_FOLDER.tar.gz)

### Issue Severity

High: It blocks me from completing my task.

richardliaw commented 2 years ago

@cadedaniel could you do a bit of investigation on this one?

cadedaniel commented 2 years ago

Hi @kpavel.

I tried your script on a GCP cluster but could not reproduce the issue -- the autoscaler correctly started a bunch of nodes, even though it was all at once.

Could you provide more information so I can reproduce the issue? What cloud/environment are you running in? Does "slow" autoscaling work fine? Does this crash occur every time you run the script?

kpavel commented 2 years ago

> Hi @kpavel.
>
> I tried your script on a GCP cluster but could not reproduce the issue -- the autoscaler correctly started a bunch of nodes, even though it was all at once.
>
> Could you provide more information so I can reproduce the issue? What cloud/environment are you running in? Does "slow" autoscaling work fine? Does this crash occur every time you run the script?

Hi @cadedaniel. Did you try with the same number of nodes, 40? Scaling up from 0 to 20 nodes works fine for me, and sometimes even 30 nodes all at once, but with higher numbers, like 40 nodes all at once, it crashes.
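To compare runs at 20, 30, and 40 nodes, it may help to record how many nodes the cluster reports as alive after each attempt. The sketch below summarizes a node list in the shape returned by `ray.nodes()`. It assumes each entry is a dict with `Alive` and `NodeManagerAddress` keys, and the helper name `summarize_nodes` is hypothetical.

```python
def summarize_nodes(nodes):
    """Count alive and dead nodes in a list shaped like the output of ray.nodes()."""
    alive = [n for n in nodes if n.get("Alive")]
    dead = [n for n in nodes if not n.get("Alive")]
    return {
        "alive": len(alive),
        "dead": len(dead),
        # Addresses of nodes reported as dead, to help locate their logs.
        "dead_addresses": [n.get("NodeManagerAddress") for n in dead],
    }


if __name__ == "__main__":
    # Stand-in data; on a live cluster one would pass ray.nodes() after ray.init().
    sample = [
        {"Alive": True, "NodeManagerAddress": "10.0.0.1"},
        {"Alive": False, "NodeManagerAddress": "10.0.0.2"},
    ]
    print(summarize_nodes(sample))
```

Calling `summarize_nodes(ray.nodes())` after each scaling attempt would show whether the head node itself drops out or only some workers fail to join.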

hora-anyscale commented 2 years ago

Per Triage Sync: @cadedaniel can you confirm you tested with 40 nodes?

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

ray-project / ray

[Core/Autoscaler] High upscaling causes raylet process crash #25760
