Open kpavel opened 2 years ago
@cadedaniel could you do a bit of investigation on this one?
Hi @kpavel.
I tried your script on a GCP cluster but could not reproduce the issue -- the autoscaler correctly started a bunch of nodes, even though it was all at once.
Could you provide more information so I can reproduce the issue? What cloud/environment are you running in? Does "slow" autoscaling work fine? Does this crash occur every time you run the script?
Hi @cadedaniel. Did you try with the same number of nodes, 40? Scaling up from 0 to 20 nodes works fine for me, and sometimes even 30 nodes all at once works, but with higher numbers, like 40 nodes all at once, it crashes.
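One way to narrow down the threshold described above (fine at 20, sometimes at 30, crashing at 40) is to scale up in stages instead of in one batch. The following is only a sketch under stated assumptions: `ramp_targets` and `ramp_up` are hypothetical helper names, the CPUs-per-node figure and the pause are placeholders, and the requested CPU total is cluster-wide, so it only approximates a worker count. It relies on `ray.autoscaler.sdk.request_resources`, which asks the autoscaler to provision enough capacity for the given number of CPUs.

```python
import time


def ramp_targets(total_nodes, step):
    """Return cumulative node counts: step, 2*step, ..., ending at total_nodes."""
    if total_nodes <= 0 or step <= 0:
        raise ValueError("total_nodes and step must be positive")
    targets = list(range(step, total_nodes, step))
    targets.append(total_nodes)
    return targets


def ramp_up(total_nodes=40, step=10, cpus_per_node=8, pause_s=120):
    """Request capacity in stages. All defaults are assumed placeholder values."""
    # Imported lazily so ramp_targets can be used without Ray installed.
    import ray
    from ray.autoscaler.sdk import request_resources

    ray.init(address="auto")  # assumes this runs on the head node
    for target in ramp_targets(total_nodes, step):
        # The request is a cluster-wide CPU total, not a node count.
        request_resources(num_cpus=target * cpus_per_node)
        print(f"requested capacity for roughly {target} nodes")
        time.sleep(pause_s)
```

Calling `ramp_up(total_nodes=40, step=10)` would request capacity in four stages; if the crash only appears when the step equals the total, that points at batch size rather than cluster size.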
Per Triage Sync: @cadedaniel can you confirm you tested with 40 nodes?
What happened + What you expected to happen
When autoscaling quickly (upscaling_speed: 99999) from 0 to 40 worker nodes, the head node's raylet process crashes without logging the failure anywhere. The head node does not run out of resources at any point; it has 32 CPUs and 128 GB of RAM.
Versions / Dependencies
This happened on the previous version and still occurs with the latest release, ray==1.13.0.
Reproduction script
In the cluster config file, set upscaling_speed: 999999, max_workers: 40, and min_workers: 0.
Start the cluster, then apply load so that the autoscaler launches all 40 workers in one batch:
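    import ray

    ray.init(address='ray://141.125.161.44:10001')

    @ray.remote(num_cpus=0.2)
    def exec_cmd(cmd_line):
        # ... (rest of the function body is not shown in the report)
        print(stderr)

    # Submit 1000 small tasks so the autoscaler has to add workers.
    result_ptr_list = []
    for i in range(1000):
        result_pr = exec_cmd.remote(str(i))
        result_ptr_list.append(result_pr)

    # Block until every task has finished, then print the results.
    results = ray.get(result_ptr_list)
    print(results)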
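Because the raylet crash reportedly leaves nothing in the logs, it may help to record how many nodes are alive while the load runs, so the moment of failure can be matched against the scale-up. The following is a minimal sketch, not part of the original report: `count_alive` and `watch_cluster` are hypothetical helper names and the interval and round count are placeholders. It uses `ray.nodes()`, which returns one dict per node with an `Alive` flag.

```python
import time


def count_alive(nodes):
    """Count entries marked alive in a list shaped like the output of ray.nodes()."""
    return sum(1 for node in nodes if node.get("Alive"))


def watch_cluster(address="auto", interval_s=5, rounds=60):
    """Print the alive-node count periodically. Defaults are assumed placeholders."""
    # Imported lazily so count_alive can be used without Ray installed.
    import ray

    ray.init(address=address)
    for _ in range(rounds):
        stamp = time.strftime("%H:%M:%S")
        print(stamp, "alive nodes:", count_alive(ray.nodes()))
        time.sleep(interval_s)
```

Running `watch_cluster()` in a separate process on the head node while the reproduction script submits tasks would give a timestamped trace; if the head raylet goes down, the `ray.nodes()` call itself is likely to fail, and the last printed line marks roughly when.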