Closed: stephanie-wang closed this issue 1 year ago
Quick update: I tried Ray 1.13, and adding scheduling_strategy='SPREAD' solved the issue. The performance numbers under 300 cores now make much more sense. For example, from the dashboard I could see the whole grid was taken up within 1-2 seconds, compared with 20-30 seconds without the fix. We are still doing more testing.
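For reference, a minimal sketch of how the strategy can be applied (the task body, cluster address, and round size are illustrative, not from the original report):

```python
import ray

ray.init(address="auto")  # connect to the existing cluster

# "SPREAD" asks the scheduler to spread tasks across nodes instead of
# packing them onto whichever nodes respond first (Ray >= 1.13).
@ray.remote(num_cpus=1, scheduling_strategy="SPREAD")
def work(i):
    return i * i

results = ray.get([work.remote(i) for i in range(300)])
```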
I confirmed this issue on the nightly wheels. It looks like an issue at startup time: if you run 300 more tasks after the initial round, they are all scheduled within one second. Let's try to get this fixed before the 2.0 release.
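A minimal way to observe the two-round behavior described above; the no-op task and the round size of 300 are assumptions for illustration:

```python
import time
import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def noop():
    pass

def timed_round(n=300):
    start = time.time()
    ray.get([noop.remote() for _ in range(n)])
    return time.time() - start

print("first round:  %.1fs" % timed_round())  # slow right after cluster startup
print("second round: %.1fs" % timed_round())  # all scheduled within ~1s
```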
Per Triage Sync: @stephanie-wang is there a timeline to address this?
I believe @jjyao is taking this over and is trying out a fix in the next week or two.
@ericl I think this is more of a P1 (to be fixed in the next release, but not quite warranting a hotfix release). Thoughts?
I think it's actually a P0, but agree it's not a release blocker.
Core oncall: gentle ping for an update, given it's a P0.
Should be addressed by https://github.com/ray-project/ray/pull/31868 and https://github.com/ray-project/ray/pull/31934. With these changes, the scheduling throughput of our embarrassingly parallel release test test_many_tasks increased 8x, from 25 tasks/s to 200 tasks/s.
What happened + What you expected to happen
Hi Ray team, I’m experiencing a scalability issue when testing my cluster with more nodes. Here’s a bit of context.
I also tried “gang scheduling” with a placement group, following the instructions from Placement Group Examples — Ray v1.9.2 (I’m using Ray 1.9), but it didn’t make much of a difference.
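Roughly the pattern from that docs page, reconstructed against the Ray 1.9-era placement group API; the bundle shape, the SPREAD strategy, and the task body are assumptions for illustration:

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Reserve one CPU per bundle across the cluster and wait until the
# whole group is placed before submitting any work ("gang scheduling").
pg = placement_group([{"CPU": 1}] * 300, strategy="SPREAD")
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def work(i):
    return i * i

# In Ray 1.9, tasks are pinned to the group via .options(placement_group=...).
results = ray.get([work.options(placement_group=pg).remote(i) for i in range(300)])
```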
We found this issue during our final regression test before our first go-live with Ray, and it’s now a blocker. I would really appreciate it if anyone could provide suggestions. I’m glad to provide any further information if needed.
Versions / Dependencies
Ray 1.9; needs reproduction on 1.13 and master.
Reproduction script
Issue Severity
High: It blocks me from completing my task.