ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] More than 30s scheduling delay in embarrassingly parallel task workload after scaling from 100 to 300 cores #26262

Closed stephanie-wang closed 1 year ago

stephanie-wang commented 2 years ago

What happened + What you expected to happen

Hi Ray team, I’m experiencing a scalability issue when testing my cluster with more nodes. Here’s a bit of context.

ray version: 1.9
workload: 300 tasks, each of which takes about a minute to complete on a single core
cluster: around 300 cores running on a k8s platform. Each node is configured with 15 cores. Before I run my test, all nodes are brought up and all 300 cores are available.
execution pattern: I use a very simple parallel map pattern that sends out 300 tasks at once and waits for them all (see the reproduction script below).

performance
    100 cores:
        this is the case where the time adds up as expected: 300 tasks (each taking about 1 min on a single core) take less than 3 min to finish in total.
        in addition, from the Ray Dashboard, I see all nodes are taken up quickly.
    300 cores:
        when I increase my cores to 300, I expect the total time to be around 1 min. However, it takes more than 2 min to finish.
        In this case, from the Ray Dashboard, I see tasks are dispatched to workers much more slowly than in the 100-core case, and at most around 200 cores are in use at the same time.
        I also tried capturing timeline stats by setting RAY_PROFILING=1, and the timeline tells a similar story to what I observed from the dashboard (a capture sketch follows this list).

        Notice that some workers are scheduled much later than others, even though I send all tasks at the same time and the cluster has enough cores.
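
For reference, a minimal sketch of how such a timeline can be captured (the filename and script layout here are illustrative, not from the original report): set RAY_PROFILING=1 in the environment before starting Ray, run the workload, then dump the trace with ray.timeline.

import ray

# RAY_PROFILING=1 must be set in the environment before ray.init() so that
# workers record profiling events.
ray.init(address="auto")

# ... run the embarrassingly parallel workload here ...

# Dump the recorded events to a Chrome-trace JSON file; open it in
# chrome://tracing to inspect per-worker scheduling times.
ray.timeline(filename="timeline.json")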

I also tried “gang scheduling” with placement groups, following the instructions from Placement Group Examples — Ray v1.9.2 (I’m using Ray 1.9), but it doesn’t make much difference. A sketch of what I tried follows.
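
This is roughly the placement-group setup, per the Ray 1.9 docs (bundle layout, strategy, and function names here are illustrative, not the exact production code):

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Reserve one CPU per task up front so all 300 tasks can be placed at once
# (gang scheduling); 300 single-CPU bundles matches the workload above.
pg = placement_group([{"CPU": 1}] * 300, strategy="SPREAD")
ray.get(pg.ready())  # block until the whole reservation succeeds

@ray.remote(num_cpus=1)
def price_batch(batch):
    ...  # placeholder for the ~1 minute pricing computation

# In Ray 1.9, tasks are pinned to the group via the placement_group option.
refs = [price_batch.options(placement_group=pg).remote(b) for b in range(300)]
results = ray.get(refs)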

We found this issue during our final regression test before our first go-live with Ray, and it’s now a blocker. I would really appreciate it if anyone could provide suggestions. I’m glad to provide further information if needed.

Versions / Dependencies

1.9, need reproduction on 1.13 and master.

Reproduction script

def price(...):
    ...
    pricing_tasks = list(
        map(lambda batch: price_remote.remote(batch), pricing_batches)
    )
    return ray.get(pricing_tasks)
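
A self-contained version of the same pattern (price_one_batch and the sleep-based workload are stand-ins for the real pricing code, which is elided above):

import time
import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def price_one_batch(batch):
    time.sleep(60)  # stand-in for ~1 minute of single-core pricing work
    return batch

pricing_batches = list(range(300))  # one batch per available core
start = time.time()
results = ray.get([price_one_batch.remote(b) for b in pricing_batches])
print(f"elapsed: {time.time() - start:.1f}s")  # ideally ~60s on 300 idle cores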

Issue Severity

High: It blocks me from completing my task.

blshao84 commented 2 years ago

Quick update: I tried Ray 1.13 and adding scheduling_strategy='SPREAD' solved the issue. The performance metrics at 300 cores make much more sense now. For example, from the dashboard I could see the whole grid was taken up within 1-2 seconds, compared with 20-30 seconds without the fix. We are still doing more testing.
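
For anyone hitting the same thing, this is roughly where the option goes (price_remote and pricing_batches stand in for the real code from the reproduction script; the scheduling_strategy option is available starting in Ray 1.13):

import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def price_remote(batch):
    ...  # pricing work for one batch

pricing_batches = list(range(300))
# SPREAD asks the scheduler to spread tasks across nodes instead of packing them.
refs = [price_remote.options(scheduling_strategy="SPREAD").remote(b)
        for b in pricing_batches]
results = ray.get(refs)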

stephanie-wang commented 2 years ago

I confirmed this issue on nightly wheels. Looks like it's an issue at startup time. If you run 300 more tasks after the initial round, they are all scheduled within one second. Let's try to get this fixed before the 2.0 release.
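
A sketch of the check described above (the function name and 60-second sleep are illustrative): time one round of 300 tasks on a fresh cluster, then time a second round on the now-warm workers.

import time
import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def work(i):
    time.sleep(60)
    return i

def timed_round(label):
    start = time.time()
    ray.get([work.remote(i) for i in range(300)])
    print(f"{label}: {time.time() - start:.1f}s")

timed_round("first round")   # pays the startup/scheduling cost
timed_round("second round")  # workers are warm; tasks schedule within ~1s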

hora-anyscale commented 2 years ago

Per Triage Sync: @stephanie-wang is there a timeline to address this?

stephanie-wang commented 2 years ago

I believe @jjyao is taking this over and is trying out a fix in the next week or two.

zhe-thoughts commented 1 year ago

@ericl I think this is more of a P1 (to be fixed in the next release, but not quite warranting a hotfix release). Thoughts?

ericl commented 1 year ago

I think it's actually a P0, but agree it's not a release blocker.

clarng commented 1 year ago

Core on-call: gentle ping for an update given it's a P0.

scv119 commented 1 year ago

This should be addressed by https://github.com/ray-project/ray/pull/31868 and https://github.com/ray-project/ray/pull/31934. We have seen the scheduling speed of our embarrassingly parallel release test test_many_tasks increase by 8x, from 25 tasks/s to 200 tasks/s.