risingwavelabs / risingwave


Discussion: give users the choice to schedule all the actors of a streaming job to the same CN #15668

Open · lmatz opened this issue 7 months ago

lmatz commented 7 months ago

Correct me if I am wrong, but I believe that right now the meta node tries to schedule the actors/parallel units (not sure which is the correct terminology) of a streaming job across all the CNs as evenly as possible.

As explained in https://www.notion.so/risingwave-labs/RFC-Implement-Shared-Nothing-Local-Compaction-7f58ebd9022345e69eb308b906090b3f?pvs=4:

  1. Inefficiencies come from the fact that compaction maintains global key ordering in each non-overlapping level while we use hash partitioning for data distribution
  2. Compute node read IOPS and read amplification for fresh data are proportional to cluster size, not data volume

I think we can deduce from these two claims that if we schedule all the actors of a streaming job to the same CN, these inefficiencies no longer exist, because for this job it is essentially running in a 1-CN cluster. (Not necessarily 1, but the fewer CNs, the better.)
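
To make the deduction concrete, here is a rough back-of-the-envelope reading of claim 2 (my own simplification, not taken from the RFC): the fresh-data overheads a job sees scale with the number of CNs that host its actors,

$$
\text{fresh-data read IOPS / read amplification for a job} \;\propto\; k,
$$

where $k$ is the number of CNs the job's actors are scheduled on. Confining the job to $k = 1$ (or a small $k$) therefore minimizes these overheads regardless of the total cluster size.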

In real-life workloads, we often observe that users run a large number of streaming jobs, and a large percentage of them do NOT need to occupy all the parallel units, i.e. streaming_parallelism << the maximum it could use. In such cases, we have the opportunity to schedule each of them onto a single CN (while still fully utilizing all the resources, because there are many jobs), which increases efficiency.

If we were able to manually schedule in such a way right now, we could run experiments, especially with a real workload, to verify whether the improvement is significant enough.

Welcome comments!

BugenZhao commented 7 months ago

If we were able to manually schedule in such a way right now, we could run experiments, especially with a real workload, to verify whether the improvement is significant enough.

A simple example could be comparing the performance of 6 actors on a single machine with that of 2 actors on 3 machines, by specifying --parallelism on start-up. And the result should be fairly predictable: the single node outperforms, as there is no remote shuffling involved and the cache locality is better.

However, I believe this only holds when there are sufficient resources for a single machine to run 6 actors. What if the business needs a significant scale, say, requiring 600 or 6000 cores? In such cases we have no choice but to schedule the job in a distributed manner.

lmatz commented 7 months ago

And the result should be fairly predictable: the single node outperforms, as there is no remote shuffling

This is a good point; it gives us one more reason to keep the actors of a job as colocated as possible.

What if the business needs a significant scale, say, requiring 600 or 6000 cores? In such cases we have no choice but to schedule the job in a distributed manner.

We keep the existing option, i.e. scheduling all the actors of a job as evenly as possible; maybe the title of the issue gives a false impression. In addition, we give users the choice to schedule a job to a specific subset of compute nodes.
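
To make this user-facing choice concrete, one hypothetical shape for a per-job policy on the meta side could look like the sketch below (all identifiers are invented for illustration; this is not an existing RisingWave API):

```rust
/// Hypothetical id of a compute node; invented for this sketch.
#[allow(dead_code)]
type ComputeNodeId = u32;

/// Hypothetical per-job scheduling policy; none of these names exist in RisingWave today.
#[allow(dead_code)]
enum SchedulingPolicy {
    /// Current behavior: spread the job's actors across all CNs as evenly as possible.
    SpreadEvenly,
    /// Proposed: pack the job's actors onto as few CNs as possible.
    PackToFewest,
    /// Proposed: restrict the job's actors to an explicit subset of compute nodes.
    PinToNodes(Vec<ComputeNodeId>),
}

fn main() {
    // Example: pin a job's actors to CNs 1 and 2.
    let _policy = SchedulingPolicy::PinToNodes(vec![1, 2]);
}
```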

In my experience, if a job requires 600 or 6000 cores, that job is likely to be taken good care of and to occupy a dedicated cluster. This is a special case in the sense that:

  1. we are mainly working with SMB and mid-market customers at the moment. A single job that requires 600~6000 cores is more likely to exist at tech giants.
  2. 600 or 6000 cores would be spread across a sizable number of compute nodes, so performance is likely to suffer from the storage overhead mentioned in the Notion doc. It's hard to say whether RW is ready to tackle such a large-scale scenario and stay competitive.

Besides, among existing use cases, (1) the parallelism of a single job is << the sum of the parallelism of all the jobs, which gives us room to rearrange the actors (stated a bit more formally after this list) so that:

  1. each job is scheduled on as few compute nodes as possible while satisfying its parallelism requirement.
  2. each compute node still has roughly the same workload (CPU/IO consumption) to process.
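
Stated slightly more formally (my own framing of the two goals above, not taken from any design doc): give job $j$ with parallelism $p_j$ an allocation $x_{j,n} \ge 0$ of parallel units on each compute node $n$ with capacity $c_n$, such that

$$
\sum_n x_{j,n} = p_j \;\;\forall j, \qquad \sum_j x_{j,n} \le c_n \;\;\forall n,
$$

while minimizing the number of CNs each job touches, $|\{\, n : x_{j,n} > 0 \,\}|$, and keeping the per-node totals $\sum_j x_{j,n}$ roughly equal. This is a bin-packing-flavored problem, so a greedy heuristic is a natural first cut.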

There do exist a handful of jobs with the maximum parallelism (the total number of CPUs in the cluster). But due to (1), in reality they only get a fraction of the total resources, roughly maximum parallelism * total CPUs / the sum of the parallelism of all the jobs.
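
A hypothetical example of that arithmetic (the numbers are made up purely for illustration): with 64 CPUs in the cluster, a job declared at the maximum parallelism of 64, and a parallelism sum of 640 across all jobs, the job's effective share is

$$
64 \times \frac{64}{640} = 6.4 \;\text{CPUs' worth of compute},
$$

i.e. roughly a tenth of what its declared parallelism suggests.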

By properly scheduling these maximum-parallelism jobs to a subset of compute nodes (which means somewhat less parallelism on paper), they may in fact (1) acquire more resources, as fewer jobs share those compute nodes with them, and (2) suffer less interference from the shuffle overhead and the storage overhead.

lmatz commented 5 months ago

I think this has the potential to be a good cloud-only feature, and the selling points are to:

  1. automatically set the proper streaming_parallelism for each job
  2. automatically schedule/bin-pack the jobs among all the compute nodes (a greedy sketch of such packing follows below)
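
A minimal greedy sketch of that bin-packing, assuming each CN exposes a fixed number of parallel units and each job declares its parallelism up front (the types, numbers, and heuristic are illustrative only, not RisingWave's actual scheduler):

```rust
// Greedy bin-packing sketch, NOT RisingWave's actual scheduler: place big jobs
// first, try to fit each job onto a single compute node, and spill to the node
// with the most free capacity only when necessary.

#[derive(Debug)]
struct ComputeNode {
    id: u32,
    capacity: u32, // total parallel units (e.g. CPU cores) on this CN
    used: u32,     // parallel units already assigned to some job
}

impl ComputeNode {
    fn free(&self) -> u32 {
        self.capacity - self.used
    }
}

#[derive(Debug)]
struct Job {
    id: u32,
    parallelism: u32, // the job's requested streaming parallelism
}

/// Returns, for each job, the list of (compute node id, parallel units) it was packed onto.
fn pack_jobs(mut jobs: Vec<Job>, nodes: &mut [ComputeNode]) -> Vec<(u32, Vec<(u32, u32)>)> {
    // Largest jobs first: they are the hardest to keep on a single node.
    jobs.sort_by_key(|j| std::cmp::Reverse(j.parallelism));
    let mut plan = Vec::new();

    for job in &jobs {
        let mut remaining = job.parallelism;
        let mut assignment = Vec::new();
        while remaining > 0 {
            // Prefer a node that can hold the whole remainder; among those, pick the
            // least-loaded one to keep nodes balanced. Otherwise spill onto the node
            // with the most free capacity to keep the number of touched nodes small.
            let idx = nodes
                .iter()
                .enumerate()
                .filter(|(_, n)| n.free() >= remaining)
                .min_by_key(|(_, n)| n.used)
                .or_else(|| {
                    nodes
                        .iter()
                        .enumerate()
                        .filter(|(_, n)| n.free() > 0)
                        .max_by_key(|(_, n)| n.free())
                })
                .map(|(i, _)| i)
                .expect("cluster has no free parallel units left");
            let take = remaining.min(nodes[idx].free());
            nodes[idx].used += take;
            assignment.push((nodes[idx].id, take));
            remaining -= take;
        }
        plan.push((job.id, assignment));
    }
    plan
}

fn main() {
    let mut nodes: Vec<ComputeNode> = (0..3)
        .map(|id| ComputeNode { id, capacity: 8, used: 0 })
        .collect();
    let jobs = vec![
        Job { id: 1, parallelism: 6 },
        Job { id: 2, parallelism: 4 },
        Job { id: 3, parallelism: 4 },
        Job { id: 4, parallelism: 8 },
    ];
    // With 3 CNs of 8 units each, every job lands on a single CN here.
    for (job_id, assignment) in pack_jobs(jobs, &mut nodes) {
        println!("job {job_id} -> {assignment:?}");
    }
}
```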

github-actions[bot] commented 3 months ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄