risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.

discussion: limit the number of (stateful) actors per CN #16092

Open · fuyufjh opened this issue 6 months ago

fuyufjh commented 6 months ago

Motivation

Generally, when there are many (e.g. 10+) streaming jobs running in one RisingWave instance, it's no longer a good idea to use the full number of CPU cores as the parallelism for every fragment. This proposal tries to address that problem.

Recently, we found several issues related to the number of actors per CN:

  1. Local HummockUploader failed to produce the new version within one checkpoint_interval, which caused barriers to pile up. Fixed by https://github.com/risingwavelabs/risingwave/pull/15931.
  2. Actor-level metrics can sometimes be too heavy for Prometheus https://github.com/risingwavelabs/risingwave/issues/14821
  3. For various reasons, each actor has a fixed memory footprint. The more actors, the less memory is left for our streaming cache. In the longevity test, the cached data is nearly zero.

Design

I think we could introduce a soft limit and a hard limit on the number of actors per CN: exceeding the soft limit only raises a notice when a new streaming job is created, while exceeding the hard limit rejects the creation.

In the notice message, users are encouraged to use the ALTER command to set a smaller parallelism on existing streaming jobs.
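
For example, the notice could point users to something like the statement below (the object name here is hypothetical, and this assumes the ALTER ... SET PARALLELISM syntax):

-- Shrink an existing streaming job so it occupies fewer actors on each CN.
ALTER MATERIALIZED VIEW my_mv SET PARALLELISM = 4;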

Implementation

The implementation is trivial, but we need to pick the default thresholds carefully. For example,

actors_soft_limit_per_core = 100
actors_hard_limit_per_core = 200
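
(With these defaults, a CN with 8 CPU cores would start emitting notices beyond 800 actors and would hit the hard limit at 1,600.)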

TBD

lmatz commented 5 months ago

Link: https://github.com/risingwavelabs/risingwave/issues/15668. Not super related, but it builds on the same premise: "it's no longer a good idea to use the full number of CPU cores as the parallelism for every fragment".

st1page commented 5 months ago

How about directly limiting the number of physical state table instances instead?

github-actions[bot] commented 2 months ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄