risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0

Tracking: high cardinality metrics #18108

Closed. kwannoel closed this 4 weeks ago.

kwannoel commented 2 months ago

The metrics tracked here need to be fine-tuned, as their granularity is too high. Either they need to be changed from actor level to fragment level, or a backup metric at the MV level has to be provided, so that even if actor/fragment-level metrics are too expensive to keep, we still have MV-level observability.

fuyufjh commented 2 months ago

Currently, our metrics are managed by LabelGuardedMetric structs, each corresponding to one timeseries, i.e. a metric with a specific set of labels. This design allows the metrics to be correctly destructed when an MV is dropped.

To turn an actor-level metric into a fragment-level one, we need to share an Arc<LabelGuardedMetric> among all the actors in a fragment. In particular, locking, e.g. Arc<Mutex<LabelGuardedMetric>>, should be avoided, because the metric structures provided by the Prometheus client library are already thread-safe.
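
As a rough sketch of the idea (using a plain IntCounter from the prometheus crate as a stand-in for LabelGuardedMetric; the names below are hypothetical, not the actual RisingWave types), one metric instance per fragment is cloned as an Arc into every actor, with no lock involved:

```rust
use std::sync::Arc;

use prometheus::IntCounter;

// Hypothetical stand-in for a fragment-level LabelGuardedMetric.
struct ActorMetrics {
    // Shared by all sibling actors of the same fragment on this node.
    // No Mutex: Prometheus counters are internally atomic and thread-safe.
    output_buffer_blocking_duration_ns: Arc<IntCounter>,
}

fn build_actor_metrics(parallelism: usize) -> Vec<ActorMetrics> {
    let fragment_counter = Arc::new(
        IntCounter::new(
            "stream_fragment_output_buffer_blocking_duration_ns",
            "Output buffer blocking duration, summed over the fragment's actors on this node",
        )
        .unwrap(),
    );

    (0..parallelism)
        .map(|_| ActorMetrics {
            output_buffer_blocking_duration_ns: fragment_counter.clone(),
        })
        .collect()
}

// Each actor then records into the shared counter directly, e.g.:
//     metrics.output_buffer_blocking_duration_ns.inc_by(blocked_ns);
```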

Another thing worth mentioning is that the avg in PromQL queries, such as

https://github.com/risingwavelabs/risingwave/blob/1b770666b92cb0a197714904b955cc5c6c3c6d2f/grafana/risingwave-dev-dashboard.dashboard.py#L1064

won't work as expected, because stream_actor_output_buffer_blocking_duration_ns will actually become a node-level sum if we simply use the Arc<LabelGuardedMetric> approach. To solve this, perhaps an AVG-specific structure needs to be introduced that calculates the average when collect() is called.
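
For illustration, here is one hypothetical shape such an AVG-specific structure could take with the prometheus crate: a custom Collector that keeps a shared sum and count and recomputes the exposed gauge at scrape time. (As the follow-up below points out, this still only yields a per-node average.)

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use prometheus::core::{Collector, Desc};
use prometheus::proto::MetricFamily;
use prometheus::Gauge;

// Hypothetical "average at collect() time" metric: actors accumulate into
// sum/count, and the exported gauge is recomputed as sum / count whenever
// the registry is scraped.
struct AvgOnCollect {
    sum_ns: AtomicU64, // e.g. blocking duration in ns, added by all actors
    count: AtomicU64,  // number of contributing observations
    gauge: Gauge,      // the single timeseries actually exposed
}

impl AvgOnCollect {
    fn observe(&self, ns: u64) {
        self.sum_ns.fetch_add(ns, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

impl Collector for AvgOnCollect {
    fn desc(&self) -> Vec<&Desc> {
        self.gauge.desc()
    }

    fn collect(&self) -> Vec<MetricFamily> {
        let sum = self.sum_ns.load(Ordering::Relaxed) as f64;
        let count = self.count.load(Ordering::Relaxed) as f64;
        self.gauge.set(if count > 0.0 { sum / count } else { 0.0 });
        self.gauge.collect()
    }
}
```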

fuyufjh commented 2 months ago

Another thing worth mentioning is that the avg in PromQL queries, such as

https://github.com/risingwavelabs/risingwave/blob/1b770666b92cb0a197714904b955cc5c6c3c6d2f/grafana/risingwave-dev-dashboard.dashboard.py#L1064

won't work as expected, because stream_actor_output_buffer_blocking_duration_ns will actually become a node-level sum if we simply use the Arc<LabelGuardedMetric> approach. To solve this, perhaps an AVG-specific structure needs to be introduced that calculates the average when collect() is called.

Oh, wait, I suddenly realized the approach above doesn't work as expected. For example, assume we have a fragment with parallelism 8:

fragment F:   CN A:  actor-0  actor-1  actor-2
              CN B:  actor-3  actor-4  actor-5
              CN C:  actor-6  actor-7

Here, the avg of these 3 node-level avgs is NOT the avg of these 8 actors, because the actors are not evenly distributed. For example, if each actor on CN A and CN B reports 1 while each actor on CN C reports 9, the true per-actor average is (6 × 1 + 2 × 9) / 8 = 3, whereas the average of the three node-level averages is (1 + 1 + 9) / 3 ≈ 3.67. The more uneven the distribution, the more severe the problem.


I am now thinking about another approach: splitting the avg into sum / count. The sum can easily be aggregated locally by sharing a Counter. As for the count, which is the parallelism of a fragment, we might need to introduce a new metric and reuse it across all of the listed metrics.
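
A rough sketch of the split (hypothetical metric names, not the actual implementation): the sum is a shared Counter per fragment per node, and the count is a new fragment-parallelism gauge that all of the listed metrics could reuse.

```rust
use std::sync::Arc;

use prometheus::{IntCounter, IntGauge};

// Hypothetical fragment-level metrics for one worker node: the sum is a
// Counter shared (via Arc) by all of the fragment's actors on this node,
// while the count is the fragment's local parallelism, exported once and
// reusable across all of the metrics listed in this issue.
struct FragmentMetrics {
    output_buffer_blocking_duration_ns_sum: Arc<IntCounter>,
    parallelism: IntGauge,
}

fn new_fragment_metrics(local_parallelism: usize) -> FragmentMetrics {
    let sum = IntCounter::new(
        "stream_fragment_output_buffer_blocking_duration_ns",
        "Blocking duration summed over this fragment's actors on this node",
    )
    .unwrap();
    let parallelism = IntGauge::new(
        "stream_fragment_parallelism",
        "Number of actors of this fragment running on this node",
    )
    .unwrap();
    parallelism.set(local_parallelism as i64);

    FragmentMetrics {
        output_buffer_blocking_duration_ns_sum: Arc::new(sum),
        parallelism,
    }
}

// The dashboard could then recover an unbiased per-actor average, e.g.
// (PromQL, with the hypothetical metric names above):
//   sum by (fragment_id) (rate(stream_fragment_output_buffer_blocking_duration_ns[1m]))
//     / sum by (fragment_id) (stream_fragment_parallelism)
```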

kwannoel commented 2 months ago

What about having these metrics aggregated at a fragment per worker level, rather than fragment level? Is that easier?

fuyufjh commented 2 months ago

What about having these metrics aggregated at a fragment per worker level, rather than fragment level? Is that easier?

How can you get other CNs' metrics in the current CN's code? RPC? 🤣

BugenZhao commented 2 months ago

we need to share an Arc<LabelGuardedMetric> among all the actors in a fragment.

Note that Arc only works within a single process. To avoid dropping the metric unexpectedly, I suppose we still need another label of worker_id for those metrics to differentiate between different worker nodes. Am I understanding correctly?

Regarding the PromQL query, the difference is that we lose the information about how many actors (of the same fragment) there are on each worker node. To avoid bias in the final result, especially when there is skew in the scheduling of the actors, we need to perform some manual weighted averaging.
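
For illustration, a hypothetical helper showing the weighted averaging, using the 3 / 3 / 2 actor distribution from the example above:

```rust
/// Hypothetical helper: recover the fragment-wide per-actor average from
/// per-node (node_avg, actor_count) pairs by weighting each node's average
/// with its actor count.
fn weighted_fragment_avg(per_node: &[(f64, u64)]) -> f64 {
    let total_actors: u64 = per_node.iter().map(|(_, n)| n).sum();
    let weighted_sum: f64 = per_node.iter().map(|(avg, n)| *avg * *n as f64).sum();
    weighted_sum / total_actors as f64
}

fn main() {
    // CN A and CN B each run 3 actors averaging 1, CN C runs 2 actors averaging 9.
    let nodes = [(1.0, 3), (1.0, 3), (9.0, 2)];
    println!("{}", weighted_fragment_avg(&nodes)); // 3, the true per-actor average
    // A naive avg of the node-level averages would give (1 + 1 + 9) / 3 ≈ 3.67.
}
```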


I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really common practice to trade development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics to address the performance issues.

fuyufjh commented 2 months ago

I suppose we still need another label of worker_id for those metrics to differentiate between different worker nodes. Am I understanding correctly?

No need, because the collector will add another instance label to the metrics.

I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really common practice to trade development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics to address the performance issues.

Agree. The approach is dirty and ugly, but @arkbriar strongly pushed for it.

BugenZhao commented 2 months ago

No need, because the collector will add another instance label to the metrics.

Only in Cloud or k8s deployments, right?

fuyufjh commented 2 months ago

Only in Cloud or k8s deployments, right?

True, but I believe it's common practice. Not all systems assign a node_id like RisingWave does, so I suppose they must need such an additional label.

arkbriar commented 2 months ago

I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really common practice to trade development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics to address the performance issues.

It is commonly acknowledged that high-cardinality metrics significantly impact the performance and availability of time series databases. If you think it's arbitrary, please provide a solution that keeps the TSDB stable: one of the running compute nodes is emitting a ridiculous 0.6M time series, and I personally am not able to manage that.

fuyufjh commented 2 months ago

We've had many discussions about this issue. Overall, I vote +1 for it because our system can indeed support 10k-100k actors per node, while the monitoring stack, especially the timeseries database, can't hold that many metrics, or can only do so at a much higher cost than RW itself. 😂

Although it's not a perfect solution, I believe it's the best we can do under the circumstances. I'm also trying to minimize the invasiveness of the implementation: https://github.com/risingwavelabs/risingwave/issues/18108#issuecomment-2303836241 tries to offer a way that works whether the design is enabled or disabled.

fuyufjh commented 2 months ago

Half a year ago, I raised my concerns and put the requirement aside. Related discussions can be found in the above issue. I vote +1 for it now, after serious consideration of the trade-offs.

BugenZhao commented 2 months ago

I suppose we still need another label of worker_id for those metrics to differentiate between different worker nodes. Am I understanding correctly?

Update: this is automatically done by Prometheus: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series