risingwavelabs / risingwave


Incredibly many metrics exported from meta and compute nodes #14821

Closed · arkbriar closed this issue 3 months ago

arkbriar commented 8 months ago

Describe the bug

As the title says. The metrics with the most time series:

  1. From the meta node (metrics.txt):
31072 actor_info
2994 storage_version_stats
1206 table_info
  2. From the compute node (metrics.compute.txt):
18229 stream_actor_output_buffer_blocking_duration_ns
11568 block_efficiency_histogram_bucket
10375 stream_actor_input_buffer_blocking_duration_ns
10375 stream_actor_in_record_cnt
10359 stream_actor_out_record_cnt
4698 state_store_sst_store_block_request_counts
4576 stream_join_barrier_align_duration_bucket
3294 stream_executor_row_count
3132 state_store_iter_scan_key_counts
2929 stream_join_matched_join_keys_bucket
2467 stream_memory_usage
2467 lru_evicted_watermark_time_ms
2349 state_store_read_req_positive_but_non_exist_counts
2349 state_store_read_req_check_bloom_filter_counts
2349 state_store_read_req_bloom_filter_positive_counts

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240125-disable-embedded-tracing

Additional context

No response

fuyufjh commented 7 months ago
31072 actor_info

The number "31072" reminds me of the case that the longevity test runs a lot of actors parallelly. If I remember correctly, the total number of actors is exactly 31072. The workload of longevity test might be a little bit extreme, but this is what it's designed to do.

The other metrics on the compute node are also large because of the large number of actors and tables.

For now, I don't have any better ideas to reduce the size. Recording metrics at actor level sounds totally reasonable to me.

fuyufjh commented 7 months ago

Particularly regarding actor_info: this is a "dummy" metric that stores actor information as labels, with the value always set to 1. In the Prometheus data model, this seems to be the only way to expose a table of information. Without it, one would need to access the RisingWave psql endpoint, which might not always be available.
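
For illustration, a minimal sketch of such an info-style metric using the Rust prometheus crate (the label names here are assumptions for the example, not necessarily the exact labels RisingWave exports):

```rust
use prometheus::{register_int_gauge_vec, Encoder, IntGaugeVec, TextEncoder};

fn main() {
    // An "info"-style metric: all the useful data lives in the labels and the
    // value is always 1. Every distinct label combination becomes its own
    // time series, so ~31k actors means ~31k series for this metric alone.
    let actor_info: IntGaugeVec = register_int_gauge_vec!(
        "actor_info",
        "Mapping from actor to fragment/table (dummy value, always 1)",
        &["actor_id", "fragment_id", "table_id"] // hypothetical label set
    )
    .unwrap();

    actor_info.with_label_values(&["1001", "7", "42"]).set(1);

    // Render the default registry in the Prometheus text exposition format.
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buf)
        .unwrap();
    print!("{}", String::from_utf8(buf).unwrap());
}
```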

arkbriar commented 7 months ago

Recording metrics at actor level sounds totally reasonable to me.

It is, until there are too many actors. The number is amplified when there is more than one node and actors are rescheduled across them.

xxchan commented 7 months ago

Recording metrics at actor level sounds totally reasonable to me.

It is, until there are too many actors. The number is amplified when there is more than one node and actors are rescheduled across them.

Do we have any idea of when it will become a problem? E.g., what is the current load on our Prometheus?

From here I see:

A typical Prometheus server can handle on the order of 10 million metrics before you start to see limitations

fuyufjh commented 6 months ago

Recording metrics at actor level sounds totally reasonable to me.

It is, until there are too many actors. The number is amplified when there is more than one node and actors are rescheduled across them.

That's true, but at the moment I think the root question is: why are there so many actors?

It is possible to aggregate the metrics by fragment before collection, although I don't think that is best practice. By definition, an actor is the basic unit of execution; you can think of it as a worker thread in an ordinary application.
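
A rough sketch of what pre-aggregating by fragment could look like (hypothetical types and names, just to illustrate the trade-off; not RisingWave's actual metrics code):

```rust
use std::collections::HashMap;

/// Hypothetical per-actor counter sample collected inside a compute node.
struct ActorSample {
    fragment_id: u32,
    actor_id: u32,
    in_record_cnt: u64,
}

/// Collapse actor-level samples into one value per fragment before exposing
/// them, so the exported series count scales with fragments, not actors.
fn aggregate_by_fragment(samples: &[ActorSample]) -> HashMap<u32, u64> {
    let mut per_fragment = HashMap::new();
    for s in samples {
        *per_fragment.entry(s.fragment_id).or_insert(0) += s.in_record_cnt;
    }
    per_fragment
}

fn main() {
    let samples = vec![
        ActorSample { fragment_id: 1, actor_id: 10, in_record_cnt: 500 },
        ActorSample { fragment_id: 1, actor_id: 11, in_record_cnt: 700 },
        ActorSample { fragment_id: 2, actor_id: 20, in_record_cnt: 300 },
    ];
    // Two exported series (one per fragment) instead of three (one per actor),
    // at the cost of losing the per-actor breakdown.
    for (fragment_id, total) in aggregate_by_fragment(&samples) {
        println!("stream_actor_in_record_cnt{{fragment_id=\"{fragment_id}\"}} {total}");
    }
}
```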

arkbriar commented 6 months ago

A typical Prometheus server can handle on the order of 10 million metrics before you start to see limitations

I'm pretty sure that's not the case. Prometheus' implementation is notorious for its huge memory consumption; you can find a lot of criticism of it online.

arkbriar commented 6 months ago

That's true, but at the moment I think the root question is: why are there so many actors?

I'm sure the problem is that we are not supposed to record actor-level metrics with Prometheus, or any other kind of TSDB. Unlike AP systems, they are not built to handle high-cardinality data.

Quote from https://prometheus.io/docs/practices/naming/

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

xxchan commented 6 months ago

So do we have any idea of how high "high cardinality" is (is 1k or 10k acceptable?)? The examples given (user IDs, email addresses) are definitely high cardinality, but whether the number of actors qualifies seems debatable. 🤔

xxchan commented 6 months ago

But one user does suffer from Prometheus performance issues (Grafana slowness and missing metrics).

They have quite a lot of actors:

select worker_id, count(*) from rw_actors a, rw_parallel_units p where a.parallel_unit_id = p.id group by p.worker_id;

worker_id|count|
---------+-----+
    26003|15374|
    26002|11664|
    26004| 7920|
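
A rough back-of-the-envelope based on the numbers above (the ~15 actor-level metric families is an assumption taken from the compute-node dump at the top of this issue, not a measured figure):

```rust
fn main() {
    // Actor count for the busiest worker, from the query result above.
    let actors_on_worker: u64 = 15_374;
    // Rough number of actor-level metric families, assumed from the
    // compute-node dump earlier in this thread.
    let actor_level_families: u64 = 15;

    let series = actors_on_worker * actor_level_families;
    // Roughly 230,000 series from this one node for actor-level metrics
    // alone, before counting histogram buckets.
    println!("~{series} actor-level time series on worker 26003");
}
```
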
arkbriar commented 6 months ago

Is 1k or 10k acceptable?

Acceptable, as long as it doesn't change over time. That is to say, [0, 10000) forever is fine, but a dynamic range [x, x+10000) where x changes over time isn't.

Regarding cardinality: it refers to the number of distinct label values seen over a fairly long period, which is quite different from how other DB systems count it. For example, if actors receive new IDs whenever they are rescheduled, the set of actor_id label values keeps growing even though only about 10,000 actors exist at any one moment.

github-actions[bot] commented 4 months ago

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

fuyufjh commented 1 month ago

New progress: https://github.com/risingwavelabs/risingwave/issues/18108