Closed arkbriar closed 3 months ago
31072 actor_info
The number "31072" reminds me of the case where the longevity test runs a lot of actors in parallel. If I remember correctly, the total number of actors is exactly 31072. The workload of the longevity test might be a little extreme, but this is what it's designed to do.
Other metrics in the compute node (CN) are also large because of the big number of actors and tables.
For now, I don't have any better ideas to reduce the size. Recording metrics at actor level sounds totally reasonable to me.
Regarding actor_info in particular: this is a "dummy" metric that stores actor information as labels, with a value that is always 1. In the Prometheus data model, this seems to be the only way to expose a table. Without it, one would need to access the RisingWave psql endpoint, which might not always be available.
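The "info metric" pattern described above can be sketched in plain Python by rendering the Prometheus text exposition format directly. The label names here are illustrative; the real actor_info label set may differ.

```python
# Minimal sketch of an "info metric" in Prometheus exposition format:
# metadata is carried in labels and the sample value is fixed at 1,
# so dashboards can join other metrics against it.
def render_info_metric(name, rows):
    lines = [f"# TYPE {name} gauge"]
    for labels in rows:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} 1")
    return "\n".join(lines)

# Hypothetical actors; in RisingWave the labels come from the catalog.
actors = [
    {"actor_id": "1", "fragment_id": "10"},
    {"actor_id": "2", "fragment_id": "10"},
]
print(render_info_metric("actor_info", actors))
```

Note that each distinct label combination becomes its own time series, which is exactly why this metric's size scales with the actor count.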
Recording metrics at actor level sounds totally reasonable to me.
It is, until there are too many actors. The number is amplified when there is more than one node and the actors are scheduled across them.
Recording metrics at actor level sounds totally reasonable to me.
It is, until there are too many actors. The number is amplified when there is more than one node and the actors are scheduled across them.
Do we have any idea about when it will become a problem? e.g., what's the current pressure on our Prometheus?
From here I see:
A typical Prometheus server can handle on the order of 10 million metrics before you start to see limitations
Recording metrics at actor level sounds totally reasonable to me.
It is, until there are too many actors. The number is amplified when there is more than one node and the actors are scheduled across them.
That's true, but at the moment I think the root problem is becoming: why are there so many actors?
It's possible to aggregate the metrics by fragment before collecting, although I don't think it's best practice. By definition, an actor is the basic unit of execution; you can think of it as a worker thread in a normal application.
A typical Prometheus server can handle on the order of 10 million metrics before you start to see limitations
I'm pretty sure that's not the case. Prometheus's implementation is notorious for huge memory consumption; you can find a lot of criticism online.
That's true, but at the moment I think the root problem is becoming: why are there so many actors?
I'm sure the problem is that we are not expected to record actor-level metrics with Prometheus, or any other kind of TSDB. Unlike AP systems, they are not made for dealing with high-cardinality data.
Quote from https://prometheus.io/docs/practices/naming/
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
So do we have any idea of how high "high cardinality" is (is 1k or 10k acceptable?)? I'm thinking that the examples (such as user IDs and email addresses) are definitely high cardinality, but the number of actors is a little controversial. 🤔
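A back-of-the-envelope calculation helps put the numbers in perspective. Assuming each per-actor metric family produces one time series per actor (the per-actor metric count here is an assumption, not taken from the codebase):

```python
# Rough series-count estimate for actor-level metrics.
actors = 31_072            # actor count from the longevity test above
per_actor_metrics = 20     # assumed number of actor-level metric families
series = actors * per_actor_metrics
print(series)  # -> 621440
```

Even with these modest assumptions, actor-level metrics alone account for hundreds of thousands of series, a sizable fraction of what a single Prometheus server comfortably handles.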
But one user does suffer from performance issues with Prometheus (Grafana slowness and missing metrics).
They have quite a lot of actors:
select worker_id, count(*) from rw_actors a, rw_parallel_units p where a.parallel_unit_id = p.id group by p.worker_id;
worker_id|count|
---------+-----+
26003|15374|
26002|11664|
26004| 7920|
Is 1k or 10k acceptable?
Acceptable as long as it won't change over time. That is to say, [0, 10000) forever is alright, but a dynamic range [x, x+10000) where x changes over time isn't.
Regarding cardinality: it stands for the number of distinct label values over a considerably long period, which is quite different from cardinality in other DB systems.
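The difference between a stable and a dynamic label range can be made concrete: Prometheus keeps a series for every label combination seen within the retention window, so churned actor IDs accumulate. All the numbers here are illustrative assumptions.

```python
# Stable IDs: the set [0, 10000) is reused forever, so the series
# count stays flat regardless of how long Prometheus retains data.
retention_days = 15
actors_per_rescale = 10_000
stable = actors_per_rescale

# Dynamic IDs: assume each rescale assigns a fresh ID range, so every
# rescale within the retention window adds new series.
rescales_per_day = 2
churned = retention_days * rescales_per_day * actors_per_rescale

print(stable, churned)  # -> 10000 300000
```

This is why a bounded, stable label set is tolerable while an equally sized but churning one is not: the effective cardinality is the union over the retention window, not a point-in-time count.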
This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.
Describe the bug
As title. Top ones:
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
nightly-20240125-disable-embedded-tracing
Additional context
No response