Closed: kwannoel closed this 4 weeks ago
Currently, our metrics are managed by LabelGuardedMetric structs, each corresponding to one time series, i.e. a metric with a specific set of labels. This design allows the metrics to be correctly destructed when an MV is dropped.
To turn an actor-level metric into a fragment-level metric, we need to share an Arc<LabelGuardedMetric> among all the actors in a fragment. In particular, a lock such as Arc<Mutex<LabelGuardedMetric>> should be avoided, because the metric structures provided by the Prometheus client library are already thread-safe.
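A minimal sketch of that sharing pattern, with a plain prometheus::IntCounter standing in for LabelGuardedMetric and hypothetical FragmentMetrics / ActorMetrics wrappers:

```rust
use std::sync::Arc;

use prometheus::IntCounter;

// Sketch only: a plain `prometheus::IntCounter` stands in for `LabelGuardedMetric`,
// and `FragmentMetrics` / `ActorMetrics` are hypothetical names. No `Mutex` is
// needed because Prometheus metric types are internally synchronized (atomics),
// so sharing an `Arc` is enough.
struct FragmentMetrics {
    output_buffer_blocking_duration_ns: Arc<IntCounter>,
}

struct ActorMetrics {
    output_buffer_blocking_duration_ns: Arc<IntCounter>,
}

impl FragmentMetrics {
    // Every actor of the fragment clones the same `Arc`, so all of them
    // report into a single time series.
    fn for_actor(&self) -> ActorMetrics {
        ActorMetrics {
            output_buffer_blocking_duration_ns: self.output_buffer_blocking_duration_ns.clone(),
        }
    }
}

fn main() -> prometheus::Result<()> {
    let fragment = FragmentMetrics {
        output_buffer_blocking_duration_ns: Arc::new(IntCounter::new(
            "stream_actor_output_buffer_blocking_duration_ns",
            "blocking duration of the output buffer, summed over the fragment's actors",
        )?),
    };
    let (a0, a1) = (fragment.for_actor(), fragment.for_actor());
    a0.output_buffer_blocking_duration_ns.inc_by(100);
    a1.output_buffer_blocking_duration_ns.inc_by(200);
    assert_eq!(fragment.output_buffer_blocking_duration_ns.get(), 300);
    Ok(())
}
```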
Another thing worth mentioning is that avg in PromQL queries over this metric won't work as expected, because stream_actor_output_buffer_blocking_duration_ns will actually become a node-level sum if we simply use the Arc<LabelGuardedMetric> approach. To solve this, perhaps an AVG-specific structure needs to be introduced that calculates the average at collect() time.
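A rough sketch of what such an AVG-specific structure could look like; the names here are made up, and hooking it into the Prometheus Collector trait is omitted:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical AVG-specific metric: all actors of a fragment add into a shared
// sum and register themselves once, and the average is only computed when the
// exporter collects the metric, instead of exporting the node-level sum.
#[derive(Default)]
struct AvgOnCollect {
    sum_ns: AtomicU64,
    actor_count: AtomicU64,
}

impl AvgOnCollect {
    // Called once per actor of the fragment on this node.
    fn register_actor(&self) {
        self.actor_count.fetch_add(1, Ordering::Relaxed);
    }

    // Called by actors on the hot path, like a normal counter.
    fn add(&self, ns: u64) {
        self.sum_ns.fetch_add(ns, Ordering::Relaxed);
    }

    // Evaluated lazily at collect() time: the per-actor average on this node.
    fn collect(&self) -> f64 {
        let actors = self.actor_count.load(Ordering::Relaxed).max(1);
        self.sum_ns.load(Ordering::Relaxed) as f64 / actors as f64
    }
}
```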
Oh wait, I suddenly realized the approach above doesn't work as expected. For example, assume we have a fragment with parallelism 8:
fragment F:
  CN A: actor-0 actor-1 actor-2
  CN B: actor-3 actor-4 actor-5
  CN C: actor-6 actor-7
Here, the avg of these 3 node-level avgs is NOT the avg of the 8 actors, because the actors are not evenly distributed across nodes. The more uneven the distribution, the more severe the problem.
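A quick numeric illustration of the bias, with made-up blocking values for the layout above:

```rust
fn main() {
    // Hypothetical per-actor values (e.g. blocking duration) for fragment F.
    let cn_a = [10.0, 20.0, 30.0]; // actor-0..2 on CN A
    let cn_b = [10.0, 20.0, 30.0]; // actor-3..5 on CN B
    let cn_c = [100.0, 200.0];     // actor-6..7 on CN C

    let avg = |xs: &[f64]| xs.iter().sum::<f64>() / xs.len() as f64;

    // avg of the three node-level avgs: (20 + 20 + 150) / 3 ≈ 63.3
    let avg_of_node_avgs = (avg(&cn_a) + avg(&cn_b) + avg(&cn_c)) / 3.0;

    // true avg over all 8 actors: 420 / 8 = 52.5
    let all = [cn_a.as_slice(), cn_b.as_slice(), cn_c.as_slice()].concat();
    let true_avg = avg(&all);

    println!("avg of node-level avgs = {avg_of_node_avgs:.1}, true avg = {true_avg:.1}");
}
```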
I am now thinking about another approach: splitting the avg into sum / count. The sum can easily be aggregated locally by sharing a Counter. As for count, which corresponds to the parallelism of the fragment, we might need to introduce a new metric and reuse it across all these listed metrics.
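A sketch of this split, again with plain prometheus types in place of LabelGuardedMetric and hypothetical names; the final average is then left to the query side (roughly: sum of the sums divided by sum of the counts per fragment):

```rust
use std::sync::Arc;

use prometheus::{IntCounter, IntGauge, Registry};

// Hypothetical fragment-level metrics for one worker node: the shared sum is a
// Counter that every actor increments, and the actor count is a Gauge tracking
// how many actors of this fragment run on this node, so that the query side
// can compute the unbiased average as sum(sum) / sum(count).
struct FragmentMetrics {
    blocking_duration_ns_sum: Arc<IntCounter>,
    actor_count: Arc<IntGauge>,
}

impl FragmentMetrics {
    fn new(registry: &Registry) -> prometheus::Result<Self> {
        let sum = IntCounter::new(
            "stream_fragment_output_buffer_blocking_duration_ns",
            "blocking duration summed over the fragment's actors on this node",
        )?;
        let count = IntGauge::new(
            "stream_fragment_actor_count",
            "number of the fragment's actors running on this node",
        )?;
        // Prometheus metrics are cheap to clone and clones share their state.
        registry.register(Box::new(sum.clone()))?;
        registry.register(Box::new(count.clone()))?;
        Ok(Self {
            blocking_duration_ns_sum: Arc::new(sum),
            actor_count: Arc::new(count),
        })
    }

    // Called when an actor of this fragment is spawned on this node; the actor
    // keeps the returned handle and increments it directly.
    fn register_actor(&self) -> Arc<IntCounter> {
        self.actor_count.inc();
        self.blocking_duration_ns_sum.clone()
    }
}
```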
What about having these metrics aggregated at the fragment-per-worker level, rather than the fragment level? Is that easier?
How can you get other CNs' metrics in the current CN's code? RPC? 🤣
> we need to share an Arc<LabelGuardedMetric> among all the actors in a fragment.
Note that Arc only works within a single process. To avoid dropping the metric unexpectedly, I suppose we still need another worker_id label for those metrics to differentiate between different worker nodes. Am I understanding correctly?
Regarding the PromQL query, the difference is that we lose the information about how many actors (of the same fragment) there are on each worker node. To avoid biasing the final result, especially when there's any skew in the scheduling of the actors, we need to perform some manual weighted averaging.
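For instance, with a per-node actor count count_i and node-level average avg_i, the unbiased fragment-level average is Σ(count_i × avg_i) / Σ(count_i), which reduces to sum(sum_i) / sum(count_i) when the per-node sums are exported directly.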
I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really common practice to trade development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics over performance concerns.
> I suppose we still need another worker_id label for those metrics to differentiate between different worker nodes. Am I understanding correctly?
No need, because the collector will add another instance label to the metrics.
> I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really common practice to trade development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics over performance concerns.
Agree. The approach is dirty and ugly, but @arkbriar strongly pushed for it.
> No need, because the collector will add another instance label to the metrics.
Only in Cloud or k8s deployments, right?
> Only in Cloud or k8s deployments, right?
True, but I believe it's a common practice. Not all systems assign a node_id like RisingWave does, so I suppose they must need such an additional label.
> I feel like we are essentially doing some sort of pre-aggregation that Prometheus could do on its own. I'm just wondering whether it's really common practice to trade development complexity for performance, or whether we are being too arbitrary in rejecting high-cardinality metrics over performance concerns.
It is commonly acknowledged that high-cardinality metrics significantly impact the performance and availability of time series databases. If you think it's arbitrary, please provide a solution to keep the TSDB stable, as one of the running compute nodes is emitting a ridiculous 0.6M time series and I personally am not able to manage it.
We've had many discussions about this issue. Overall, I vote +1 for it, because our system can indeed support 10k-100k actors per node, while the monitoring stack, especially the time series database, can't hold so many metrics, or can only do so at a much higher cost than RW itself. 😂
Although it's not a perfect solution, I believe it's the best we can do under the circumstances. I'm also trying to keep the implementation as non-invasive as possible; https://github.com/risingwavelabs/risingwave/issues/18108#issuecomment-2303836241 tries to offer an approach that works whether the design is enabled or disabled.
Half a year ago, I raised my concerns and put the requirement aside. Related discussions can be found in the issue above. I vote +1 for it now, after serious consideration of the trade-offs.
> I suppose we still need another worker_id label for those metrics to differentiate between different worker nodes. Am I understanding correctly?
Update: this is automatically done by Prometheus: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
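(Concretely, unless the labels are already set, Prometheus attaches instance, defaulting to the <host>:<port> of the scraped target, and job, the configured job name, to every scraped series.)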
The metrics tracked here need to be fine-tuned, as they have too high a granularity. Either they need to be changed from actor level to fragment level, or a fallback metric at the MV level has to be provided, so that even if actor/fragment-level metrics are too many, we can still have MV-level observability.