risingwavelabs / risingwave


tracking: refactor metrics with `LabelGuarded` #14838

Open fuyufjh opened 8 months ago

fuyufjh commented 8 months ago

Background

LabelGuardedMetricVec was introduced in #13080. It extends MetricVec to ensure that a set of labels is correctly removed from the Prometheus client once it is dropped. This is useful for metrics associated with an object that can be dropped, such as streaming jobs, fragments, actors, batch tasks, etc.

When a set of labels is dropped, it is recorded in the uncollected_removed_labels set. Once the metrics have been collected, the metrics for those labels are finally removed.
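
The drop-then-collect behavior described above can be sketched with the standard library alone. This is a simplified model and not the actual LabelGuardedMetricVec code: the names GuardedVec, GuardedCounter, with_guarded_label_values and inc_by are hypothetical, only uncollected_removed_labels comes from the description, and the sketch assumes at most one live guard per label set.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};

type Labels = Vec<String>;

#[derive(Default)]
struct Inner {
    // Current value of each labeled series.
    values: HashMap<Labels, i64>,
    // Label sets whose guard was dropped but which have not been collected yet.
    uncollected_removed_labels: HashSet<Labels>,
}

// Hypothetical stand-in for a label-guarded metric vector.
#[derive(Clone, Default)]
struct GuardedVec {
    inner: Arc<Mutex<Inner>>,
}

// Hypothetical guard: dropping it schedules its label set for removal.
struct GuardedCounter {
    labels: Labels,
    inner: Arc<Mutex<Inner>>,
}

impl GuardedVec {
    fn with_guarded_label_values(&self, labels: &[&str]) -> GuardedCounter {
        let labels: Labels = labels.iter().map(|s| s.to_string()).collect();
        let mut inner = self.inner.lock().unwrap();
        inner.values.entry(labels.clone()).or_insert(0);
        // Re-creating a label set cancels any pending removal.
        inner.uncollected_removed_labels.remove(&labels);
        GuardedCounter { labels, inner: Arc::clone(&self.inner) }
    }

    // Snapshot all series, then purge the label sets dropped before this call.
    // A dropped series is therefore reported one last time and gone afterwards.
    fn collect(&self) -> Vec<(Labels, i64)> {
        let mut inner = self.inner.lock().unwrap();
        let mut snapshot: Vec<(Labels, i64)> =
            inner.values.iter().map(|(k, v)| (k.clone(), *v)).collect();
        snapshot.sort();
        let removed: Vec<Labels> = inner.uncollected_removed_labels.drain().collect();
        for labels in removed {
            inner.values.remove(&labels);
        }
        snapshot
    }
}

impl GuardedCounter {
    fn inc_by(&self, n: i64) {
        let mut inner = self.inner.lock().unwrap();
        if let Some(v) = inner.values.get_mut(&self.labels) {
            *v += n;
        }
    }
}

impl Drop for GuardedCounter {
    fn drop(&mut self) {
        // Do not remove immediately: only record the labels, so that the
        // final value is still visible to the next collection.
        let mut inner = self.inner.lock().unwrap();
        inner.uncollected_removed_labels.insert(self.labels.clone());
    }
}
```

In this model, a counter that is incremented and then dropped still appears in the next collect() call and is absent from every call after that. A plain vector without the guard would keep reporting the last value indefinitely.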

To-dos

Technically, all usages of a plain MetricVec for a droppable object (streaming jobs, fragments, actors, batch tasks, etc.) need to be replaced with LabelGuardedMetricVec.

BugenZhao commented 7 months ago

Could be related: #13086

fuyufjh commented 7 months ago

related https://github.com/risingwavelabs/risingwave/issues/14821

xxchan commented 7 months ago

So currently, when a streaming job is dropped, its metrics are leaked (i.e., Prometheus keeps collecting useless data, which is always zero-valued), right?

fuyufjh commented 7 months ago

> So currently, when a streaming job is dropped, its metrics are leaked (i.e., Prometheus keeps collecting useless data, which is always zero-valued), right?

True. Some of them have been fixed (for example, see StreamingMetrics). Anyone taking this issue, please help check whether the remaining usages are correct.

lmatz commented 7 months ago

> which is always zero-valued

I have observed some non-zero constant values in Grafana, although I am not sure whether they have the same root cause.

xxchan commented 7 months ago

Yes, should be constant. Not necessarily zero.

You can verify this by checking localhost:1222, e.g., stream_mview_input_row_count for a dropped actor.
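
As a rough aid for that check, the sketch below filters Prometheus text-format output for the series of one metric, so that leftover series of a dropped actor stand out. It assumes the text has already been fetched from the endpoint; series_for_metric is a hypothetical helper, and the actor_id label in the usage note is an illustrative assumption, not taken from this thread.

```rust
/// Return the sample lines that belong exactly to `metric`,
/// skipping comments (`# HELP`, `# TYPE`) and metrics that merely
/// share the same prefix (e.g. `<metric>_total`).
fn series_for_metric<'a>(exposition: &'a str, metric: &str) -> Vec<&'a str> {
    exposition
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter(|line| {
            // After the metric name, a sample line continues with either
            // a label block `{...}` or a space followed by the value.
            line.strip_prefix(metric)
                .map_or(false, |rest| rest.starts_with('{') || rest.starts_with(' '))
        })
        .collect()
}
```

Calling series_for_metric(text, "stream_mview_input_row_count") on two consecutive scrapes and comparing the lines for a dropped actor (for instance, those containing a hypothetical actor_id="..." label) would show whether a series is still exported with a constant value.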