metrics-rs / metrics

A metrics ecosystem for Rust.
MIT License

Different metrics lifetime and garbage collection (forgetting outdated telemetry) #495

Open 3g0r opened 1 week ago

3g0r commented 1 week ago

Hi, in my case I have many spawned tokio tasks that need to be measured. The measurements for these tasks are made unique by their labels, and once a task has finished I have to remove its measurements from the metrics registry to prevent memory leaks. At the same time, I need to keep the COUNT_OF_ACTIVE_TASKS metric available for as long as my program runs.

Right now I can't find any way to solve this problem using the current API.

builder.idle_timeout looks good, but I have no guarantees about the interval for spawning new tasks, hence COUNT_OF_ACTIVE_TASKS could be deleted at any time and its state forgotten.

Can anyone tell me how to solve this problem without writing an absolute value to COUNT_OF_ACTIVE_TASKS on timeout in an infinite loop? 😂

3g0r commented 1 week ago

I was thinking about collecting metrics from TCP connections - we have no guarantees about the intervals between packets in general. For example, if we collect the number of bits sent, but there are no ping messages in the protocol between the client and server, we run the risk of forgetting the state of the metrics if the client and server are silent for a long time.

So I think we really need to extend the API. For example, add a ::mark_as_outdated() method to counter/histogram/gauge, or maybe extend the recorder API with ::remove_<metric kind>(), or give direct access to the registry.
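
To make the proposal above more concrete, here is a minimal std-only sketch of what caller-controlled removal could look like. It does not use the metrics crate at all: `Key`, `Registry`, `increment`, `get` and `remove` are hypothetical names invented for illustration, not part of any existing API.

```rust
use std::collections::HashMap;

/// Hypothetical metric identity: a name plus its label pairs, sorted so that
/// label order does not matter.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Key {
    name: String,
    labels: Vec<(String, String)>,
}

impl Key {
    fn new(name: &str, labels: &[(&str, &str)]) -> Self {
        let mut labels: Vec<(String, String)> = labels
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        labels.sort();
        Key { name: name.to_string(), labels }
    }
}

/// Hypothetical registry that leaves removal entirely to the caller.
#[derive(Default)]
struct Registry {
    counters: HashMap<Key, u64>,
}

impl Registry {
    /// Add `value` to the counter, creating it at zero if it is new.
    fn increment(&mut self, key: &Key, value: u64) {
        *self.counters.entry(key.clone()).or_insert(0) += value;
    }

    /// Read the current value, if the counter exists.
    fn get(&self, key: &Key) -> Option<u64> {
        self.counters.get(key).copied()
    }

    /// Forget the counter entirely and return its last value, if any.
    fn remove(&mut self, key: &Key) -> Option<u64> {
        self.counters.remove(key)
    }
}
```

With something like this, a task would call `remove` on its own per-task keys when it finishes, while a long-lived key such as the active-task count is simply never removed, so its state cannot be lost to a timeout.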

tobz commented 1 week ago

Yeah, in general, there's no good ergonomic way to let callers (the parts of the code actually emitting the metrics) control when those metrics go away.

This will likely need to be solved through whatever we do to fix #314, since fixing that allows for a better separation between "this metric is no longer live at all" and "this metric hasn't been updated in a while and I want to stop showing it".

3g0r commented 1 week ago

"this metric hasn't been updated in a while and I want to stop showing it".

Do we really need this feature?

I think that if we suppress some measurements, Prometheus can't collect them, and Grafana will render gaps for the time slots in which our app suppressed those measurements.

Maybe we can make it simpler by just delegating the responsibility for deleting metrics to users?

At least in other programming languages, I have been happy with such an API so far.

tobz commented 1 week ago

We're not going to be changing the core Recorder API to allow for arbitrarily marking a metric as done/outdated/expired.

As far as wanting to stop showing idle metrics: it's absolutely a thing people want/request. It's very useful to avoid removing a metric as soon as it's no longer used, but instead only after a long enough period of inactivity, in order to avoid sparse reporting.
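
As a rough illustration of the idle-based behavior described above, the following std-only sketch drops a metric only after it has gone un-updated for longer than a timeout. It is not how the exporter is implemented: `IdleRegistry`, `increment` and `render` are hypothetical names, and the current time is passed in explicitly so the behavior is easy to follow.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical store: each counter remembers when it was last updated.
struct IdleRegistry {
    idle_timeout: Duration,
    counters: HashMap<String, (u64, Instant)>,
}

impl IdleRegistry {
    fn new(idle_timeout: Duration) -> Self {
        IdleRegistry { idle_timeout, counters: HashMap::new() }
    }

    /// Add to a counter and refresh its last-update time.
    fn increment(&mut self, name: &str, value: u64, now: Instant) {
        let entry = self.counters.entry(name.to_string()).or_insert((0, now));
        entry.0 += value;
        entry.1 = now;
    }

    /// Drop counters that have been idle for longer than the timeout, then
    /// return the survivors sorted by name.
    fn render(&mut self, now: Instant) -> Vec<(String, u64)> {
        let timeout = self.idle_timeout;
        self.counters
            .retain(|_, (_, last)| now.duration_since(*last) <= timeout);
        let mut out: Vec<(String, u64)> = self
            .counters
            .iter()
            .map(|(name, (value, _))| (name.clone(), *value))
            .collect();
        out.sort();
        out
    }
}
```

A metric that is updated every few seconds survives every `render` call, while one that falls silent disappears after the timeout. The same rule also drops a long-lived metric that merely happens to be quiet, which is the concern raised at the top of this issue.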