prometheus / prometheus

The Prometheus monitoring system and time series database.
https://prometheus.io/
Apache License 2.0

Stop removing __name__ from results: prevent `vector cannot contain metrics with the same labelset` #11397

Open samjewell opened 1 year ago

samjewell commented 1 year ago

This issue was previously named as:

Identify highest DPM series via PromQL

Problem / Use-Case

As an Admin/Manager/Operator of a Prometheus instance, one use-case is to understand, and then manage down, the usage on the instance and the cost of running it. Sometimes there are a few metrics with very high DPM (datapoints per minute per series, i.e. a high scrape frequency / small scrape interval), which increase "usage" on the instance and increase costs. For a multi-tenant project which implements PromQL (such as Mimir) this appears as a DPM overage and an increase in the dollar cost for that tenant. The Admin then needs to identify which metrics are responsible for the high DPM, i.e. which are the worst offenders. I've tried to find these with the following query:

topk(10, count_over_time({__name__!=""}[1m]))

Unfortunately this fails with:

execution: vector cannot contain metrics with the same labelset

Boiling this example down a bit, the error is already present with the query count_over_time({__name__!=""}[1m]), and it occurs even on my small instance, which has only 4,200 metrics. Searching online suggests that a typical workaround for this error is to use label_replace to move the original metric name into a temporary label. Incorporating this fix, I arrive at this query:

topk(10, count_over_time(label_replace({__name__!=""},"name_label","$1","__name__", "(.+)")[1m:]))

But I've now introduced another problem. (Can you spot it?) I had to switch from [1m] to [1m:], i.e. convert to a subquery, to avoid parse error: ranges only allowed for vector selectors. But by switching to a subquery, count_over_time no longer counts the DPM of the underlying metrics/series; instead it counts the subquery evaluations during the interval in question!
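
To illustrate the pitfall (a sketch; the explicit 15s resolution below is only there to make the behaviour obvious), the subquery form counts the subquery's evaluation steps rather than the samples stored for each series:

count_over_time(label_replace({__name__!=""},"name_label","$1","__name__", "(.+)")[1m:15s])

This returns roughly 4 for every series, regardless of how often each series is actually scraped, whereas the original (failing) count_over_time({__name__!=""}[1m]) would have counted the real samples per series.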

So my question is this: how can I identify the series that have the highest DPM via PromQL, so that I can effectively manage cost and usage on my instance? (Or, if not via PromQL, how would we change Prometheus to better support this use-case, and how would operators of multi-tenant Prometheus services such as Mimir help their tenants achieve the same?)

Can we make changes to Prometheus to better support this use-case?

Note: I'm aware that the query I'm trying to run can be very computationally expensive, and may fail if the result-set is too big. But I'm not aware of alternatives for the case I'm describing.

Proposal / Ideas

I believe vector cannot contain metrics with the same labelset happens because Prometheus removes the metric name when performing count_over_time. It does this for any function which changes the 'dimensions' of the metric. This is an opinionated decision taken by the project in the past; I think the reasoning is that 'no metric name is better than the wrong metric name', and metric names typically do encode dimensions (such as …_seconds). But personally (as an instance admin, and as someone who builds features for the instance-admin persona) I've hit this error message a few times and found it a barrier to learning and using Prometheus.

Some ideas then, to start some discussion:

beorn7 commented 1 year ago

Thanks, @samjewell. As you might have guessed from our personal communication, this touches two areas I have been thinking a bit about in the past, but never refined my thoughts enough to propose them in a design doc or to the dev summit. Let's take your issue as an opportunity to lay them out in their current rough form:

The first is essentially what you have proposed as your first idea above: don't drop the name, but prefix or suffix it somehow. My thought was to use a prefix that expresses the operation, and to also prefix the prefix with a double underscore to mark its internal nature. In your example, the resulting name would come out as something like __count_over_time:the_original_metricname.
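
A purely illustrative sketch of how that could look (the prefixed name is hypothetical, not an existing feature), using an arbitrary metric http_requests_total: under this idea,

count_over_time(http_requests_total{job="api"}[1m])

would return a series named __count_over_time:http_requests_total{job="api"} rather than a nameless one, so a query like the topk above could still tell the underlying metrics apart.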

Playing devil's advocate, here are concerns (with counter arguments in parentheses):

The second thought is about using the timestamp of the evaluation even when the operation is essentially "in place", only affecting a single sample; this relates to your second idea. It's kind of funny how timestamp(prometheus_build_info) gives you actual timestamps from the TSDB, but timestamp(+prometheus_build_info) gives you the evaluation timestamp. Maybe we would only really need to pick the evaluation timestamp when we are actually aggregating samples with different timestamps. This would be a huge change to the evaluation model and certainly only something to consider for Prometheus v3. I would also not really consider it a good solution to your problem here, but it has come up quite often in different contexts, e.g. timestamp(last_over_time(prometheus_build_info[10m])), or generally to enable range selectors on this whole class of "non-aggregating operations", avoiding subqueries there.
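
To spell out that contrast (both forms are valid PromQL today):

timestamp(prometheus_build_info)    # timestamp of the underlying sample in the TSDB
timestamp(+prometheus_build_info)   # timestamp of the query evaluation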

About the ViMe (VictoriaMetrics) option for that: adding a separate option, and thus mental overhead for everyone reading the docs, for such a niche use case would make me really sad. If that first idea of yours (and mine) flew, it would be almost invisible to the naive user. Things would "just work".

beorn7 commented 1 year ago

Inspired by this issue, I had added an agenda item for the dev-summit many months ago. However, it has not been discussed so far, as other agenda items were deemed more pressing.

This makes it hard to go forward with any of the more invasive approaches. However, as already discussed in the dev-summit agenda item (search for "Revisit metric name removal during PromQL evaluation" in the dev summit document, sorry, no deep linking possible), there are less invasive approaches, and I think we can go forward with one of them in the meantime, to at least have some way of addressing the problem at hand.

The idea I propose here is to only drop the metric name as the very last step, before the result is returned to the outside world. This would not change the outcome of any working query, but it would allow applying the label_replace trick later. @samjewell's problem could be solved by writing:

topk(10, label_replace(count_over_time({__name__!=""}[1m]), "name_label", "$1", "__name__", "(.+)"))

And in cases where the ambiguity gets aggregated away anyway, formerly broken queries would simply start to work, e.g. to answer the question of how many metrics have had more than 10 samples in the last minute:

count(count_over_time({__name__!=""}[1m]) > 10)

The only trick is how to track that a metric name has to be removed before returning the result. But that's an implementation detail we should be able to solve.

samjewell commented 1 year ago

The Admin then needs to identify which metrics are responsible for the high DPM; which are the worst offenders. I've tried to find these with the following query: topk(10, count_over_time({__name__!=""}[1m]))

I've since found a better way to identify the sources of high resolution metrics (high datapoints per minute per series):

Query with sort_desc(count_over_time(scrape_samples_scraped[1m])) to get the data points per minute per series by target. This works a treat in most cases.

There's an exception for metrics which weren't created from a scrape at all, for example those created by recording rules. For those we'd still benefit from the improvement requested here.

beorn7 commented 1 year ago

I would like to note that the concrete use case that served as an example here is by far not the only one where users run into the dreaded vector cannot contain metrics with the same labelset error. I would still like to see a solution around the "drop names as the last step" idea. (And in Prom 3.x or something, I would like to see a solution around (not) changing the timestamps to the evaluation time.)

valyala commented 1 year ago

The idea I propose here is to only drop the metric name as the very last step, before the result is returned to the outside world.

This doesn't resolve the case when metric names should be returned in the response. For example, the following query could be used for returning resident and virtual memory usage:

max_over_time({__name__=~"process_(resident|virtual)_memory_bytes"}[5m])

Unfortunately, it doesn't work in Prometheus now and will not work after the proposed idea is implemented. The keep_metric_names modifier from VictoriaMetrics resolves this issue elegantly. Note that the keep_metric_names modifier doesn't complicate queries by default; users may add it if they really want to keep metric names in the results of some PromQL or MetricsQL function.
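
For reference, this is (to my understanding of the MetricsQL documentation) how the modifier is applied; it is VictoriaMetrics-specific and not valid Prometheus PromQL:

max_over_time({__name__=~"process_(resident|virtual)_memory_bytes"}[5m]) keep_metric_names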

beorn7 commented 1 year ago

Unfortunately, it doesn't work in Prometheus now and will not work after the proposed idea is implemented.

If the metric name is dropped as the last thing, you can still use label_replace to move the __name__ label into a normal label.
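
Applied to the memory example above, that would look something like this (a sketch that relies on the proposed late removal; today the inner query already fails with the labelset error before label_replace runs, as noted above):

label_replace(max_over_time({__name__=~"process_(resident|virtual)_memory_bytes"}[5m]), "name_label", "$1", "__name__", "(.+)")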

The keep_metric_names modifier from VictoriaMetrics elegantly resolves this issue.

I don't think it's a good idea. From the dev-summit notes: "Firstly, it exposes everybody to yet another knob they have to think about, while being only relevant for very few (the use case is a niche in a niche). Secondly, it ignores the problem that motivated the name removal in the first place (you are now creating misleading metric names)."

beorn7 commented 1 year ago

We discussed this topic during the last dev-summit. I'll try to summarize the results here:

About allowing label_replace to work on range vectors

This would solve most use cases in a seemingly easy way, but on closer look it comes with a lot of implications:

  1. As already discussed, it would require overloaded parameter types (label_replace would have to work on both instant vectors and range vectors).
  2. Worse: It would also require variable return types (label_replace acting on an instant vector returns an instant vector, a future label_replace acting on a range vector would need to return a range vector).
  3. And finally, while currently an instant vector returned by a function has the evaluation time as its timestamp, the returned range vector would have to retain all its timestamps.

Each of the above is something that we can very well discuss, and it might become a new feature in Prometheus 3.x, maybe being beneficial in many different ways, not just for this one feature request. However, to just solve the feature request at hand, it feels like a PromQL change that is far too fundamental and invasive.

A more moderate variant would be to create a new function label_replace_range_vector. A function returning a range vector with unmodified timestamps would still be a new concept, but it would be a much less invasive change to PromQL. However, this would add a new function for a use case that is in a niche of a niche, while adding mental overhead for everyone (it adds to the documentation and leaves many with the question "What is this function even for?"). Same concern as for the ViMe approach above.
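
Purely for illustration (label_replace_range_vector does not exist; its name and semantics here are hypothetical), the original query from this issue could then be written without a subquery:

topk(10, count_over_time(label_replace_range_vector({__name__!=""}[1m], "name_label", "$1", "__name__", "(.+)")))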

About removing the __name__ label as the last step

Unfortunately, the discussion exposed that this would not be a transparent change. Certain "weird" queries would behave differently, so it would be technically a breaking change. Take the following query as an example:

sum by (__name__) (rate({foo="bar"}[5m]))

It is currently equivalent to sum(rate({foo="bar"}[5m])), while with the proposed change it would (ironically) produce the vector cannot contain metrics with the same labelset error message. (The sum would be partitioned by the not-yet-removed __name__ label, which would then be removed, creating the labelset collision.) One might argue that the query above is nonsensical, and a human would probably never type it, but queries are often created via templating, so the no-op addition of by (__name__) can actually happen in practice, and we really cannot break those queries. (Especially autogenerated ones will wreak havoc in subtle ways.)

More generally, every query dealing with __name__ explicitly has the potential of behaving differently.

In that same line of thought, it should be noted that a __name__ label marked for removal has to be un-marked if it is explicitly overridden later via label_replace. This also allows creating a new meaningful name, e.g.

label_replace(rate({foo="bar"}[5m]), "__name__", "rate5m_of_$1", "__name__", "(.*)")

Overall, I don't think the caveats found kill the approach, but it has to stay behind a feature flag until the next major release after all, which reduces its elegance quite a bit. (We would have initially put it behind a feature flag anyway just to be cautious, but we could have removed the flag eventually if the change had been non-breaking.)

Pending any other new ideas or further evidence of implications, I would propose to give the "late __name__ removal" approach a try, even though it's now a bit more complicated than it appeared initially.

matthiasr commented 1 year ago

This may be getting out of hand in complexity, but could we detect the sum by (__name__) stacking in the AST and rewrite the query to drop the __name__ aggregation? Would that always be compatible? Then we could drop this compatibility shim in 3.0 without having to gate all the good effects of this change.
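
A sketch of what such a (hypothetical) rewrite would do, using the example from the previous comment:

sum by (__name__) (rate({foo="bar"}[5m]))   # as written: would newly collide under late name removal
sum(rate({foo="bar"}[5m]))                  # after dropping the no-op __name__ grouping; equivalent today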

09jvilla commented 8 months ago

+1 to this issue. I ran into a similar problem where I was trying to count samples ingested broken down by a label key of interest. Specifically, I have a label aws_account which gets applied to all my time series and helps me determine which time series are being scraped from which aws_account. I then want to count the rate of samples being sent from each account so I can detect if any single account suddenly starts massively increasing the sample rate it sends (e.g., maybe someone misconfigured something).

I was thinking I could do something like sum by (aws_account) (count_over_time({__name__=~".+"}[1m])) but then I ran into the same execution: vector cannot contain metrics with the same labelset issue as above.
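
As a stopgap (a sketch building on the scrape_samples_scraped trick mentioned earlier in this thread, and assuming aws_account is a target-level label so it is present on that metric), the per-account ingest rate from scrapes can be approximated without touching __name__:

sum by (aws_account) (sum_over_time(scrape_samples_scraped[1m]))

This counts samples ingested per account per minute from scrape targets only; series produced by recording rules are not covered, which is the same caveat noted above.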

beorn7 commented 8 months ago

With v3.0 projected for 2024, the above annoyance about requiring a feature flag before 3.0 has become less bad, because 3.0 is going to happen quite soon rather than "at some point in the distant future". So it would indeed be good to tackle this long enough before 3.0 to get an idea if it works as expected.

beorn7 commented 4 months ago

I added this to the Prometheus 3 project, as the mild breakage caused by the proposed solution might be nicely dealt with by the major version bump.

beorn7 commented 4 months ago

@suntala is currently working on this.

suntala commented 4 months ago

@beorn7 pinging you to get assigned to the issue.

suntala commented 3 months ago

Something came up and unfortunately I won't be able to work on this anytime soon. Unassigning myself.

Here's the draft I'd been working on in case it's of interest to anyone else: https://github.com/suntala/prometheus/pull/1.

jcreixell commented 2 weeks ago

@beorn7 feel free to assign this to me, I will give it a try

jcreixell commented 1 week ago

I have two candidate implementations that I would love to get feedback on before going any further.

Both use the approach suggested by @beorn7 (mark for delete + delayed deletion), but have different tradeoffs.

https://github.com/jcreixell/prometheus/pull/1 uses a special label (__deleted__name__) to automatically propagate deletion information. This is the simplest implementation, but it isn't 100% transparent to the user (if, for example, a user sets a label with key __deleted__name__, the behavior would be undefined). This may be an edge case, but worth considering.

https://github.com/jcreixell/prometheus/pull/2 extends the Series and Sample structs with a boolean flag to track deletion information. While fully transparent to the user, it requires manual propagation of deletion information and introduces higher complexity.

Any feedback more than welcome :bow:

krajorama commented 1 week ago

I think people struggle with PromQL already, so anything you can do to make it intuitive (or keep it from becoming more complicated) is a good idea; I'd therefore develop number 2. But as was noted in our sync meeting, keep the flag at the series level to avoid overhead.