samjewell opened this issue 1 year ago
Thanks, @samjewell. As you might have guessed from our personal communication, this touches two areas I have been thinking a bit about in the past, but never refined my thoughts enough to propose them in a design doc or to the dev summit. Let's take your issue as an opportunity to lay them out in their current rough form:
The first is essentially what you have proposed as your first idea above: don't drop the name, but prefix or suffix it somehow. My thought was to use a prefix that expresses the operation, and to additionally prefix that with a double underscore to mark its internal nature. In your example, the prefix would come out as something like __count_over_time:the_original_metricname.
Playing devil's advocate, one concern (among others) is that operations would stack up into awkward names, e.g. 2 * prometheus_build_info coming out as
__2_times:prometheus_build_info
The second thought is about using the timestamp of the evaluation even when the operation is essentially "in place", affecting only a single sample (related to your second idea). It's kind of funny how timestamp(prometheus_build_info) gives you actual timestamps from the TSDB, while timestamp(+prometheus_build_info) gives you the evaluation timestamp. Maybe we would only really need to pick the evaluation timestamp when we are actually aggregating samples with different timestamps. This would be a huge change to the evaluation model and certainly only something to consider for Prometheus v3. I would also not really consider it a good solution to your problem here, but it has come up quite often in different contexts, e.g. timestamp(last_over_time(prometheus_build_info[10m])), or more generally to enable range selectors on this whole class of "non-aggregating operations", avoiding subqueries there.
About the VictoriaMetrics option for that: Adding a separate option, and thus mental overhead for everyone reading the docs, for such a niche use case would make me really sad. If that first idea of yours (and mine) flew, it would be almost invisible to the naive user. Things would "just work".
Inspired by this issue, I had added an agenda item for the dev-summit many months ago. However, it wasn't discussed so far as other agenda items were deemed more pressing.
This makes it hard to go forward with any of the more invasive approaches. However, as already discussed in the dev-summit agenda item (search for "Revisit metric name removal during PromQL evaluation" in the dev summit document, sorry, no deep linking possible), there are less invasive approaches, and I think we can go forward with one of them in the meantime, to at least have some way of addressing the problem at hand.
The idea I propose here is to only drop the metric name as the very last step, before the result is returned to the outside world. This would not change the outcome of any working query, but it would allow applying the label_replace trick afterwards. @samjewell's problem could be solved by writing:
topk(10, label_replace(count_over_time({__name__!=""}[1m]), "name_label", "$1", "__name__", "(.+)"))
And in cases where the ambiguity gets aggregated away anyway, formerly broken queries would simply start to work. E.g., to answer the question of how many metrics have had more than 10 samples in the last minute:
count(count_over_time({__name__!=""}[1m]) > 10)
The only trick is how to track that a metric name has to be removed before returning the result. But that's an implementation detail we should be able to solve.
The Admin then needs to identify which metrics are responsible for the high DPM; which are the worst offenders. I've tried to find these with the following query:
topk(10, count_over_time({__name__!=""}[1m]))
I've since found a better way to identify the sources of high resolution metrics (high datapoints per minute per series):
Query with sort_desc(count_over_time(scrape_samples_scraped[1m])) to get the data points per minute per target. This works a treat in most cases.
There's an exception for metrics which weren't created from a scrape at all, for example those created by recording rules. For those we'd still benefit from the improvement requested here.
I would like to note that the concrete use case that served as an example here is far from the only one where users run into the dreaded vector cannot contain metrics with the same labelset error. I would still like to see a solution around the "drop names as the last step" idea. (And in Prom 3.x or something, I would like to see a solution around (not) changing the timestamps to the evaluation time.)
The idea I propose here is to only drop the metric name as the very last step, before the result is returned to the outside world.
This doesn't resolve the case when metric names should be returned in the response. For example, the following query could be used for returning resident and virtual memory usage:
max_over_time({__name__=~"process_(resident|virtual)_memory_bytes"}[5m])
Unfortunately, it doesn't work in Prometheus now and will not work after the proposed idea is implemented. The keep_metric_names modifier from VictoriaMetrics elegantly resolves this issue. Note that the keep_metric_names modifier doesn't complicate queries by default: users may add it if they really want to keep metric names in the results of some PromQL or MetricsQL function.
Unfortunately, it doesn't work in Prometheus now and will not work after the proposed idea is implemented.
If the metric name is dropped as the last thing, you can still use label_replace to move the __name__ label into a normal label.
The keep_metric_names modifier from VictoriaMetrics elegantly resolves this issue.
I don't think it's a good idea. From the dev-summit notes: "Firstly, it exposes everybody to yet another knob they have to think about, while being only relevant for very few (the use case is a niche in a niche). Secondly, it ignores the problem that motivated the name removal in the first place (you are now creating misleading metric names)."
We discussed this topic during the last dev-summit. I try to summarize the results here:
Enable label_replace to work on range vectors
This would solve most use cases in a seemingly easy way, but at a closer look, it comes with a lot of implications:
- Functions would become polymorphic in their argument types (label_replace would have to work on both instant vectors and range vectors).
- Functions would become polymorphic in their return types (label_replace acting on an instant vector returns an instant vector; a future label_replace acting on a range vector would need to return a range vector).
Each of the above is something that we can very well discuss, and it might become a new feature in Prometheus 3.x, maybe being beneficial in many different ways, not just for this one feature request. However, to just solve the feature request at hand, it feels like a PromQL change way too fundamental and invasive.
A more moderate variant would be to create a new function label_replace_range_vector. A function returning a range vector with unmodified timestamps would still be a new concept, but it would be a much less invasive change to PromQL. However, this would add a new function for a use case that is in a niche of a niche, while adding mental overhead for everyone (it adds to the documentation, leaving many with the question "What is this function even for?"). Same concern as for the VictoriaMetrics approach above.
Drop the __name__ label as the last step
Unfortunately, the discussion exposed that this would not be a transparent change. Certain "weird" queries would behave differently, so it would technically be a breaking change. Take the following query as an example:
sum by (__name__) (rate({foo="bar"}[5m]))
It is currently equivalent to sum(rate({foo="bar"}[5m])), while it would (ironically) trigger the vector cannot contain metrics with the same labelset error with the proposed change. (The sum would be partitioned by the not-yet-removed __name__ label, which would then be removed, creating the labelset collision.) One might argue the query above is nonsensical, and while a human would probably never type such a query, queries are often created via templating, so the no-op addition of by (__name__) can actually happen in practice, and we really cannot break those queries. (Especially autogenerated ones would wreak havoc in subtle ways.)
More generally, every query dealing with __name__ explicitly has the potential of behaving differently.
In that same line of thought, it should be noted that a __name__ label marked for removal has to be un-marked if it is explicitly overridden later via label_replace. This also allows creating a new meaningful name, e.g.
label_replace(rate({foo="bar"}[5m]), "__name__", "rate5m_of_$1", "__name__", "(.*)")
Overall, I don't think the caveats found kill the approach, but it has to stay behind a feature flag until the next major release after all, which reduces its elegance quite a bit. (We would have initially put it behind a feature flag anyway just to be cautious, but we could have removed the flag eventually if the change had been non-breaking.)
Pending any other new ideas or further evidence of implications, I would propose giving the "late __name__ removal" approach a try, even though it's now a bit more complicated than it appeared initially.
This may be getting out of hand in complexity, but could we detect the by (__name__) grouping stacked on a __name__-dropping expression in the AST, and rewrite the query to drop the __name__ aggregation? Would that always be compatible? Then we could drop this compatibility shim in 3.0 without having to gate all the good effects of this change.
+1 to this issue. I ran into a similar problem where I was trying to count samples ingested broken down by a label key of interest. Specifically, I have a label aws_account
which gets applied to all my time series and helps me determine which time series are being scraped from which aws_account
. I then want to count the rate of samples being sent from each account so I can detect if any single account suddenly starts massively increasing the sample rate it sends (e.g., maybe someone misconfigured something).
I was thinking I could do something like sum by(aws_account) (count_over_time({__name__=~".+"}[1m])), but then I ran into the same execution: vector cannot contain metrics with the same labelset issue as above.
With v3.0 projected for 2024, the above annoyance about requiring a feature flag before 3.0 has become less bad, because 3.0 is going to happen quite soon rather than "at some point in the distant future". So it would indeed be good to tackle this long enough before 3.0 to get an idea if it works as expected.
I added this to the Prometheus 3 project, as the mild breakage caused by the proposed solution might be nicely dealt with by the major version bump.
@suntala is currently working on this.
@beorn7 pinging you to get assigned to the issue.
Something came up and unfortunately I won't be able to work on this anytime soon. Unassigning myself.
Here's the draft I'd been working on in case it's of interest to anyone else: https://github.com/suntala/prometheus/pull/1.
@beorn7 feel free to assign this to me, I will give it a try
I have two candidate implementations that I would love to get feedback on before going any further.
Both use the approach suggested by @beorn7 (mark for delete + delayed deletion), but have different tradeoffs.
https://github.com/jcreixell/prometheus/pull/1 uses a special label (__deleted__name__) to automatically propagate deletion information. This is the simplest implementation, but it isn't 100% transparent to the user (if, for example, a user sets a label with the key __deleted__name__, the behavior would be undefined). This may be an edge case, but it is worth considering.
https://github.com/jcreixell/prometheus/pull/2 extends the Series and Sample structs with a boolean flag to track deletion information. While fully transparent to the user, it requires manual propagation of deletion information and introduces higher complexity.
Any feedback more than welcome :bow:
I think people struggle with PromQL already, so anything you can do to keep it intuitive (or keep it from becoming more complicated) is a good idea, so I'd develop number 2. But as was noted in our sync meeting, keep the flag at the series level to avoid overhead.
This issue was previously named as:
Problem / Use-Case
As an Admin/Manager/Operator of a Prometheus instance, one use-case is to understand, and then manage down, the usage on the instance and the cost of running it. Sometimes there are a few metrics with very high DPM (i.e. a high frequency of scraping / a small scrape-interval), which increase "usage" on the instance and increase costs. For a multi-tenant project which implements PromQL (such as Mimir), this appears as a DPM overage and an increase in the dollar costs for that tenant. The Admin then needs to identify which metrics are responsible for the high DPM; which are the worst offenders. I've tried to find these with the following query:
topk(10, count_over_time({__name__!=""}[1m]))
Unfortunately this fails with:
execution: vector cannot contain metrics with the same labelset
Boiling this example down a bit, the error is present with the query count_over_time({__name__!=""}[1m]), and it's present for my small instance which has only 4,200 metrics. Google suggests that a typical workaround for this error is to use label_replace to move the original metric name into a temporary label. So, incorporating this fix, I get to this query:
topk(10, count_over_time(label_replace({__name__!=""},"name_label","$1","__name__", "(.+)")[1m:]))
But I've now introduced another problem. (Can you spot it?) I had to switch from [1m] to [1m:], i.e. convert to a subquery, to avoid parse error: ranges only allowed for vector selectors. But by switching to a subquery, count_over_time is no longer counting the DPM of the underlying metrics/series; instead, it counts the subquery evaluations during the interval in question!
So my question is this: How can I identify the series that have the highest DPM via PromQL, so I can effectively manage cost and usage on my instance? (Or, if it weren't via PromQL, how would we change Prometheus to better support this use-case, and how would operators of multi-tenant Prometheus services such as Mimir support their tenants to achieve the same?)
Can we make changes to Prometheus to better support this use-case?
Note: I'm aware that the query I'm trying to run can be very computationally expensive, and may fail if the result-set is too big. But I'm not aware of alternatives for the case I'm describing.
Proposal / Ideas
I believe vector cannot contain metrics with the same labelset happens because Prometheus removes the metric name when performing count_over_time. It does this for any function which changes the 'dimensions' of the metric. This is an opinionated decision taken by the project in the past; I think the opinion is that 'no metric name is better than the wrong metric name', and metric names typically do include dimensions (such as …_seconds). But personally (as an instance admin, and as someone who builds features for the instance-admin persona) I've found myself hitting this error message a few times, and found it a barrier to learning and using Prometheus.
Some ideas then, to start some discussion:
- Somehow always stop users from hitting the error vector cannot contain metrics with the same labelset, while also retaining the rule that "we mustn't show the wrong metric name" (e.g. by moving the original metric name into a label, whether __name__ or possibly a different label-key, such as original_name).
- Somehow always stop users from hitting the error parse error: ranges only allowed for vector selectors when they have been forced to use label_replace (by the metric name being removed). I.e. have Prometheus consider label_replace not as a true part of the query which needs to be evaluated, but as a simple rename which can happen somehow differently at the start/end of the querying process...? I lack the words to describe this as I don't know the codebase, but I discussed this with @beorn7 a few months ago, and he thought it was a possibility (although maybe not until Prom v3).
- Could we implement the keep_metric_names option that VictoriaMetrics has implemented?
- "Extend the cardinality API to cover time ranges" (Bryan Boreham's suggestion; I think for a Mimir implementation)