prometheus / prometheus

The Prometheus monitoring system and time series database.
https://prometheus.io/
Apache License 2.0

Inconsistent `vector cannot contain metrics with the same labelset` errors for functions over range vectors #14695

Open charleskorn opened 2 months ago

charleskorn commented 2 months ago

What did you do?

Running a query like max_over_time({__name__=~"metric_.*"}[5m]) produces inconsistent results when it is run as separate single-step queries rather than as one range query that evaluates at the same steps.

I've summarised the issue with a test case in promqltest syntax:

load 6m
  metric_1{common="label"} 0 1 _ _ 4 5
  metric_2{common="label"} _ _ 2 3 _ 6

# No conflicts, should merge series into one output series.
#
# This succeeds.
eval range from 0 to 24m step 6m ceil({__name__=~"metric_.*"})
  {common="label"} 0 1 2 3 4

# Same as above, but with conflict at T=30m.
#
# This succeeds (ie. it returns the expected error message).
eval_fail range from 0 to 30m step 6m ceil({__name__=~"metric_.*"})
  expected_fail_message vector cannot contain metrics with the same labelset

# Same two cases as above, but with a function that takes a range vector.
#
# All of these single step range queries succeed. Range queries that only select metric_1 or metric_2 (eg. 0 to 6m, or 12m to 18m) also succeed.
eval range from 0 to 0 step 1m max_over_time({__name__=~"metric_.*"}[5m])
  {common="label"} 0

eval range from 6m to 6m step 1m max_over_time({__name__=~"metric_.*"}[5m])
  {common="label"} 1

eval range from 12m to 12m step 1m max_over_time({__name__=~"metric_.*"}[5m])
  {common="label"} 2

eval range from 18m to 18m step 1m max_over_time({__name__=~"metric_.*"}[5m])
  {common="label"} 3

eval range from 24m to 24m step 1m max_over_time({__name__=~"metric_.*"}[5m])
  {common="label"} 4

# This is the problematic case: 
# 
# This range query covers all of the steps above, but fails with "vector cannot contain metrics with the same labelset".
eval range from 0 to 24m step 6m max_over_time({__name__=~"metric_.*"}[5m])
  {common="label"} 0 1 2 3 4

# This succeeds (ie. it returns the expected error message).
eval_fail range from 0 to 30m step 6m max_over_time({__name__=~"metric_.*"}[5m])
  expected_fail_message vector cannot contain metrics with the same labelset

(I've used eval range throughout as eval instant runs into a legitimate instance of vector cannot contain metrics with the same labelset when it runs a range query equivalent of the expression.)

What did you expect to see?

All test cases behave as expected, ie. are consistent regardless of the time range queried.

What did you see instead? Under which circumstances?

The eval range from 0 to 24m step 6m max_over_time({__name__=~"metric_.*"}[5m]) scenario fails with vector cannot contain metrics with the same labelset.

roidelapluie commented 1 week ago

We should not change this behavior in Prometheus. The error you're encountering when multiple series have identical labels after applying functions like max_over_time is intentional. It serves as a useful alert to potential misconfigurations or labeling issues in your metrics.

In real-world scenarios, scrapes are not perfectly aligned; if we fix this, such label conflicts are unlikely unless there's an actual problem. By erroring out, Prometheus helps identify and fix issues that could compromise the accuracy of the monitoring data.

Therefore, it's important to let Prometheus continue raising this error to maintain data integrity and alert users to potential metric labeling problems...

charleskorn commented 1 week ago

I understand the importance of the error, but it is not being consistently returned.

In the example above, I run the query max_over_time({__name__=~"metric_.*"}[5m]) at five separate timestamps: 0, 6m, 12m, 18m and 24m. None of these single-timestamp queries returns an error.

But if I run a single range query from 0 to 24m with a step of 6m, which evaluates the same expression at the same timestamps as the individual queries, I do get an error.

This is also inconsistent with the behaviour of the ceil({__name__=~"metric_.*"}) case, which matches my expectations.
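
To make the inconsistency concrete, here is a minimal Python sketch that replays the test data outside Prometheus. It is not the engine's code: `load`, `evaluate`, `conflict_per_step` and `conflict_whole_range` are hypothetical helpers, and the two conflict checks are only assumed models of how a duplicate-labelset check could be applied (per timestamp, or once over the whole result). The thread does not confirm which one the engine uses; the sketch only shows that a whole-range check would reproduce the reported behaviour while a per-step check would not.

```python
from typing import Dict, List, Optional, Tuple

MINUTE = 60
Series = Dict[int, float]  # timestamp in seconds -> sample value


def load(step_s: int, values: List[Optional[float]]) -> Series:
    """Mimic a promqltest `load` line; None stands for `_` (no sample)."""
    return {i * step_s: v for i, v in enumerate(values) if v is not None}


# Both series become {common="label"} once the function drops __name__,
# so in this sketch any two entries share the same output labelset.
SERIES: List[Tuple[str, Series]] = [
    ("metric_1", load(6 * MINUTE, [0, 1, None, None, 4, 5])),
    ("metric_2", load(6 * MINUTE, [None, None, 2, 3, None, 6])),
]


def max_over_time(series: Series, t: int, window_s: int) -> Optional[float]:
    """Max of samples in (t - window, t], or None if the window is empty.

    The left-open boundary is an assumption; with 6m spacing and a 5m
    window it makes no difference for this data.
    """
    in_window = [v for ts, v in series.items() if t - window_s < ts <= t]
    return max(in_window) if in_window else None


def evaluate(start: int, end: int, step: int, window_s: int) -> Dict[str, Series]:
    """Return {input series name: {t: value}} for steps that produce output."""
    out: Dict[str, Series] = {}
    for name, series in SERIES:
        points = {}
        for t in range(start, end + 1, step):
            v = max_over_time(series, t, window_s)
            if v is not None:
                points[t] = v
        if points:
            out[name] = points
    return out


def conflict_per_step(result: Dict[str, Series]) -> bool:
    """Model A: conflict only if two series have a point at the same timestamp."""
    seen = set()
    for points in result.values():
        for t in points:
            if t in seen:
                return True
            seen.add(t)
    return False


def conflict_whole_range(result: Dict[str, Series]) -> bool:
    """Model B: conflict if two series appear anywhere in the queried range."""
    return len(result) > 1


if __name__ == "__main__":
    for end_min in (24, 30):
        r = evaluate(0, end_min * MINUTE, 6 * MINUTE, 5 * MINUTE)
        print(f"0 to {end_min}m: per-step conflict={conflict_per_step(r)}, "
              f"whole-range conflict={conflict_whole_range(r)}")
```

Under these assumptions, the 0 to 24m query has no timestamp at which both input series produce a point, so a per-step check passes and matches the single-step queries, whereas a whole-range check flags a conflict. Extending the range to 30m produces a conflict under both models, which matches the legitimate eval_fail case.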