kruize / autotune

Autonomous Performance Tuning for Kubernetes!
Apache License 2.0

422 error found duplicate series for the match group #1353

Closed msvinaykumar closed 4 weeks ago

msvinaykumar commented 4 weeks ago

Description

The container query returns this error when it finds duplicate records:

{
  "status": "error",
  "errorType": "execution",
  "error": "found duplicate series for the match group {namespace=\"cadvisor\", pod=\"cadvisor-pgf6n\"} on the right hand-side of the operation: [{namespace=\"cadvisor\", pod=\"cadvisor-pgf6n\", prometheus=\"monitoring/k8stage\", prometheus_replica=\"prometheus-k8stage-1\", workload=\"cadvisor\", workload_type=\"daemonset\"}, {namespace=\"cadvisor\", pod=\"cadvisor-pgf6n\", prometheus=\"monitoring/k8stage\", prometheus_replica=\"prometheus-k8stage-0\", workload=\"cadvisor\", workload_type=\"daemonset\"}];many-to-many matching not allowed: matching labels must be unique on one side"
}

This error occurs in Prometheus when two sets of series with overlapping labels are involved in a query that requires unique matches on one side. Specifically, it indicates that there are duplicate series on the right-hand side with the same labels {namespace="cadvisor", pod="cadvisor-pgf6n"}, differing only in the prometheus_replica label.
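
As a quick check, listing the right-hand-side series for the pod named in the error (the selector below simply reuses the labels from the error message) should show the two entries that differ only in prometheus_replica:

namespace_workload_pod:kube_pod_owner:relabel{namespace="cadvisor", pod="cadvisor-pgf6n"}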

The query results in a "many-to-many matching not allowed" error because both kube_pod_container_info and namespace_workload_pod:kube_pod_owner:relabel carry overlapping label sets that prevent unique matching on the right-hand side of the * operation.
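
For illustration, narrowing the failing join to that one pod should reproduce the same error, because the right-hand side still returns two series for the same (pod, namespace) pair, one per prometheus_replica:

kube_pod_container_info{namespace="cadvisor", pod="cadvisor-pgf6n", container!=""}
* on (pod, namespace) group_left(workload, workload_type)
namespace_workload_pod:kube_pod_owner:relabel{namespace="cadvisor", pod="cadvisor-pgf6n"}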

Here’s how you can refine it to avoid the conflict:

Adjust the labels for unique matching: use on and group_left() so that the matching labels (pod and namespace) are unique on the right-hand side of the operation. Removing labels that only create duplication (such as prometheus_replica) from the join resolves the conflict.

Aggregate away prometheus_replica (if present): if duplicate series differing only in the prometheus_replica label are causing the conflict, aggregate over that label with avg or sum to consolidate the data and remove the prometheus_replica dimension, as in the refined query below (an alternative sketch using without() follows the query).

sum by (container, image, workload, workload_type, namespace) (
  avg_over_time(kube_pod_container_info{container!=""}[15d])
  * on (pod, namespace) group_left(workload, workload_type)
  avg by (pod, namespace, workload, workload_type) (
    avg_over_time(namespace_workload_pod:kube_pod_owner:relabel{workload!=""}[15d])
  )
)
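
As an alternative sketch (not tested against this setup), the replica labels can be dropped explicitly with a without() aggregation on the right-hand side; this keeps workload and workload_type available for group_left while making the matching labels unique:

# assumes the duplicate series differ only in the prometheus / prometheus_replica labels
sum by (container, image, workload, workload_type, namespace) (
  avg_over_time(kube_pod_container_info{container!=""}[15d])
  * on (pod, namespace) group_left(workload, workload_type)
  max without (prometheus, prometheus_replica) (
    avg_over_time(namespace_workload_pod:kube_pod_owner:relabel{workload!=""}[15d])
  )
)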

Type of change

How has this been tested?


Test Configuration

Checklist

Additional information


kusumachalasani commented 4 weeks ago

@msvinaykumar Can you please try

sum by (container, image, workload, workload_type, namespace) (
  avg_over_time(kube_pod_container_info{}[15d])
  * on (pod, namespace, prometheus_replica) group_left(workload, workload_type)
  avg_over_time(namespace_workload_pod:kube_pod_owner:relabel{}[15d])
)

and compare the performance difference between this query and the query mentioned in the code.

msvinaykumar commented 4 weeks ago


Done. Not much performance impact