prometheus-community / helm-charts

Prometheus community Helm charts

[kube-prometheus-stack] Crash loop back on startup (matching labels must be unique on one side) #3280

Open Jamsa opened 1 year ago

Jamsa commented 1 year ago

Describe the bug (a clear and concise description of what the bug is)

Crash loop back on startup: matching labels must be unique on one side

What's your helm version?

version.BuildInfo{Version:"v3.9.3", GitCommit:"414ff28d4029ae8c8b05d62aa06c7fe3dee2bc58", GitTreeState:"clean", GoVersion:"go1.17.13"}

What's your kubectl version?

Client Version: v1.24.3 Kustomize Version: v4.5.4 Server Version: v1.21.5

Which chart?

kube-prometheus-stack

What's the chart version?

42.3.0

What happened?

ts=2023-04-24T12:26:22.131Z caller=main.go:1221 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=445.781959ms db_storage=26.996µs remote_storage=6.487µs web_handler=1.266µs query_engine=2.241µs scrape=22.247212ms scrape_sd=82.817005ms notify=1.645616ms notify_sd=3.038914ms rules=277.574397ms tracing=1.287696ms
ts=2023-04-24T12:26:22.131Z caller=main.go:965 level=info msg="Server is ready to receive web requests."
ts=2023-04-24T12:26:22.131Z caller=manager.go:943 level=info component="rule manager" msg="Starting rule manager..."
ts=2023-04-24T12:26:44.111Z caller=manager.go:638 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-kube-prome-prometheus-rulefiles-0/monitoring-kube-prometheus-kube-prome-kubernetes-system-kubelet-0b034194-7f11-4bac-af6e-474aa7b075c2.yaml group=kubernetes-system-kubelet name=KubeletPodStartUpLatencyHigh index=5 msg="Evaluating rule failed" rule="alert: KubeletPodStartUpLatencyHigh\nexpr: histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\",metrics_path=\"/metrics\"}[5m])))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"}\n  > 60\nfor: 15m\nlabels:\n  severity: warning\nannotations:\n  description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds\n    on node {{ $labels.node }}.\n  runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh\n  summary: Kubelet Pod startup latency is too high.\n" err="found duplicate series for the match group {instance=\"10.15.26.12:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.12:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplaner-1\", service=\"kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.12:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplaner-1\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
ts=2023-04-24T12:26:48.738Z caller=manager.go:638 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-kube-prome-prometheus-rulefiles-0/monitoring-kube-prometheus-kube-prome-kubelet.rules-b558bbcb-8faa-4fdd-a05a-80418d0f5777.yaml group=kubelet.rules name=node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile index=0 msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n  quantile: \"0.99\"\n" err="found duplicate series for the match group {instance=\"10.15.26.10:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
ts=2023-04-24T12:26:49.175Z caller=manager.go:638 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-kube-prome-prometheus-rulefiles-0/monitoring-kube-prometheus-kube-prome-kubelet.rules-b558bbcb-8faa-4fdd-a05a-80418d0f5777.yaml group=kubelet.rules name=node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile index=1 msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.9, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n  quantile: \"0.9\"\n" err="found duplicate series for the match group {instance=\"10.15.26.10:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
ts=2023-04-24T12:26:49.537Z caller=manager.go:638 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-kube-prometheus-kube-prome-prometheus-rulefiles-0/monitoring-kube-prometheus-kube-prome-kubelet.rules-b558bbcb-8faa-4fdd-a05a-80418d0f5777.yaml group=kubelet.rules name=node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile index=2 msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.5, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n  quantile: \"0.5\"\n" err="found duplicate series for the match group {instance=\"10.15.26.10:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"10.15.26.10:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"nwa-prod-controlplane-0\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
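
The err= lines above show the root cause: for each kubelet instance there are two kubelet_node_name series, one with service="kubelet" and one with service="kube-prometheus-kube-prome-kubelet", so the * on (cluster, instance) group_left (node) join in the rules becomes many-to-many. A minimal diagnostic sketch (the service, namespace, and release names are taken from this report and may differ elsewhere):

# Two Services in kube-system select the kubelet endpoints; list them.
kubectl -n kube-system get svc | grep -i kubelet

# Confirm from Prometheus: every instance should have exactly one kubelet_node_name series.
kubectl -n monitoring port-forward svc/kube-prometheus-kube-prome-prometheus 9090:9090 &
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count by (instance) (kubelet_node_name{job="kubelet"}) > 1'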

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

No response

Enter the command that you executed that is failing/misfunctioning.

helm install kube-prometheus --namespace monitoring kube-prometheus-stack -f values.yaml

alertmanager:
  enabled: true
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: rook-ceph-block
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

grafana:
  enabled: true
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hosts:
      - grafana.mycluster.mycompany.local
  persistence:
    type: pvc
    enabled: true
    storageClassName: rook-ceph-block
    accessModes:
      - ReadWriteOnce
    size: 10Gi

prometheus:
  enabled: true
  thanosService:
    enabled: true
  thanosServiceMonitor:
    enabled: false
  extraSecret:
    name: thanos-objstore-config
    data:
      thanos-storage-config.yaml: |-
        type: S3
        config:
          bucket: thanos-data
          endpoint: minio.minio.svc.cluster.local:9000
          access_key: access_key
          secret_key: access_pass
          insecure: true
  prometheusSpec:
    disableCompaction: true
    #retention: 2h
    retention: 20d
    replicas: 2
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: rook-ceph-block
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 500Gi
    thanos:
      objectStorageConfig:
        key: "thanos-storage-config.yaml"
        name: "thanos-objstore-config"

Anything else we need to know?

I checked this issue and tried to delete the svc in the kube-system namespace, but it didn't work: after deleting the svc, it was created again. https://github.com/prometheus-community/helm-charts/issues/635#issuecomment-774771566
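
The Service managed by the prometheus-operator for the current release (kube-prometheus-kube-prome-kubelet here) is expected to reappear after deletion; the duplicate to remove is whichever kubelet Service was left behind by an earlier or parallel monitoring installation. A hedged sketch using the names visible in the logs above:

# Inspect both Services that select the kubelet endpoints.
kubectl -n kube-system describe svc kubelet kube-prometheus-kube-prome-kubelet

# If the plain "kubelet" Service is a leftover from a previous install and nothing
# else scrapes it, deleting it removes the duplicate kubelet_node_name series.
kubectl -n kube-system delete svc kubelet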

Jamsa commented 1 year ago

I also tried 45.6.0 and got the same error.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

law commented 1 year ago

I am seeing the same thing. Please do not close this automatically.

eht16 commented 10 months ago

We had the same issue. In our case the reason was that the Helm chart was first deployed under another release name; that release was uninstalled but left some resources in the cluster, and when the chart was deployed again with the new release name we saw the error. Removing the old resources (IIRC it was some Services) fixed the problem.

hypery2k commented 3 months ago

Happens to me also with 58.5.3. This solved my issue:

kubectl -n kube-system delete svc prometheus-kube-prometheus-kubelet

anroots-by commented 1 month ago

In our case the reason was that the Helm chart was first deployed with another release name, then the chart was uninstalled but it left some resources in the cluster

This was my case as well: some Services in kube-system were not cleaned up when uninstalling a release and were affecting a new release with a different name. Manually deleting the two services with the old release name from kube-system solved it.
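
A more general way to spot leftovers like these, sketched under the assumption that only monitoring stacks create Services pointing at the kubelet's metrics port 10250:

# Any kubelet-targeting Service or Endpoints object whose name carries an old
# release prefix is a candidate leftover.
kubectl get svc -A | grep -i kubelet
kubectl get endpoints -A | grep 10250

Once the stale Service is gone, its targets stop being scraped, the duplicate kubelet_node_name series go stale within a few minutes, and the kubelet rules evaluate again without the many-to-many error.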