prometheus-community / helm-charts

Prometheus community Helm charts

[kube-prometheus-stack] intermittent prometheus failure #3278

Open abhinavDhulipala opened 1 year ago

abhinavDhulipala commented 1 year ago

Describe the bug

Prometheus keeps intermittently failing and coming back online; an example is shown in the attached screenshot.

I also have a dev Prometheus instance, which gives me the data in the second screenshot.

This seems to fail only for certain metrics. It's also worth noting that it started failing all of a sudden, with no apparent cause. kube-prometheus-stack is deployed as a subchart of a tobs chart that I have installed. The pod appears to be running, but its logs suddenly show the following:

ts=2023-04-24T00:52:28.602Z caller=manager.go:640 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-tobs-kube-prometheus-prometheus-rulefiles-0/observability-tobs-kube-prometheus-kubelet.rules-2ad2a987-8abb-47cb-99fa-6fd7501da949.yaml group=kubelet.rules name=node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile index=2 msg="Evaluating rule failed" rule="record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.5, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n  quantile: \"0.5\"\n" err="found duplicate series for the match group {instance=\"192.168.110.101:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.110.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ts2.ba.rivosinc.com\", service=\"tobs0-kube-prometheus-stac-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.110.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ts2.ba.rivosinc.com\", service=\"tobs-kube-prometheus-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
ts=2023-04-24T00:52:37.494Z caller=dedupe.go:112 component=remote level=info remote_name=5a38dd url=http://tobs0-promscale.observability.svc:9201/write msg="Done replaying WAL" duration=41.506985997s
ts=2023-04-24T00:52:45.988Z caller=dedupe.go:112 component=remote level=info remote_name=5a38dd url=http://tobs0-promscale.observability.svc:9201/write msg="Remote storage resharding" from=1 to=56
ts=2023-04-24T00:52:55.988Z caller=dedupe.go:112 component=remote level=info remote_name=5a38dd url=http://tobs0-promscale.observability.svc:9201/write msg="Remote storage resharding" from=56 to=91
ts=2023-04-24T00:52:57.492Z caller=manager.go:640 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-tobs-kube-prometheus-prometheus-rulefiles-0/observability-tobs0-kube-prometheus-stac-kubernetes-system-kubelet-e0b6878f-4e26-40a2-a3e7-317ce0c1fd50.yaml group=kubernetes-system-kubelet name=KubeletPodStartUpLatencyHigh index=5 msg="Evaluating rule failed" rule="alert: KubeletPodStartUpLatencyHigh\nexpr: histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\",metrics_path=\"/metrics\"}[5m])))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"}\n  > 60\nfor: 15m\nlabels:\n  severity: warning\nannotations:\n  description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds\n    on node {{ $labels.node }}.\n  runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh\n  summary: Kubelet Pod startup latency is too high.\n" err="found duplicate series for the match group {instance=\"192.168.110.101:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.110.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ts2.ba.rivosinc.com\", service=\"tobs0-kube-prometheus-stac-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.110.101:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ts2.ba.rivosinc.com\", service=\"tobs-kube-prometheus-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side
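The error itself points at the cause: `kubelet_node_name` comes back twice for the same `instance`, once per kubelet Service (`tobs0-kube-prometheus-stac-kubelet` and `tobs-kube-prometheus-kubelet`), so the two series differ only in their `service` label and the `on (cluster, instance) group_left (node)` join in the rules can no longer match one-to-one. One way to confirm this directly, as a sketch (pod name and namespace taken from this report, default port 9090 assumed, `jq` optional):

```shell
# Forward the Prometheus HTTP port locally.
kubectl -n observability port-forward pod/prometheus-tobs-kube-prometheus-prometheus-0 9090:9090 &

# If two results come back for the same instance, differing only in the
# "service" label, that is the duplicate the rule evaluation complains about.
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=kubelet_node_name{job="kubelet", metrics_path="/metrics"}' \
  | jq '.data.result[].metric'
```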

I don't see why this suddenly started happening. I've also checked that my PVCs/PVs are in good health, as this is deployed on a kubeadm-administered local cluster.
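For the storage check mentioned above, something like the following is enough to see whether the claims are still bound (namespace from this report; a sketch, not the exact commands used):

```shell
# Both the PVC and its backing PV should show STATUS=Bound.
kubectl -n observability get pvc,pv
```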

What's your helm version?

version.BuildInfo{Version:"v3.10.1", GitCommit:"9f88ccb6aee40b9a0535fcc7efea6055e1ef72c9", GitTreeState:"clean", GoVersion:"go1.18.7"}

What's your kubectl version?

Client Version: v1.24.13
Kustomize Version: v4.5.4
Server Version: v1.24.0

Which chart?

kube-prometheus-stack

What's the chart version?

39.9.0

What happened?

Operation was normal until this suddenly started happening. We have been adding more scrape targets, but I don't think that should result in this behavior.

What you expected to happen?

Expected operation to continue as normal.

How to reproduce it?

Not completely sure

Enter the changed values of values.yaml?

These are close to the defaults for tobs 14.3.0:

  fullnameOverride: "tobs-kube-prometheus"
  alertmanager:
    alertmanagerSpec:
      image:
        repository: quay.io/prometheus/alertmanager
        tag: v0.24.0
      replicas: 1
      ## AlertManager resource requests
      resources:
        limits:
          memory: 100Mi
          cpu: 100m
        requests:
          memory: 50Mi
          cpu: 4m
  prometheusOperator:
    image:
      repository: quay.io/prometheus-operator/prometheus-operator
      tag: v0.58.0
      pullPolicy: IfNotPresent
    ## Prometheus config reloader configuration
    prometheusConfigReloader:
      # image to use for config and rule reloading
      image:
        repository: quay.io/prometheus-operator/prometheus-config-reloader
        tag: v0.58.0
      # resource config for prometheusConfigReloader
      resources:
        requests:
          cpu: 100m
          memory: 50Mi
        limits:
          cpu: 200m
          memory: 50Mi
    ## Prometheus Operator resource requests
    resources:
      limits:
        memory: 200Mi
        cpu: 100m
      requests:
        memory: 100Mi
        cpu: 10m
  prometheus:
    prometheusSpec:
      image:
        repository: quay.io/prometheus/prometheus
        tag: v2.43.0
      replicaExternalLabelName: "__replica__"
      prometheusExternalLabelName: "cluster"
      remoteRead:
        - url: "http://{{ .Release.Name }}-promscale.{{ .Release.Namespace }}.svc:9201/read"
          readRecent: true

      # ref: https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#remotewritespec
      remoteWrite:
        - url: "http://{{ .Release.Name }}-promscale.{{ .Release.Namespace }}.svc:9201/write"
      storageSpec:
        disableMountSubPath: true
        volumeClaimTemplate:
          spec:
            accessModes:
              - "ReadWriteOnce"
            resources:
              requests:
                storage: 8Gi
      additionalScrapeConfigsSecret:
        enabled: true
        name: additional-scrape-configs
        key: prometheus-additional.yml
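Note that `additionalScrapeConfigsSecret` only references a Secret that must already exist; the chart does not create it. A minimal sketch of how such a Secret is typically created (namespace assumed from this report; the file name must match the `key` above):

```shell
# Create or update the Secret referenced by prometheusSpec.additionalScrapeConfigsSecret.
kubectl -n observability create secret generic additional-scrape-configs \
  --from-file=prometheus-additional.yml \
  --dry-run=client -o yaml | kubectl apply -f -
```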

Enter the command that you execute and failing/misfunctioning.

kubectl logs prometheus-tobs-kube-prometheus-prometheus-0

Anything else we need to know?

This Prometheus remote-writes to Promscale, which has been deprecated. We are in the process of migrating away from it, but the problem seems to lie with Prometheus within this chart. Any attempt to upgrade individual images has resulted in the service becoming non-functional for a variety of reasons. I'm more interested in figuring out the source of this many-to-many mismatch and why it suddenly started happening. Please let me know what other information I can provide.
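Since the two duplicate series differ only in their `service` label (`tobs0-kube-prometheus-stac-kubelet` vs `tobs-kube-prometheus-kubelet`), it looks like two installs/renames of the stack have each left a kubelet Service behind and both sets of endpoints are being scraped. A way to check for a stale leftover, as a sketch (object names are the ones from the error, not verified against the cluster):

```shell
# The operator-managed kubelet Service lives in kube-system; two kubelet
# Services here would explain the duplicate kubelet_node_name series.
kubectl -n kube-system get svc,endpoints | grep -i kubelet

# Also check whether more than one ServiceMonitor selects the kubelet.
kubectl get servicemonitors -A | grep -i kubelet
```

If one of the two Services belongs to an old or renamed release and is no longer managed by anything, removing it should make the `on (cluster, instance)` join one-to-one again; verify which one is orphaned before deleting.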

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

tzuhsunn commented 7 months ago

I ran into a similar problem, and the CPU usage looked like the attached screenshot. I found that my node-exporter was working well, and so were all the pods, except that prometheus-adapter was not receiving metrics from node-exporter.

For me, the problem occurred after the HA cluster switched to a new master node. node-exporter had not updated its CoreDNS IP setting (it did not point at the new master), so I simply deleted the node-exporter pods and let them restart.
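A minimal sketch of that workaround (the DaemonSet name below is a guess based on a tobs-style release; adjust the name and namespace to your deployment):

```shell
# Recreate the node-exporter pods so they pick up the current cluster DNS settings.
kubectl -n observability rollout restart daemonset tobs-prometheus-node-exporter
kubectl -n observability rollout status daemonset tobs-prometheus-node-exporter
```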