prometheus-community / helm-charts

Prometheus community Helm charts

[prometheus-kube-stack] no cpu, memory metrics for pods #1250

Closed Va1 closed 2 years ago

Va1 commented 3 years ago

Describe the bug

so, upon installing the latest chart version (17.2.1) on the latest EKS (recently upgraded to 1.21) and checking Grafana, i've noticed that there's "no data" everywhere for pods.

and upon checking in prometheus, i've realized that at least pod/container CPU & memory metrics are not present at all.
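
for reference, these are the standard cadvisor series behind those Grafana panels; queries of this shape come back empty here:

rate(container_cpu_usage_seconds_total{container!=""}[5m])
container_memory_working_set_bytes{container!=""}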

What's your helm version?

version.BuildInfo{Version:"v3.6.3", GitCommit:"d506314abfb5d21419df8c7e7e68012379db2354", GitTreeState:"dirty", GoVersion:"go1.16.5"}

What's your kubectl version?

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:56:19Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

Which chart?

kube-prometheus-stack

What's the chart version?

17.2.1

What happened?

pods/containers CPU & memory metrics are missing

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: true
    general: true
    k8s: true
    kubeApiserver: true
    kubeApiserverAvailability: true
    kubeApiserverError: true
    kubeApiserverSlos: true
    kubelet: true
    kubePrometheusGeneral: true
    kubePrometheusNodeAlerting: true
    kubePrometheusNodeRecording: true
    kubernetesAbsent: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    kubeScheduler: true
    kubeStateMetrics: true
    network: true
    node: true
    prometheus: true
    prometheusOperator: true
    time: true
  appNamespacesTarget: ".*"
alertmanager:
  enabled: true
  ingress:
    enabled: false
    annotations:
      nginx.org/mergeable-ingress-type: minion
    ingressClassName: main
    pathType: ImplementationSpecific
    hosts:
      - alertmanager.smth.io
    tls:
      - secretName: tls-certs--alertmanager.smth.io
        hosts:
          - alertmanager.smth.io
  alertmanagerSpec:
    logFormat: json
    logLevel: debug
    replicas: 1
    retention: 168h
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
    externalUrl: https://alertmanager.smth.io
    nodeSelector:
      rd/node-type: monitoring
    tolerations:
      - key: rd/node-type-dedicated
        operator: Equal
        value: monitoring
        effect: NoSchedule
    resources:
      limits:
        memory: 600Mi
        cpu: 300m
      requests:
        memory: 400Mi
        cpu: 100m
grafana:
  enabled: true
  adminPassword: testtesttesttest
  ingress:
    enabled: true
    annotations:
      nginx.org/mergeable-ingress-type: minion
    ingressClassName: main
    pathType: ImplementationSpecific
    hosts:
      - grafana.smth.io
    tls:
      - secretName: tls-certs--grafana.smth.io
        hosts:
          - grafana.smth.io
kubeApiServer:
  enabled: true
kubelet:
  enabled: true
  serviceMonitor:
    https: false
kubeControllerManager:
  enabled: true
coreDns:
  enabled: true
kubeDns:
  enabled: false
kubeEtcd:
  enabled: true
kubeScheduler:
  enabled: true
kubeProxy:
  enabled: true
kubeStateMetrics:
  enabled: true
nodeExporter:
  enabled: true
prometheusOperator:
  enabled: true
  admissionWebhooks:
    failurePolicy: Fail
    enabled: true
    patch:
      enabled: true
      nodeSelector:
        rd/node-type: monitoring
      tolerations:
        - key: rd/node-type-dedicated
          operator: Equal
          value: monitoring
          effect: NoSchedule
      resources:
        requests:
          memory: 100Mi
          cpu: 100m
        limits:
          memory: 300Mi
          cpu: 300m
  namespaces:
    releaseNamespace: true
    additional:
      - kube-system
      - argo
      - production
      - krdka-operator
      - krdka-cluster
  logFormat: json
  logLevel: debug
  kubeletService:
    enabled: false
  nodeSelector:
    rd/node-type: monitoring
  tolerations:
    - key: rd/node-type-dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule
  resources:
    requests:
      cpu: 100m
      memory: 200Mi
    limits:
      cpu: 300m
      memory: 300Mi
prometheus:
  enabled: true
  ingress:
    enabled: true
    annotations:
      nginx.org/mergeable-ingress-type: minion
    ingressClassName: main
    pathType: ImplementationSpecific
    hosts:
      - prometheus.smth.io
    tls:
      - secretName: tls-certs--prometheus.smth.io
        hosts:
          - prometheus.smth.io
  prometheusSpec:
    externalUrl: https://prometheus.smth.io
    retention: 7d
    retentionSize: 25GB
    walCompression: true
    logLevel: debug
    logFormat: json
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 30Gi
    nodeSelector:
      rd/node-type: monitoring
    tolerations:
      - key: rd/node-type-dedicated
        operator: Equal
        value: monitoring
        effect: NoSchedule
    resources:
      limits:
        memory: 1200Mi
        cpu: 500m
      requests:
        memory: 1Gi
        cpu: 200m

Enter the command that you execute that is failing/malfunctioning.

helm install prometheus-stack prometheus-community/kube-prometheus-stack --version 17.2.1 --values values.yaml

Anything else we need to know?

No response

oreststetsiak commented 3 years ago

same for me, have updated from 15.4.5 to 18.0.1 and got "No data" for CPU for pods

jakubhajek commented 3 years ago

Where did you deploy your cluster? I mean is this on-prem installation or do you use any cloud provider?

I face the same issue on bare-metal with Debian 10 with kernel 4.19, I suspect that it might be somehow related to CPU Accounting.

oreststetsiak commented 3 years ago

Where did you deploy your cluster? I mean is this on-prem installation or do you use any cloud provider?

I face the same issue on bare-metal with Debian 10 with kernel 4.19, I suspect that it might be somehow related to CPU Accounting.

in my case, its GKE - 1.20.8-gke.900

oreststetsiak commented 3 years ago

I can update to helm release 16.6.3 and CPU data is present just fine. If I update to 16.6.4 (or any later), I get no data for CPU, but memory still works just fine.
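
for reference, pinning the last version that still works for me looks like this (assuming the same release name and values file as in the original report):

helm upgrade prometheus-stack prometheus-community/kube-prometheus-stack --version 16.6.3 --values values.yaml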

jakubhajek commented 3 years ago

I solved that issue on my side. In my case, it was related to kube-state-metrics and its service monitor.

First, check if that query works correctly:

kube_pod_info{namespace="monitoring", pod="kube-prometheus-stack-kube-state-metrics-77ffcf4f67-f8qj7"}

If you get results, you are probably facing a different issue.

In my case, I had no results because of the port names used in the ServiceMonitor. The Service that exposes kube-state-metrics had a port named http, but the ServiceMonitor referenced a port named metrics. As a result, the ServiceMonitor couldn't reach the kube-state-metrics Service and couldn't scrape the detailed metrics. The port name exposed by the Service must match the port name used in the ServiceMonitor.

apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus-stack
    meta.helm.sh/release-namespace: monitoring
    prometheus.io/scrape: "true"
  creationTimestamp: "2021-06-02T13:47:47Z"
  labels:
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    helm.sh/chart: kube-state-metrics-3.4.2
    helm.toolkit.fluxcd.io/name: kube-prometheus-stack
    helm.toolkit.fluxcd.io/namespace: monitoring
spec:
  clusterIP: 10.233.60.122
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/name: kube-state-metrics
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kube-prometheus-stack
    meta.helm.sh/release-namespace: monitoring
  creationTimestamp: "2021-06-02T13:47:48Z"
  generation: 3
  labels:
    app: kube-prometheus-stack-kube-state-metrics
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 18.0.2
    chart: kube-prometheus-stack-18.0.2
    helm.toolkit.fluxcd.io/name: kube-prometheus-stack
    helm.toolkit.fluxcd.io/namespace: monitoring
    heritage: Helm
    release: kube-prometheus-stack
spec:
  endpoints:
  - honorLabels: true
    port: http
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/instance: kube-prometheus-stack
      app.kubernetes.io/name: kube-state-metrics
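
A quick way to compare the two port names directly (assuming the resource names created by the kube-prometheus-stack release in the monitoring namespace, as above):

kubectl -n monitoring get svc kube-prometheus-stack-kube-state-metrics -o jsonpath='{.spec.ports[*].name}'
kubectl -n monitoring get servicemonitor kube-prometheus-stack-kube-state-metrics -o jsonpath='{.spec.endpoints[*].port}'

If the two outputs differ, you are hitting the mismatch described above.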

I hope that helps. Let me know.

oreststetsiak commented 3 years ago

yep, looks like mine is different: [screenshot]

oreststetsiak commented 3 years ago

looks like you are right, here is the change introduced between 16.6.3 and 16.6.4: [screenshot]

CodechCFA commented 3 years ago

Having the same issue. I can confirm that the port mismatch is not my issue as that mismatch doesn't seem to be present on 18.0.2. I'm running EKS 1.21.

Va1 commented 3 years ago

@oreststetsiak thank you for commenting. unfortunately, downgrading to 16.6.3 did not do the trick for me – still no memory or CPU in Prometheus.

@jakubhajek and thanks for your suggestion. in fact, the query you posted returns results, so i'm facing a different issue.

still have not found a solution to this, unfortunately. is there a good alternative to this helm chart? ideally, a stack chart with all components bundled.

Macbet commented 3 years ago

[screenshot] I think the problem is that the current dashboards refer to the label image!="", but if we run the query container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor"} we can see that this label is no longer there, while pod is. [screenshot] Version 18.0.5 and EKS 1.21.2.
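
One way to confirm whether any cadvisor series still carry the image label (metric and job names as in the query above); if this returns nothing, the label is indeed gone:

count(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""})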

Macbet commented 3 years ago

i replaced the labels, going from container_memory_working_set_bytes{cluster="$cluster", namespace="$namespace", container!="", image!=""} to container_memory_working_set_bytes{cluster="$cluster", namespace="$namespace", pod!=""}, and got this: [screenshot] i think it needs a deep refactor

haskjold commented 3 years ago

Hi

I believe the queries are correct and should include image!="", especially given that they have worked before. If the image label and possibly other labels are empty, then metric metadata is most likely missing.

When I was hit by this issue with No Data in most of my Grafana dashboards, I traced it back to the fact that I had switched to the containerd runtime when I upgraded from EKS 1.20 to EKS 1.21. It turns out that the AWS AMI uses a non-default socket for containerd (/run/dockershim.sock) instead of the default one (/run/containerd/containerd.sock). This causes cadvisor to fail to fetch metrics from the container runtime because it expects it to be available at the default socket location.
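
A quick way to see which runtime each node is actually running is the CONTAINER-RUNTIME column of:

kubectl get nodes -o wide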

Switching back to the docker container runtime fixes the problem and I again get all the metrics I expect. You can also do some creative symlinking to fix this or wait for the fix to be released: https://github.com/awslabs/amazon-eks-ami/pull/724
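
For reference, the creative symlinking boils down to pointing the default socket path at the one the AMI actually uses. A minimal sketch, assuming containerd is listening on /run/dockershim.sock as described above, to be run on each affected node (e.g. via user data):

#!/bin/bash
# point the default containerd socket path at the socket the EKS AMI actually uses
if [ ! -e /run/containerd/containerd.sock ]; then
  ln -s /run/dockershim.sock /run/containerd/containerd.sock
fi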

Also related: https://github.com/kubernetes/kubernetes/issues/89903

There might of course be other problems here, but I advise against rewriting all the queries, as I don't believe that is the root cause :)

Macbet commented 3 years ago

@haskjold thank you very much for telling me about this bug, you saved me from long hours of unnecessary work

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

Va1 commented 2 years ago

@haskjold thank you very much for your answer.

if the recommended solution is to switch back to the docker container runtime, what would be the easiest way to achieve this while staying with AWS EKS and EKS worker groups?

same question applies to creative symlinking.
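
for context, i assume this boils down to the bootstrap script's --container-runtime flag on the EKS AMIs, e.g. something like the following in the worker group user data (cluster name is a placeholder):

#!/bin/bash
# dockerd is the default runtime on these AMIs; passing it explicitly
# (or simply not passing --container-runtime containerd) keeps docker
/etc/eks/bootstrap.sh my-cluster --container-runtime dockerd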

Angelin01 commented 2 years ago

@Va1 The latest EKS AMIs already have the symlink workaround built in. Simply update your cluster to use any AMI >= v20211001.

I can verify it is working perfectly on our deployment.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] commented 2 years ago

This issue is being automatically closed due to inactivity.

lukerogers commented 2 years ago

In case anyone is as dumb as I am, I'll leave this here.

I was enabling collecting data for verticalpodautoscalers and added the following config to my values.yaml file

kube-state-metrics:
  collectors:
    - verticalpodautoscalers

That effectively removed ALL OTHER collectors. I had to update the list to include the full default set (sketched below) and things came back.
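
To spell that out, a sketch of what the list then needs to look like; the exact default collector set depends on your kube-state-metrics chart version, so copy it from that chart's values.yaml rather than from here:

kube-state-metrics:
  collectors:
    - verticalpodautoscalers
    - pods
    - deployments
    - daemonsets
    - statefulsets
    - nodes
    - namespaces
    # ...plus the remaining defaults for your chart version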