headlamp-k8s / headlamp

A Kubernetes web UI that is fully-featured, user-friendly and extensible
https://headlamp.dev
Apache License 2.0

Headlamp cluster metrics are not showing the proper values #2043

Open mariogkds opened 5 months ago

mariogkds commented 5 months ago

Hello, I am a new user and I really like the project.

I am having some problems with the cluster-wide metrics that are shown on the dashboard:

[screenshot: cluster metrics on the dashboard]

I am using kube-prometheus-stack to handle Prometheus and Grafana, and prometheus-adapter for the metrics API.

To get Headlamp to show anything at all, I had to add a few settings to the charts' values:

kube-prometheus-stack

    kubelet:
      serviceMonitor:
        metricRelabelings:
          - action: replace
            sourceLabels:
              - node
            targetLabel: instance
    prometheus-node-exporter:
      prometheus:
        monitor:
          attachMetadata:
            node: true
          relabelings:
            - sourceLabels:
                - __meta_kubernetes_endpoint_node_name
              targetLabel: node
              action: replace
              regex: (.+)
              replacement: ${1}
          metricRelabelings:
            - action: replace
              regex: (.*)
              replacement: $1
              sourceLabels:
                - __meta_kubernetes_pod_node_name
              targetLabel: kubernetes_node
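
(To check that the relabeling above actually took effect, I just confirmed in Prometheus that the node-exporter series now carry a node label; the query below is only an example, and the label values will be whatever your node names are.)

    # after the relabelings above, node-exporter series should carry a "node" label
    count by (node) (node_cpu_seconds_total)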

prometheus-adapter (which is needed anyway to provide the metrics APIs)

      resource:
        cpu:
          containerQuery: |
            sum by (<<.GroupBy>>) (
              rate(container_cpu_usage_seconds_total{container!="",<<.LabelMatchers>>}[3m])
            )
          nodeQuery: |
            sum  by (<<.GroupBy>>) (
              rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal",<<.LabelMatchers>>}[3m])
            )
          resources:
            overrides:
              node:
                resource: node
              namespace:
                resource: namespace
              pod:
                resource: pod
          containerLabel: container
        memory:
          containerQuery: |
            sum by (<<.GroupBy>>) (
              avg_over_time(container_memory_working_set_bytes{container!="",<<.LabelMatchers>>}[3m])
            )
          nodeQuery: |
            sum by (<<.GroupBy>>) (
              avg_over_time(node_memory_MemTotal_bytes{<<.LabelMatchers>>}[3m])
              -
              avg_over_time(node_memory_MemAvailable_bytes{<<.LabelMatchers>>}[3m])
            )
          resources:
            overrides:
              node:
                resource: node
              namespace:
                resource: namespace
              pod:
                resource: pod
          containerLabel: container
        window: 3m
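
For reference, these are roughly the queries I run directly in Prometheus to sanity-check what the adapter should be returning per node; they just mirror the nodeQuery expressions above with the templating removed:

    # per-node CPU usage in cores (mirrors the cpu nodeQuery)
    sum by (node) (
      rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])
    )

    # per-node memory in use in bytes (mirrors the memory nodeQuery)
    sum by (node) (
      avg_over_time(node_memory_MemTotal_bytes[3m])
      -
      avg_over_time(node_memory_MemAvailable_bytes[3m])
    )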

Each individual node's CPU value is correct, and the memory value is correct as well, but the unit is different: [screenshot]

[screenshot]
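
I am not sure which base Headlamp uses when it formats memory, but for reference the same raw byte value reads quite differently depending on whether it is divided by 1000³ (GB) or 1024³ (GiB); for example, in Prometheus:

    # the same per-node memory value in decimal gigabytes (GB) ...
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1000 / 1000 / 1000

    # ... versus binary gibibytes (GiB)
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024 / 1024 / 1024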

Is this a Headlamp problem or a Prometheus (i.e. my) problem?

Thanks for the help and for the project, have a nice day.

joaquimrocha commented 5 months ago

Hi @mariogkds, thanks for the report. This looks like a unit conversion issue. We will take a look.

sarg3nt commented 2 months ago

@joaquimrocha I'm seeing this in the RAM metrics for deployments and pods too, and probably in other places as well. Grafana and crictl report the values correctly, but Headlamp shows much more. For example, the Headlamp UI shows the headlamp pod using 40 MB of RAM, but it's actually using 20.76 MB according to both Grafana and crictl, so it looks like roughly double. CPU and network are correct. Is this going to get fixed soon? It's confusing our users. Headlamp 0.25.1.
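
(For context, the Grafana number I'm comparing against is the container working-set memory; the query is roughly along these lines, with the label values being examples from our setup:)

    # working-set memory of the headlamp pod's containers, excluding the pod-level cgroup series
    sum (
      container_memory_working_set_bytes{namespace="headlamp", pod=~"headlamp-.*", container!=""}
    )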

joaquimrocha commented 2 months ago

@sarg3nt Yes, we do want to fix this but haven't had the bandwidth yet. Let me try to get it in our pipeline for the next release.

skoeva commented 1 month ago

Hi @mariogkds @sarg3nt, thanks for raising these issues! Would you be able to provide the YAML (with any sensitive data redacted) for the problematic resources? It would be super helpful for testing ^^

joaquimrocha commented 1 month ago

Hi @mariogkds and @sarg3nt, we really want to address this issue but we haven't been able to reproduce it. If you don't mind, please send us some sample YAML based on yours so @skoeva can take a look.

sarg3nt commented 1 month ago

@joaquimrocha sorry for the late reply. Work has been super busy. I'll get you something on Monday.

skoeva commented 3 weeks ago

We've just released our latest version :D

Just a reminder: if you guys are still running into this issue and would like us to get a fix in, your sample YAML would be super helpful to see

sarg3nt commented 2 weeks ago

@skoeva and @joaquimrocha apologies for not getting back to you. I've deployed 0.26.0 and still see the doubled RAM issue. Every pod I've checked so far shows double, even those with just one replica, so it's not double-counting multiple replicas. I'm not sure what you mean by sample YAML; if you mean our workloads, I can show the resulting Deployment and Pod YAML for Headlamp, since that is also one of the pods showing this doubled-RAM behavior. See the YAML below. Also of note: we deploy our own custom observability stack (Prometheus, Grafana, Thanos, etc.). The main thing to know is that our primary data source is Thanos, but Prometheus shows the same memory data as Thanos, as one would expect, and crictl on the nodes also shows the same.
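
If a per-series breakdown from Prometheus/Thanos helps narrow it down, this is roughly what I would pull for that pod (pod name taken from the Pod YAML below); note the container!="" filter in the second query, since cAdvisor also exposes a pod-level cgroup series with an empty container label:

    # every working-set series for the pod, including the pod-level cgroup series
    container_memory_working_set_bytes{namespace="headlamp", pod="headlamp-5847d9f6c8-hrwsr"}

    # only the actual containers
    container_memory_working_set_bytes{namespace="headlamp", pod="headlamp-5847d9f6c8-hrwsr", container!=""}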

Headlamp Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    meta.helm.sh/release-name: headlamp
    meta.helm.sh/release-namespace: headlamp
  creationTimestamp: "2024-11-08T17:23:10Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: headlamp
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: headlamp
    app.kubernetes.io/version: 0.26.0
    helm.sh/chart: headlamp-0.26.0
  name: headlamp
  namespace: headlamp
  resourceVersion: "4910016"
  uid: f7baaf27-b2eb-4242-86a8-61540068c8b6
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: headlamp
      app.kubernetes.io/name: headlamp
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: headlamp
        app.kubernetes.io/name: headlamp
    spec:
      containers:
      - args:
        - -in-cluster
        - -plugins-dir=/headlamp/plugins
        - -oidc-client-id=$(OIDC_CLIENT_ID)
        - -oidc-client-secret=$(OIDC_CLIENT_SECRET)
        - -oidc-idp-issuer-url=$(OIDC_ISSUER_URL)
        - -oidc-scopes=$(OIDC_SCOPES)
        envFrom:
        - secretRef:
            name: oidc
        image: ghcr.io/headlamp-k8s/headlamp:v0.26.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: headlamp
        ports:
        - containerPort: 4466
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 150m
            memory: 50Mi
          requests:
            cpu: 80m
            memory: 30Mi
        securityContext:
          privileged: false
          runAsGroup: 101
          runAsNonRoot: true
          runAsUser: 100
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /headlamp/plugins/logo
          name: logo
        - mountPath: /headlamp/plugins/kubeconfig-plugin
          name: kubeconfig-plugin
        - mountPath: /headlamp/plugins/sidebar_apps
          name: sidebar-apps
        - mountPath: /headlamp/plugins/sidebar_grafana
          name: sidebar-grafana
        - mountPath: /headlamp/plugins/sidebar_kyverno
          name: sidebar-kyverno
        - mountPath: /headlamp/plugins/sidebar_longhorn
          name: sidebar-longhorn
        - mountPath: /headlamp/plugins/sidebar_prometheus
          name: sidebar-prometheus
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: headlamp
      serviceAccountName: headlamp
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: logo
        name: logo
      - configMap:
          defaultMode: 420
          name: kubeconfig-plugin
        name: kubeconfig-plugin
      - configMap:
          defaultMode: 420
          name: sidebar-apps
        name: sidebar-apps
      - configMap:
          defaultMode: 420
          name: sidebar-grafana
        name: sidebar-grafana
      - configMap:
          defaultMode: 420
          name: sidebar-kyverno
        name: sidebar-kyverno
      - configMap:
          defaultMode: 420
          name: sidebar-longhorn
        name: sidebar-longhorn
      - configMap:
          defaultMode: 420
          name: sidebar-prometheus
        name: sidebar-prometheus
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-11-08T17:23:12Z"
    lastUpdateTime: "2024-11-08T17:23:12Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-11-08T17:23:10Z"
    lastUpdateTime: "2024-11-12T17:29:08Z"
    message: ReplicaSet "headlamp-5847d9f6c8" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Pod

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: f5d91f659ec1fd08c943a8c768114cdbc4d3084d1a25f5bc8a3573d8898e9fff
    cni.projectcalico.org/podIP: 192.168.4.9/32
    cni.projectcalico.org/podIPs: 192.168.4.9/32
  creationTimestamp: "2024-11-12T17:29:00Z"
  generateName: headlamp-5847d9f6c8-
  labels:
    app.kubernetes.io/instance: headlamp
    app.kubernetes.io/name: headlamp
    pod-template-hash: 5847d9f6c8
  name: headlamp-5847d9f6c8-hrwsr
  namespace: headlamp
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: headlamp-5847d9f6c8
    uid: 20da7856-a5f1-4fd3-a833-894018b9ef63
  resourceVersion: "4910002"
  uid: 55c9125b-e7df-443c-9806-77386b9860bd
spec:
  containers:
  - args:
    - -in-cluster
    - -plugins-dir=/headlamp/plugins
    - -oidc-client-id=$(OIDC_CLIENT_ID)
    - -oidc-client-secret=$(OIDC_CLIENT_SECRET)
    - -oidc-idp-issuer-url=$(OIDC_ISSUER_URL)
    - -oidc-scopes=$(OIDC_SCOPES)
    envFrom:
    - secretRef:
        name: oidc
    image: ghcr.io/headlamp-k8s/headlamp:v0.26.0
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: http
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: headlamp
    ports:
    - containerPort: 4466
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: http
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 150m
        memory: 50Mi
      requests:
        cpu: 80m
        memory: 30Mi
    securityContext:
      privileged: false
      runAsGroup: 101
      runAsNonRoot: true
      runAsUser: 100
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /headlamp/plugins/logo
      name: logo
    - mountPath: /headlamp/plugins/kubeconfig-plugin
      name: kubeconfig-plugin
    - mountPath: /headlamp/plugins/sidebar_apps
      name: sidebar-apps
    - mountPath: /headlamp/plugins/sidebar_grafana
      name: sidebar-grafana
    - mountPath: /headlamp/plugins/sidebar_kyverno
      name: sidebar-kyverno
    - mountPath: /headlamp/plugins/sidebar_longhorn
      name: sidebar-longhorn
    - mountPath: /headlamp/plugins/sidebar_prometheus
      name: sidebar-prometheus
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-hfnn7
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: <redacted>
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: headlamp
  serviceAccountName: headlamp
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: logo
    name: logo
  - configMap:
      defaultMode: 420
      name: kubeconfig-plugin
    name: kubeconfig-plugin
  - configMap:
      defaultMode: 420
      name: sidebar-apps
    name: sidebar-apps
  - configMap:
      defaultMode: 420
      name: sidebar-grafana
    name: sidebar-grafana
  - configMap:
      defaultMode: 420
      name: sidebar-kyverno
    name: sidebar-kyverno
  - configMap:
      defaultMode: 420
      name: sidebar-longhorn
    name: sidebar-longhorn
  - configMap:
      defaultMode: 420
      name: sidebar-prometheus
    name: sidebar-prometheus
  - name: kube-api-access-hfnn7
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-11-12T17:29:08Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-11-12T17:29:00Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-11-12T17:29:08Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-11-12T17:29:08Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-11-12T17:29:00Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://f8b9e1067abe28139478bc338826e71f28c9e0d25c0b46a724d23d43d02ae030
    image: ghcr.io/headlamp-k8s/headlamp:v0.26.0
    imageID: ghcr.io/headlamp-k8s/headlamp@sha256:c47fd232a8be2a8756706e3c2af13f23787b0bf1276831b711fa5eaef17390b2
    lastState: {}
    name: headlamp
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-11-12T17:29:07Z"
  hostIP: 10.105.148.72
  hostIPs:
  - ip: 10.105.148.72
  phase: Running
  podIP: 192.168.4.9
  podIPs:
  - ip: 192.168.4.9
  qosClass: Burstable
  startTime: "2024-11-12T17:29:00Z"

If you need to see any manifest data for our Prometheus / Thanos deployment, let me know. The version of Thanos we are running is thanos:0.35.1-debian-12-r2, and the Prometheus Operator version is prometheus-operator:v0.75.0.

My week is fairly open since most of my team is at KubeCon, so if this doesn't help I can hop on a call and do a screen share to look at whatever you'd like.