K3S emitting duplicated metrics in all endpoints (Api server, kubelet, kube-proxy, kube-scheduler, etc)

Bug Description

Kuberentes Documentation - System Metrics details which Kubernetes components expose metrics in Prometheus format:

These components are:

kube-controller-manager (exposing /metrics endpoint at TCP 10257)
kube-proxy (exposing /metrics endpoint at TCP 10249)
kube-apiserver (exposing /metrics at Kubernetes API port)
kube-scheduler (exposing /metrics endpoint at TCP 10259)
kubelet (exposing /metrics, /metrics/cadvisor, /metrics/resource and /metrics/probes endpoints at TCP 10250)

K3S distribution has a special behavior since in each node only one process is deployed ( k3s-server running on master nodes or k3s-agent running on worker nodes) with all k8s components sharing the same memory.

K3s is emitting the same metrics, from all k8s components deployed in the node, at all '/metrics' endpoints available (api-server, kubelet (TCP 10250), kube-proxy (TCP 10249), kube-scheduler (TCP 10251), kube-controller-manager (TCP 10257). Thus, collecting from all port produces metrics duplicates.

kubelet additional metrics (endpoints /metrics/cadvisor, /metrics/resource and /metrics/probes) are only available at TCP 10250.

Enabling the scraping of all different metrics TCP ports (kubernetes components) causes the ingestion of duplicated metrics. Duplicated metrics in Prometheus need to be removed in order to reduce memory and CPU consumption.

Context Information

As stated in issue #22, there was a known issue in K3S: https://github.com/k3s-io/k3s/issues/2262, where duplicated metrics are emitted by the three components (kube-proxy, kube-scheduler and kube-controller-manager). The proposed solution by Rancher Monitoring(https://github.com/k3s-io/k3s/issues/2262), was to avoid the scrape of duplicated metrics and activate only the service monitoring of one of the components. (i.e. kube-proxy). That solution was implemented (see https://github.com/ricsanfre/pi-cluster/issues/22#issuecomment-986224709) and it solved the main issue (out-of-memory).

Endpoints currently being scrapped by Prometheus are

api-server (TCP 6553)
kubelet (TCP 10250)
kube-proxy (TCP 10249)

Duplicated metrics

After deeper analysis on the metrics scrapped by Prometheus, it is clear that K3S is emitting duplicated metrics in all endpoints.

Example 1: API-server metrics emitted by kube-proxy, kubelet and api-server endpoints running on master server

Example 2: kubelet metrics emitted by kube-proxy, kubelet and api-server

Example3: kubepoxy metrics: kubeproxy_sync_proxy_rules_duration_seconds_bucket{le="0.001"}

Procedure for obtaining raw metrics exposed by K3S.

The procedure described here https://github.com/SUSE/doc-caasp/issues/166#issuecomment-476191064 can be used to manually query https metrics endpoints. Most recent versions of Kubernetes are moving all metrics endpoint to use https.

For example: TCP ports numbers exposed by kube-scheduler and kube-controller-manager have changed from kubernetes release 1.22 (from 10251/10252 to 10257/10259) and now require https authenticated connection. Kubernetes authorized service account is needed. Only kube-proxy endpoint remains open using HTTP, the rest of the ports are now using HTTPS

The procedure specified above creates a service account with not enough privileges to query directly kubelet metrics endpoints. The following service account, role and rolebinding resources need to be created:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring
  namespace: kube-system
secrets:
- name: monitoring-secret-token
---
apiVersion: v1
kind: Secret
metadata:
  name: monitoring-secret-token
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: monitoring
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-clusterrole
  namespace: kube-system
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/metrics
  - pods
  verbs: ["get", "list"]
- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring-clusterrole-binding
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: monitoring-clusterrole
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: monitoring
  namespace: kube-system

Following script can be used to automatically extract metrics from kubelet, kube-proxy and apiserver endpoints and compare the results:

#!/bin/bash

# Get token
TOKEN=$(kubectl -n kube-system get secrets monitoring-secret-token -ojsonpath='{.data.token}' | base64 -d)

APISERVER=$(kubectl config view | grep server | cut -f 2- -d ":" | tr -d " ")

# Get apiserver
curl -ks $APISERVER/metrics  --header "Authorization: Bearer $TOKEN" | grep -v "# " > apiserver.txt

# Get list of nodes of k3s cluster from api server and iterate over it
for i in `kubectl get nodes -o json | jq -r '.items[].status.addresses[0].address'`; do
  echo "Getting metrics from node $i"
  curl -ks https://$i:10250/metrics --header "Authorization: Bearer $TOKEN" | grep -v "# " > kubelet_$i.txt
  curl -ks https://$i:10250/metrics/cadvisor --header "Authorization: Bearer $TOKEN" | grep -v "# " > kubelet_cadvisor_$i.txt
  curl -ks http://$i:10249/metrics | grep -v "# " > kubeproxy_$i.txt
done

# Get kube-controller and kube-scheduler

for i in `kubectl get nodes -o json | jq -r '.items[] | select(.metadata.labels."node-role.kubernetes.io/master" != null) | .status.addresses[0].address'`; do
  echo "Getting metrics from master node $i"
  curl -ks https://$i:10259/metrics --header "Authorization: Bearer $TOKEN" | grep -v "# " > kube-scheduler_$i.txt
  curl -ks https://$i:10257/metrics --header "Authorization: Bearer $TOKEN" | grep -v "# " > kube-controller_$i.txt
done

Analyzing the results

Executing the previous script, the following files contains the metrics extracted from each of the exposed ports in each of the nodes of the cluster:

apiserver.txt kube-controller_node1.txt kubelet_cadvisor_node1.txt kubelet_cadvisor_node2.txt kubelet_cadvisor_node3.txt kubelet_cadvisor_node4.txt kubelet_node1.txt kubelet_node2.txt kubelet_node3.txt kubelet_node4.txt kubeproxy_node1.txt kubeproxy_node2.txt kubeproxy_node3.txt kubeproxy_node4.txt kube-scheduler_node1.txt

Checking metrics extracted from node1 (master) endpoints, all ports are exposing the same number of metrics:
```
~$ wc -l kubelet_node1.txt 
40666 kubelet_node1.txt
~$ wc -l kubeproxy_node1.txt 
40666 kubeproxy_node1.txt
~$ wc -l kube-controller_node1.txt 
40666 kube-controller_node1.txt
~$ wc -l kube-scheduler_node1.txt 
40666 kube-scheduler_node1.txt
~$ wc -l apiserver.txt 
40666 apiserver.txt
```
The metrics in the files are the same, when applying diff command the only differences showed are the values in some of the metrics (counters/seconds). This is due to that the different ports are polled in different times so, the counter of seconds type metric is showing different values
Checking metrics extracted from node2 (worker) endpoints, all ports are exposing the same number of metrics:
```
~$ wc -l kubelet_node2.txt 
1723 kubelet_node2.txt
~$ wc -l kubeproxy_node2.txt 
1723 kubeproxy_node2.txt
```
and again the differences are only in the values of counters(seconds) type metrics.

Conclusion

To get all k3s metrics it is enough with collecting metrics from kubelet endpoints (/metrics, /metrics/cadvisor and /metrics/probe) in all nodes

Possible solution.

Enabling only monitoring of kubelet endpoints /metrics, /metrics/cadvisor and /metrics/probes available on TCP port 10250, so all metrics can be collected. This is the same solution rancher monitoring chart seems to be using (https://github.com/rancher/rancher/issues/29445).

Changes to be implemented:

1) Remove from kube-prometheus-stack chart the creation of objects for monitoring all kubernetes componentes (including apiserver and kubelet).

```yml
prometheusOperator:
  kubeletService:
    enabled: false
kubelet:
  enabled: false
kubeApiServer:
  enabled: false
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false
```

2) Create headless service pointing to TCP 10250 port of all k3s nodes.

```yml
---
# Headless service for K3S metrics. No selector
apiVersion: v1
kind: Service
metadata:
  name: k3s-metrics-service
  labels:
    app.kubernetes.io/name: k3s
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10250
    protocol: TCP
    targetPort: 10250
  type: ClusterIP
---
# Endpoint for the headless service without selector
apiVersion: v1
kind: Endpoints
metadata:
  name: k3s-metrics-service
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.0.11
  - ip: 10.0.0.12
  - ip: 10.0.0.13
  - ip: 10.0.0.14
  ports:
  - name: https-metrics
    port: 10250
    protocol: TCP
```

3) Create a single ServiceMonitor resource to enable the collection of all k8s components metrics from unique port TCP 10250. This ServiceMonitor should include all relabeling rules defined by default by the ServiceMonitor resources that kube-prometheus-stack chart creates by default for each individual k8s component.

```yml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: kube-prometheus-stack
  name: k3s-monitoring
  namespace: k3s-monitoring
spec:
  endpoints:
  # /metrics endpoint
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    metricRelabelings:
    # apiserver
    - action: drop
      regex: apiserver_request_duration_seconds_bucket;(0.15|0.2|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2|3|3.5|4|4.5|6|7|8|9|15|25|40|50)
      sourceLabels:
      - __name__
      - le
    port: https-metrics
    relabelings:
    - action: replace
      sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecureSkipVerify: true
  # /metrics/cadvisor
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    metricRelabelings:
    - action: drop
      regex: container_cpu_(cfs_throttled_seconds_total|load_average_10s|system_seconds_total|user_seconds_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: container_memory_(mapped_file|swap)
      sourceLabels:
      - __name__
    - action: drop
      regex: container_(file_descriptors|tasks_state|threads_max)
      sourceLabels:
      - __name__
    - action: drop
      regex: container_spec.*
      sourceLabels:
      - __name__
    path: /metrics/cadvisor
    port: https-metrics
    relabelings:
    - action: replace
      sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecureSkipVerify: true
    # /metrics/probes
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    path: /metrics/probes
    port: https-metrics
    relabelings:
    - action: replace
      sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: k3s
```

4) Add manually Grafana dashboards corresponding to K8s components (api-server, kubelet, proxy, etc.). They are not installed when disabling monitoring of k8s components in kube-prometheus-stack chart installation:

  - kubelet dashboard: [ID 16361](https://grafana.com/grafana/dashboards/16361-kubernetes-kubelet/)
  - apiserver dashboard [ID 12654](https://grafana.com/grafana/dashboards/12654-kubernetes-api-server)
  - etcd dashboard [ID 16359](https://grafana.com/grafana/dashboards/16359-etcd/)
  - kube-scheduler [ID 12130](https://grafana.com/grafana/dashboards/12130-kubernetes-scheduler/)
  - kube-controller-manager [ID 12122](https://grafana.com/grafana/dashboards/12122-kubernetes-controller-manager)
  - kube-proxy [ID 12129](https://grafana.com/grafana/dashboards/12129-kubernetes-proxy)

5) Add manually PrometheusRules of the disabled components. Chart also does not install them when disabling its monitoring.

  kube-prometheus-stack creates different PrometheusRules resources, but all of them are included in single manifest file in source repository (https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/kubernetesControlPlane-prometheusRule.yaml)

NOTE: Both PrometheusRules and Grafana Dashboards might need modifications. It includes metrics filtered by job label (kubelet, apiserver, etc.) and with the proposed solution only job label "k3s" will be used

Final solution setting job label to "kubelet" for all metrics scrapped for k3s components through kubelet port. This way only a few dashboards need to be changed. (kube-proxy, kube-controller-manager and apiserver).

Selecting a different name such as "k3s" (initial proposed solution) makes that all default kube-prometheus-stack dashboards using kubelet metrics (container metrics), need to be updated. For example: The following dashboards use "job=kubelet" when filtering the metrics. Kubernetes - Compute Resources /Cluster Kubernetes - Compute Resources / Namespace (Pods) Kubernetes - Compute Resources / Namespace (Workloads)

@ricsanfre First this repo and the accompanying website are awesome. Thanks for your efforts.

Regarding this issue I want to let you know that I've solved it in a little bit of a different manner that ensures that the kube-prometheus-stack chart is still creating the rules and grafana dashboards thus eliminating the need to manually handle this step.

So instead of disabling all the components in the Helm chart I actually keep them enabled but instruct all but the kubelet ServiceMonitor to drop all the metrics they scrape

e.g. This is how I defined my kubeApiServer section in my values.yaml file

kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__

I have a similar snippet for kubeControllerManager, kubeProxy, and kubeScheduler

With this the Chart is still creating the rules and dashboards without ingesting duplicate metrics. Only metrics from the kubelet are kept.

Now the rules and dashboards created by the chart refer to a job that needs to be replaced with kubelet so I make use of a very simple Argo CD Config Management Plugin.

In the init command I use helm template to generate the templates and then in the generate command I run a couple of sed commands that replace the job values with kubelet.

The end result is

All rules and dashboards are automatically created by the chart with the correct job values
Only 1 copy of metrics is ingested (The ones from the kubelet endpoint)

The only drawback is that although Prometheus doesn't ingest duplicate metrics it still ends up scraping multiple end points and dropping the metrics from these endpoints which of course means relatively higher CPU and memory usage.

One idea that just occurred to me to address the drawback is to set the interval of the ServiceMonitor to a very high value thus technically preventing Prometheus from even scraping the end points.

@sherif-fanous, thank you so much for sharing your ideas.

Would it be possible to share your values.yaml and especially a small example how to run the sed commands with the Config Management Plugin?

The relevant sections of my values.yaml. Keep in mind this is a k3s single node cluster running on TrueNAS Scale. You might have a slightly different setup than mine especially regarding etcd and kube-proxy

kubeApiServer:
  serviceMonitor:
    interval: 1d
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__

kubeControllerManager:
  endpoints:
    - 192.168.4.59
  serviceMonitor:
    https: true
    insecureSkipVerify: true
    interval: 1d
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__

kubeEtcd:
  enabled: false

kubelet:
  serviceMonitor:
    metricRelabelings:
      - action: drop
        regex: apiserver_request_duration_seconds_bucket;(0.15|0.2|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2|3|3.5|4|4.5|6|7|8|9|15|25|40|50)
        sourceLabels:
          - __name__
          - le

kubeProxy:
  enabled: false

kubeScheduler:
  endpoints:
    - 192.168.4.59
  serviceMonitor:
    https: true
    insecureSkipVerify: true
    interval: 1d
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__

The sed command is in the Argo CD Application manifest. Here's what it looks like

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  annotations:
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
    argocd.argoproj.io/sync-wave: '32'
  finalizers:
    - resources-finalizer.argocd.argoproj.io
  name: kube-prometheus-stack
  namespace: argo-cd
spec:
  destination:
    namespace: kube-prometheus-stack
    server: https://kubernetes.default.svc
  project: default
  source:
    chart: kube-prometheus-stack
    repoURL: https://prometheus-community.github.io/helm-charts
    targetRevision: 58.2.1
  sources:
    - chart: kube-prometheus-stack
      plugin:
        name: config-management-plugin-template
        parameters:
          - name: generate-command
            string: >-
              sed -E -i 's/job="(apiserver|kube-scheduler|kube-controller-manager)"/job="kubelet"/g' ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml && sed -E -i 's/job=\\"(apiserver|kube-scheduler|kube-controller-manager)\\"/job=\\"kubelet\\"/g' ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml && sed -E -i 's/sum\(up\{cluster=\\"\$cluster\\", job=\\"kubelet\\"\}\)/sum\(up\{cluster=\\"\$cluster\\",job=\\"kubelet\\", metrics_path=\\"\/metrics\\"\}\)/g' ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml && cat ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml
          - name: init-command
            string: >-
              mkdir -p ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/ && helm template . --create-namespace --namespace prometheus-stack --values ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/helm/values/base/helm-kube-prometheus-stack-values.yaml --values ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/helm/values/overlays/truenas-mini-x-plus/helm-kube-prometheus-stack-values.yaml >
              ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml
      repoURL: https://prometheus-community.github.io/helm-charts
      targetRevision: 58.2.1
    - path: kubernetes/apps/kube-prometheus-stack/kustomize/overlays/truenas-mini-x-plus
      repoURL: git@github.com:ifanous/home-lab.git
      targetRevision: HEAD
    - ref: root
      repoURL: git@github.com:ifanous/home-lab.git
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
      limit: 5
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

P.S. My repo is private so you won't be able to access it but still everything you need is in this thread, just replace every reference to my repo with yours.

You also need to setup Argo CD to use a CMP plugin. At a high level here's what I'm doing in my Argo CD values.yaml

configs:
  cmp:
    create: true
    plugins:
      config-management-plugin-template:
        generate:
          args:
            - |
              echo "Starting generate phase for application $ARGOCD_APP_NAME" 1>&2;
              echo "Executing $PARAM_GENERATE_COMMAND" 1>&2;
              eval $PARAM_GENERATE_COMMAND;
              echo "Successfully completed generate phase for application $ARGOCD_APP_NAME" 1>&2;
          command: [/bin/sh, -c]
        init:
          args:
            - |
              echo "Starting init phase for application $ARGOCD_APP_NAME" 1>&2;
              echo "Starting a partial treeless clone of repo ifanous/home-lab.git" 1>&2; mkdir ifanous 1>&2; cd ifanous 1>&2; git clone -n --depth=1 --filter=tree:0 https://$IFANOUS_HOME_LAB_HTTPS_USERNAME:$IFANOUS_HOME_LAB_HTTPS_PASSWORD@github.com/ifanous/home-lab.git 1>&2; cd home-lab/ 1>&2; git sparse-checkout set --no-cone $ARGOCD_APP_NAME 1>&2; git checkout 1>&2;
              echo "Successfully completed a partial treeless clone of repo ifanous/home-lab.git" 1>&2;
              echo "Executing $PARAM_INIT_COMMAND" 1>&2;
              cd ../../ 1>&2; eval $PARAM_INIT_COMMAND;
              echo "Successfully completed init phase for application $ARGOCD_APP_NAME" 1>&2;
          command: ["/bin/sh", "-c"]

repoServer:
  extraContainers:
    - args:
        - '--logformat=json'
        - '--loglevel=debug'
      command:
        - /var/run/argocd/argocd-cmp-server
      env:
        - name: IFANOUS_HOME_LAB_HTTPS_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: argocd-repo-creds-ifanous-home-lab-https
        - name: IFANOUS_HOME_LAB_HTTPS_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: argocd-repo-creds-ifanous-home-lab-https
      image: alpine/k8s:1.29.2
      name: config-management-plugin-template
      resources:
        limits:
          memory: 512Mi
        requests:
          memory: 64Mi
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
      volumeMounts:
        - mountPath: /var/run/argocd
          name: var-files
        - mountPath: /home/argocd/cmp-server/plugins
          name: plugins
        - mountPath: /home/argocd/cmp-server/config/plugin.yaml
          name: argocd-cmp-cm
          subPath: config-management-plugin-template.yaml
        - mountPath: /tmp
          name: cmp-tmp

Thank you very much!

ricsanfre / pi-cluster