Closed: ricsanfre closed this issue 2 years ago.
The procedure described in https://github.com/SUSE/doc-caasp/issues/166#issuecomment-476191064 can be used to manually query HTTPS metrics endpoints. Recent versions of Kubernetes are moving all metrics endpoints to HTTPS.
For example, the TCP ports exposed by kube-scheduler and kube-controller-manager changed in Kubernetes 1.22 (from 10251/10252 to 10257/10259) and now require an authenticated HTTPS connection using an authorized Kubernetes service account. Only the kube-proxy endpoint remains open over HTTP; the rest of the ports now use HTTPS.
The service account created by the procedure above does not have enough privileges to query the kubelet metrics endpoints directly. The following ServiceAccount, Secret, ClusterRole and ClusterRoleBinding resources need to be created instead:
```yml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring
  namespace: kube-system
secrets:
  - name: monitoring-secret-token
---
apiVersion: v1
kind: Secret
metadata:
  name: monitoring-secret-token
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: monitoring
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # ClusterRoles are cluster-scoped, so no namespace is set
  name: monitoring-clusterrole
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - pods
    verbs: ["get", "list"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring-clusterrole-binding
roleRef:
  kind: ClusterRole
  name: monitoring-clusterrole
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: monitoring
    namespace: kube-system
```
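Once these resources are applied, the bearer token can be read back from the secret. Note that the `token` field returned by the API is base64-encoded and must be decoded before use. A minimal sketch (the live `kubectl` call is shown as a comment; a sample encoded value stands in for the real secret data):

```shell
# On a live cluster the encoded token would come from:
#   kubectl -n kube-system get secret monitoring-secret-token -o jsonpath='{.data.token}'
# Here a sample base64 value stands in for the real secret data.
ENCODED="bXktc2FtcGxlLXRva2Vu"
TOKEN=$(echo -n "$ENCODED" | base64 -d)
# The decoded value is what goes into the "Authorization: Bearer" header.
echo "$TOKEN"
```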
The following script can be used to automatically extract metrics from the kubelet, kube-proxy and apiserver endpoints and compare the results:
```bash
#!/bin/bash

# Get the monitoring service account token
TOKEN=$(kubectl -n kube-system get secrets monitoring-secret-token -o jsonpath='{.data.token}' | base64 -d)
APISERVER=$(kubectl config view | grep server | cut -f 2- -d ":" | tr -d " ")

# Get apiserver metrics
curl -ks "$APISERVER/metrics" --header "Authorization: Bearer $TOKEN" | grep -v "# " > apiserver.txt

# Get the list of k3s cluster nodes from the API server and iterate over it
for i in $(kubectl get nodes -o json | jq -r '.items[].status.addresses[0].address'); do
  echo "Getting metrics from node $i"
  curl -ks "https://$i:10250/metrics" --header "Authorization: Bearer $TOKEN" | grep -v "# " > "kubelet_$i.txt"
  curl -ks "https://$i:10250/metrics/cadvisor" --header "Authorization: Bearer $TOKEN" | grep -v "# " > "kubelet_cadvisor_$i.txt"
  curl -ks "http://$i:10249/metrics" | grep -v "# " > "kubeproxy_$i.txt"
done

# Get kube-scheduler and kube-controller-manager metrics from master nodes only
for i in $(kubectl get nodes -o json | jq -r '.items[] | select(.metadata.labels."node-role.kubernetes.io/master" != null) | .status.addresses[0].address'); do
  echo "Getting metrics from master node $i"
  curl -ks "https://$i:10259/metrics" --header "Authorization: Bearer $TOKEN" | grep -v "# " > "kube-scheduler_$i.txt"
  curl -ks "https://$i:10257/metrics" --header "Authorization: Bearer $TOKEN" | grep -v "# " > "kube-controller_$i.txt"
done
```
After executing the script, the following files contain the metrics extracted from each of the exposed ports on each node of the cluster:

```
apiserver.txt  kube-controller_node1.txt  kubelet_cadvisor_node1.txt  kubelet_cadvisor_node2.txt  kubelet_cadvisor_node3.txt  kubelet_cadvisor_node4.txt  kubelet_node1.txt  kubelet_node2.txt  kubelet_node3.txt  kubelet_node4.txt  kubeproxy_node1.txt  kubeproxy_node2.txt  kubeproxy_node3.txt  kubeproxy_node4.txt  kube-scheduler_node1.txt
```
Checking the metrics extracted from the node1 (master) endpoints, all ports expose the same number of metrics:

```
~$ wc -l kubelet_node1.txt
40666 kubelet_node1.txt
~$ wc -l kubeproxy_node1.txt
40666 kubeproxy_node1.txt
~$ wc -l kube-controller_node1.txt
40666 kube-controller_node1.txt
~$ wc -l kube-scheduler_node1.txt
40666 kube-scheduler_node1.txt
~$ wc -l apiserver.txt
40666 apiserver.txt
```
The metrics in these files are the same: when applying the diff command, the only differences shown are the values of some metrics (counters/seconds). This is because the different ports are polled at different times, so counter- and seconds-type metrics show different values.
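The claim that only sample values differ can be checked mechanically: strip the value from each line, keep only the metric name and labels, and compare the sorted series sets. A small sketch using two inline sample files (the file names and contents are illustrative, not taken from the cluster above):

```bash
# Two sample dumps of the same metrics scraped at different times:
# identical series, different counter values.
printf 'up{job="a"} 1\nprocess_cpu_seconds_total{job="a"} 42.1\n' > endpoint1.txt
printf 'up{job="a"} 1\nprocess_cpu_seconds_total{job="a"} 57.9\n' > endpoint2.txt

# Drop the sample value (last field), sort, and compare the series names.
series() { awk '{$NF=""; print}' "$1" | sort; }
if diff <(series endpoint1.txt) <(series endpoint2.txt) > /dev/null; then
  echo "same series, only values differ"
fi
```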
Checking the metrics extracted from the node2 (worker) endpoints, all ports likewise expose the same number of metrics:

```
~$ wc -l kubelet_node2.txt
1723 kubelet_node2.txt
~$ wc -l kubeproxy_node2.txt
1723 kubeproxy_node2.txt
```
Again, the only differences are in the values of counter (seconds) type metrics.
To get all k3s metrics, it is enough to collect metrics from the kubelet endpoints (`/metrics`, `/metrics/cadvisor` and `/metrics/probes`) on all nodes.
Only the kubelet endpoints `/metrics`, `/metrics/cadvisor` and `/metrics/probes`, all available on TCP port 10250, need to be monitored to collect every metric. This is the same solution the Rancher monitoring chart seems to be using (https://github.com/rancher/rancher/issues/29445).
Changes to be implemented:
1) Remove from the kube-prometheus-stack chart the creation of objects for monitoring all Kubernetes components (including `apiserver` and `kubelet`):
```yml
prometheusOperator:
  kubeletService:
    enabled: false
kubelet:
  enabled: false
kubeApiServer:
  enabled: false
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false
```
2) Create a headless service pointing to TCP port 10250 on all k3s nodes.
```yml
---
# Headless service for K3S metrics. No selector
apiVersion: v1
kind: Service
metadata:
  name: k3s-metrics-service
  labels:
    app.kubernetes.io/name: k3s
  namespace: kube-system
spec:
  clusterIP: None
  ports:
    - name: https-metrics
      port: 10250
      protocol: TCP
      targetPort: 10250
  type: ClusterIP
---
# Endpoints for the headless service without selector
apiVersion: v1
kind: Endpoints
metadata:
  name: k3s-metrics-service
  namespace: kube-system
subsets:
  - addresses:
      - ip: 10.0.0.11
      - ip: 10.0.0.12
      - ip: 10.0.0.13
      - ip: 10.0.0.14
    ports:
      - name: https-metrics
        port: 10250
        protocol: TCP
```
3) Create a single `ServiceMonitor` resource to collect the metrics of all k8s components from the single TCP port 10250. This ServiceMonitor should include all the relabeling rules that the per-component `ServiceMonitor` resources created by default by the kube-prometheus-stack chart define:
```yml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: kube-prometheus-stack
  name: k3s-monitoring
  namespace: k3s-monitoring
spec:
  endpoints:
    # /metrics endpoint
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      honorLabels: true
      metricRelabelings:
        # apiserver
        - action: drop
          regex: apiserver_request_duration_seconds_bucket;(0.15|0.2|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2|3|3.5|4|4.5|6|7|8|9|15|25|40|50)
          sourceLabels:
            - __name__
            - le
      port: https-metrics
      relabelings:
        - action: replace
          sourceLabels:
            - __metrics_path__
          targetLabel: metrics_path
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
    # /metrics/cadvisor endpoint
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      honorLabels: true
      metricRelabelings:
        - action: drop
          regex: container_cpu_(cfs_throttled_seconds_total|load_average_10s|system_seconds_total|user_seconds_total)
          sourceLabels:
            - __name__
        - action: drop
          regex: container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)
          sourceLabels:
            - __name__
        - action: drop
          regex: container_memory_(mapped_file|swap)
          sourceLabels:
            - __name__
        - action: drop
          regex: container_(file_descriptors|tasks_state|threads_max)
          sourceLabels:
            - __name__
        - action: drop
          regex: container_spec.*
          sourceLabels:
            - __name__
      path: /metrics/cadvisor
      port: https-metrics
      relabelings:
        - action: replace
          sourceLabels:
            - __metrics_path__
          targetLabel: metrics_path
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
    # /metrics/probes endpoint
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      honorLabels: true
      path: /metrics/probes
      port: https-metrics
      relabelings:
        - action: replace
          sourceLabels:
            - __metrics_path__
          targetLabel: metrics_path
      scheme: https
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: k3s
```
4) Manually add the Grafana dashboards corresponding to the k8s components (api-server, kubelet, proxy, etc.). They are not installed when the monitoring of k8s components is disabled in the kube-prometheus-stack chart installation:
- kubelet dashboard: [ID 16361](https://grafana.com/grafana/dashboards/16361-kubernetes-kubelet/)
- apiserver dashboard: [ID 12654](https://grafana.com/grafana/dashboards/12654-kubernetes-api-server)
- etcd dashboard: [ID 16359](https://grafana.com/grafana/dashboards/16359-etcd/)
- kube-scheduler dashboard: [ID 12130](https://grafana.com/grafana/dashboards/12130-kubernetes-scheduler/)
- kube-controller-manager dashboard: [ID 12122](https://grafana.com/grafana/dashboards/12122-kubernetes-controller-manager)
- kube-proxy dashboard: [ID 12129](https://grafana.com/grafana/dashboards/12129-kubernetes-proxy)
5) Manually add the `PrometheusRules` of the disabled components. The chart also does not install them when their monitoring is disabled.
kube-prometheus-stack creates several different PrometheusRules resources, but all of them are included in a single manifest file in the source repository (https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/kubernetesControlPlane-prometheusRule.yaml).
NOTE: Both the PrometheusRules and the Grafana dashboards might need modifications: they filter metrics by job label (kubelet, apiserver, etc.), and with the proposed solution only the job label "k3s" would be used.
The final solution sets the job label to "kubelet" for all metrics scraped from k3s components through the kubelet port. This way only a few dashboards need to be changed (kube-proxy, kube-controller-manager and apiserver).
Selecting a different name such as "k3s" (the initially proposed solution) would force updates to all default kube-prometheus-stack dashboards that use kubelet (container) metrics. For example, the following dashboards use `job="kubelet"` when filtering metrics:
- Kubernetes / Compute Resources / Cluster
- Kubernetes / Compute Resources / Namespace (Pods)
- Kubernetes / Compute Resources / Namespace (Workloads)
@ricsanfre First, this repo and the accompanying website are awesome. Thanks for your efforts.
Regarding this issue, I want to let you know that I've solved it in a slightly different manner that ensures the kube-prometheus-stack chart still creates the rules and Grafana dashboards, thus eliminating the need to handle this step manually.
So instead of disabling all the components in the Helm chart, I actually keep them enabled but instruct all ServiceMonitors except the `kubelet` one to drop all the metrics they scrape.
For example, this is how I defined the `kubeApiServer` section in my values.yaml file:
```yml
kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__
```
I have a similar snippet for `kubeControllerManager`, `kubeProxy`, and `kubeScheduler`.
With this, the chart still creates the rules and dashboards without ingesting duplicate metrics. Only metrics from the kubelet are kept.
Now, the rules and dashboards created by the chart refer to a job that needs to be replaced with `kubelet`, so I make use of a very simple Argo CD Config Management Plugin. In the `init` command I use `helm template` to generate the templates, and then in the `generate` command I run a couple of `sed` commands that replace the job values with `kubelet`.
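To illustrate what the `sed` rewrite does, here is a simplified, self-contained sketch (the sample file and rule expression are illustrative, not actual chart output):

```bash
# A sample PrometheusRule expression as the chart might render it.
echo 'expr: up{job="apiserver"} == 0' > rule.yaml

# Rewrite the job label so the rule matches metrics scraped via the kubelet port.
sed -E -i 's/job="(apiserver|kube-scheduler|kube-controller-manager)"/job="kubelet"/g' rule.yaml

cat rule.yaml   # expr: up{job="kubelet"} == 0
```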
The end result is that the rules, dashboards, and metrics all refer to the single `kubelet` endpoint. The only drawback is that although Prometheus doesn't ingest duplicate metrics, it still ends up scraping multiple endpoints and dropping the metrics from them, which of course means relatively higher CPU and memory usage.
One idea that just occurred to me to address this drawback is to set the `interval` of these ServiceMonitors to a very high value, thus technically preventing Prometheus from even scraping the endpoints.
@sherif-fanous, thank you so much for sharing your ideas.
Would it be possible to share your values.yaml, and especially a small example of how to run the `sed` commands with the Config Management Plugin?
Here are the relevant sections of my values.yaml. Keep in mind this is a k3s single-node cluster running on TrueNAS Scale; you might have a slightly different setup than mine, especially regarding `etcd` and `kube-proxy`:
```yml
kubeApiServer:
  serviceMonitor:
    interval: 1d
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__
kubeControllerManager:
  endpoints:
    - 192.168.4.59
  serviceMonitor:
    https: true
    insecureSkipVerify: true
    interval: 1d
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__
kubeEtcd:
  enabled: false
kubelet:
  serviceMonitor:
    metricRelabelings:
      - action: drop
        regex: apiserver_request_duration_seconds_bucket;(0.15|0.2|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2|3|3.5|4|4.5|6|7|8|9|15|25|40|50)
        sourceLabels:
          - __name__
          - le
kubeProxy:
  enabled: false
kubeScheduler:
  endpoints:
    - 192.168.4.59
  serviceMonitor:
    https: true
    insecureSkipVerify: true
    interval: 1d
    metricRelabelings:
      - action: drop
        regex: .*
        sourceLabels:
          - __name__
```
The `sed` command is in the Argo CD Application manifest. Here's what it looks like:
```yml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  annotations:
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
    argocd.argoproj.io/sync-wave: '32'
  finalizers:
    - resources-finalizer.argocd.argoproj.io
  name: kube-prometheus-stack
  namespace: argo-cd
spec:
  destination:
    namespace: kube-prometheus-stack
    server: https://kubernetes.default.svc
  project: default
  source:
    chart: kube-prometheus-stack
    repoURL: https://prometheus-community.github.io/helm-charts
    targetRevision: 58.2.1
  sources:
    - chart: kube-prometheus-stack
      plugin:
        name: config-management-plugin-template
        parameters:
          - name: generate-command
            string: >-
              sed -E -i 's/job="(apiserver|kube-scheduler|kube-controller-manager)"/job="kubelet"/g' ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml && sed -E -i 's/job=\\"(apiserver|kube-scheduler|kube-controller-manager)\\"/job=\\"kubelet\\"/g' ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml && sed -E -i 's/sum\(up\{cluster=\\"\$cluster\\", job=\\"kubelet\\"\}\)/sum\(up\{cluster=\\"\$cluster\\",job=\\"kubelet\\", metrics_path=\\"\/metrics\\"\}\)/g' ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml && cat ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml
          - name: init-command
            string: >-
              mkdir -p ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/ && helm template . --create-namespace --namespace prometheus-stack --values ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/helm/values/base/helm-kube-prometheus-stack-values.yaml --values ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/helm/values/overlays/truenas-mini-x-plus/helm-kube-prometheus-stack-values.yaml >
              ./ifanous/home-lab/kubernetes/apps/kube-prometheus-stack/template/truenas-mini-x-plus/all.yaml
      repoURL: https://prometheus-community.github.io/helm-charts
      targetRevision: 58.2.1
    - path: kubernetes/apps/kube-prometheus-stack/kustomize/overlays/truenas-mini-x-plus
      repoURL: git@github.com:ifanous/home-lab.git
      targetRevision: HEAD
    - ref: root
      repoURL: git@github.com:ifanous/home-lab.git
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
      limit: 5
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
```
P.S. My repo is private, so you won't be able to access it, but everything you need is in this thread; just replace every reference to my repo with yours.
You also need to set up Argo CD to use a CMP plugin. At a high level, here's what I'm doing in my Argo CD values.yaml:
```yml
configs:
  cmp:
    create: true
    plugins:
      config-management-plugin-template:
        generate:
          args:
            - |
              echo "Starting generate phase for application $ARGOCD_APP_NAME" 1>&2;
              echo "Executing $PARAM_GENERATE_COMMAND" 1>&2;
              eval $PARAM_GENERATE_COMMAND;
              echo "Successfully completed generate phase for application $ARGOCD_APP_NAME" 1>&2;
          command: [/bin/sh, -c]
        init:
          args:
            - |
              echo "Starting init phase for application $ARGOCD_APP_NAME" 1>&2;
              echo "Starting a partial treeless clone of repo ifanous/home-lab.git" 1>&2; mkdir ifanous 1>&2; cd ifanous 1>&2; git clone -n --depth=1 --filter=tree:0 https://$IFANOUS_HOME_LAB_HTTPS_USERNAME:$IFANOUS_HOME_LAB_HTTPS_PASSWORD@github.com/ifanous/home-lab.git 1>&2; cd home-lab/ 1>&2; git sparse-checkout set --no-cone $ARGOCD_APP_NAME 1>&2; git checkout 1>&2;
              echo "Successfully completed a partial treeless clone of repo ifanous/home-lab.git" 1>&2;
              echo "Executing $PARAM_INIT_COMMAND" 1>&2;
              cd ../../ 1>&2; eval $PARAM_INIT_COMMAND;
              echo "Successfully completed init phase for application $ARGOCD_APP_NAME" 1>&2;
          command: ["/bin/sh", "-c"]
repoServer:
  extraContainers:
    - args:
        - '--logformat=json'
        - '--loglevel=debug'
      command:
        - /var/run/argocd/argocd-cmp-server
      env:
        - name: IFANOUS_HOME_LAB_HTTPS_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: argocd-repo-creds-ifanous-home-lab-https
        - name: IFANOUS_HOME_LAB_HTTPS_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: argocd-repo-creds-ifanous-home-lab-https
      image: alpine/k8s:1.29.2
      name: config-management-plugin-template
      resources:
        limits:
          memory: 512Mi
        requests:
          memory: 64Mi
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
      volumeMounts:
        - mountPath: /var/run/argocd
          name: var-files
        - mountPath: /home/argocd/cmp-server/plugins
          name: plugins
        - mountPath: /home/argocd/cmp-server/config/plugin.yaml
          name: argocd-cmp-cm
          subPath: config-management-plugin-template.yaml
        - mountPath: /tmp
          name: cmp-tmp
```
Thank you very much!
Bug Description
Kubernetes Documentation - System Metrics details which Kubernetes components expose metrics in Prometheus format.
These components are:
- kube-controller-manager (`/metrics` endpoint at TCP 10257)
- kube-proxy (`/metrics` endpoint at TCP 10249)
- apiserver (`/metrics` endpoint at the Kubernetes API port)
- kube-scheduler (`/metrics` endpoint at TCP 10259)
- kubelet (`/metrics`, `/metrics/cadvisor`, `/metrics/resource` and `/metrics/probes` endpoints at TCP 10250)
endpoints at TCP 10250)K3S distribution has a special behavior since in each node only one process is deployed (
k3s-server
running on master nodes ork3s-agent
running on worker nodes) with all k8s components sharing the same memory.K3s is emitting the same metrics, from all k8s components deployed in the node, at all '/metrics' endpoints available (api-server, kubelet (TCP 10250), kube-proxy (TCP 10249), kube-scheduler (TCP 10251), kube-controller-manager (TCP 10257). Thus, collecting from all port produces metrics duplicates.
The additional kubelet metrics endpoints (`/metrics/cadvisor`, `/metrics/resource` and `/metrics/probes`) are only available at TCP 10250.
Enabling the scraping of all the different metrics TCP ports (one per Kubernetes component) therefore causes the ingestion of duplicated metrics. These duplicated metrics need to be removed from Prometheus in order to reduce memory and CPU consumption.
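The duplication can be demonstrated by intersecting the metric names exposed by two different ports. A minimal sketch, using two small inline sample files in place of real endpoint dumps (the file names and metric lists are illustrative):

```bash
# Sample metric names as dumped from two different endpoints (illustrative).
printf 'apiserver_request_total\nkubelet_running_pods\n' > kubelet_port.txt
printf 'apiserver_request_total\nkubeproxy_sync_proxy_rules_duration_seconds_bucket\n' > kubeproxy_port.txt

# Metric names present on BOTH ports are duplicates that Prometheus would ingest twice.
comm -12 <(sort kubelet_port.txt) <(sort kubeproxy_port.txt)
```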
Context Information
As stated in issue #22, there was a known issue in K3S (https://github.com/k3s-io/k3s/issues/2262) where duplicated metrics are emitted by the three components (kube-proxy, kube-scheduler and kube-controller-manager). The solution proposed by Rancher Monitoring was to avoid scraping the duplicated metrics by activating the service monitoring of only one of the components (i.e. kube-proxy). That solution was implemented (see https://github.com/ricsanfre/pi-cluster/issues/22#issuecomment-986224709) and it solved the main issue (out-of-memory).
The endpoints currently being scraped by Prometheus are:
Duplicated metrics
After a deeper analysis of the metrics scraped by Prometheus, it is clear that K3S is emitting duplicated metrics on all endpoints.
Example 1: API-server metrics emitted by the kube-proxy, kubelet and api-server endpoints running on the master server.
Example 2: kubelet metrics emitted by the kube-proxy, kubelet and api-server endpoints.
Example 3: kube-proxy metrics:

```
kubeproxy_sync_proxy_rules_duration_seconds_bucket{le="0.001"}
```