kubernetes-sigs / metrics-server

Scalable and efficient source of container resource metrics for Kubernetes built-in autoscaling pipelines.
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
Apache License 2.0

Metrics server with errors in Kubernetes cluster (Wrong scrape ip address) #1408

Closed Vasiliy-Basov closed 8 months ago

Vasiliy-Basov commented 9 months ago

The metrics server is logging errors:

E0111 12:19:15.534824       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.101.3.75:10250/metrics/resource\": dial tcp 10.101.3.75:10250: connect: connection refused" node="kubws-vt01"
I0111 12:19:20.209395       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

The following warnings also appear periodically:

invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu usage: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Source horizontal-pod-autoscaler
failed to get cpu usage: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Source horizontal-pod-autoscaler

It seems that the metrics server periodically attempts to fetch metrics from a different network adapter, using a different IP. How can I change this behavior and make it use the desired network adapter?

kubectl get nodes -o wide
NAME              STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
kubms-vt01        Ready    control-plane   57d   v1.28.3   172.18.27.52   <none>        Ubuntu 22.04.3 LTS   5.15.0-89-generic   containerd://1.7.8
kubms-vt02        Ready    control-plane   57d   v1.28.3   172.18.27.53   <none>        Ubuntu 22.04.3 LTS   5.15.0-89-generic   containerd://1.7.8
kubws-vt01        Ready    worker          57d   v1.28.3   172.18.27.55   <none>        Ubuntu 22.04.3 LTS   5.15.0-89-generic   containerd://1.7.8
sudo kubectl top nodes
NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
kubms-vt01       164m         4%     3746Mi          31%
kubms-vt02       264m         6%     8504Mi          71%
kubws-vt01       231m         5%     8388Mi          70%
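
For context, metrics-server does not pick a network adapter itself: it scrapes the address that the kubelet publishes on each Node object (node.status.addresses), choosing the first entry whose type matches --kubelet-preferred-address-types. A minimal way to check what a node is currently advertising (node name taken from the output above; the 30-second loop is only an illustrative way to catch intermittent changes):

kubectl get node kubws-vt01 -o jsonpath='{range .status.addresses[*]}{.type}={.address} {end}{"\n"}'

# repeat periodically to see whether the advertised InternalIP ever flips to a 10.101.3.x address
while true; do kubectl get node kubws-vt01 -o jsonpath='{range .status.addresses[*]}{.type}={.address} {end}{"\n"}'; sleep 30; done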

Deployment config:

spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: metrics-server
      version: v0.6.4
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: metrics-server
        version: v0.6.4
      name: metrics-server
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - metrics-server
              namespaces:
              - kube-system
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - args:
        - --logtostderr
        - --cert-dir=/tmp
        - --secure-port=10250
        - --kubelet-preferred-address-types=InternalIP
        - --kubelet-use-node-status-port
        - --kubelet-insecure-tls=true
        - --metric-resolution=15s
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          initialDelaySeconds: 40
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: metrics-server
        ports:
        - containerPort: 10250
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 40
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 100m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: tmp
      dnsPolicy: ClusterFirst
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: metrics-server
      serviceAccountName: metrics-server
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
      volumes:
      - emptyDir: {}
        name: tmp

I would like to understand why the metrics server obtains an incorrect IP address (an address belonging to another network adapter) even though the nodes' INTERNAL-IP is set to the correct address.

I tried setting the --kubelet-preferred-address-types option to different values, but without success. The cluster uses the Calico CNI.
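
For completeness, if the Node's advertised InternalIP itself ever flaps between adapters, one way to pin it would be the kubelet's --node-ip flag. A sketch for a kubeadm-managed Ubuntu node (the file path follows the kubeadm convention and the address is kubws-vt01's from the output above; both are assumptions to adjust for the real setup):

# /etc/default/kubelet (kubeadm's KUBELET_EXTRA_ARGS drop-in on Debian/Ubuntu; assumed path)
KUBELET_EXTRA_ARGS="--node-ip=172.18.27.55"

# restart the kubelet so the new address is picked up on the Node object
sudo systemctl restart kubelet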

/kind support

yangjunmyfm192085 commented 9 months ago

I'm a little confused. Is the following error always present?

E0111 12:19:15.534824 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.101.3.75:10250/metrics/resource\": dial tcp 10.101.3.75:10250: connect: connection refused" node="kubws-vt01"
I0111 12:19:20.209395 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

If that error were always present, then kubectl top nodes should not show the following line for that node:

kubws-vt01 231m 5% 8388Mi 70%

If you can get metrics for node kubws-vt01 using kubectl top nodes, it means that metrics-server is working normally.
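
(As a related quick check, not specific to this issue, the resource metrics API can also be queried directly to see whether metrics are currently being served:)

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"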

Vasiliy-Basov commented 9 months ago

The error does not occur all the time, only intermittently. I have multiple network adapters on the servers, and for some unknown reason metrics-server sometimes tries to fetch metrics using the wrong adapter, from another subnet. The question is, where does it get these addresses from? Could the issue be related to kube-vip? Is it possible to specify a particular adapter for metric retrieval? Here is the log from the metrics-server pod:

2024-01-22T11:48:43+03:00 I0122 08:48:43.894769       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-01-22T11:48:53+03:00 I0122 08:48:53.894926       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-01-22T11:49:03+03:00 I0122 08:49:03.894927       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-01-22T11:49:03+03:00 I0122 08:49:03.900175       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-01-22T11:53:34+03:00 E0122 08:53:34.841034       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.101.3.74:10250/metrics/resource\": dial tcp 10.101.3.74:10250: connect: connection refused" node="kubms-vt01"
2024-01-22T11:53:34+03:00 E0122 08:53:34.853509       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.101.3.76:10250/metrics/resource\": dial tcp 10.101.3.76:10250: connect: connection refused" node="kubms-vt02"
2024-01-22T16:08:49+03:00 E0122 13:08:49.844596       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.101.3.74:10250/metrics/resource\": dial tcp 10.101.3.74:10250: connect: connection refused" node="kubms-vt01"
2024-01-22T16:13:49+03:00 E0122 13:13:49.850523       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.101.3.75:10250/metrics/resource\": dial tcp 10.101.3.75:10250: connect: connection refused" node="kubws-vt01"
dgrisonnet commented 9 months ago

/triage accepted
/assign @yangjunmyfm192085

Vasiliy-Basov commented 9 months ago

"I suspect that the problem is related to Calico, probably with this daemonset setting:

        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
calicoctl get nodes -o wide
NAME              ASN       IPV4             IPV6   
kubms-vt01   (64512)   172.18.27.52/23    
kubms-vt02   (64512)   10.101.3.76/24     
kubws-vt01   (64512)   172.18.27.55/23

After reconfiguring the network on the nodes via netplan, disabling the adapters in the 10.101.3.x subnet, and restarting the Calico pods, the errors are no longer present.
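
For reference, instead of disabling the extra adapters it should also be possible to pin Calico's autodetection to the node network, either by interface name or, if the Calico version supports it, by CIDR. A sketch against the calico-node DaemonSet env above; the 172.18.26.0/23 CIDR is only inferred from the /23 node addresses and the interface regex is an assumption, so both need to match the real node network:

        - name: IP
          value: autodetect
        - name: IP_AUTODETECTION_METHOD
          value: cidr=172.18.26.0/23   # or: interface=ens.* (regex; interface name is an assumption)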

yangjunmyfm192085 commented 9 months ago

"I suspect that the problem is related to Calico, probably with this daemonset setting:

        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
calicoctl get nodes -o wide
NAME              ASN       IPV4             IPV6   
kubms-vt01   (64512)   172.18.27.52/23    
kubms-vt02   (64512)   10.101.3.76/24     
kubws-vt01   (64512)   172.18.27.55/23

After reconfiguring the network on the nodes via netplan, disabling adapters from the 10th subnet, and restarting Calico pods, the errors are no longer present."

Does this display occasionally appear when executing the kubectl get nodes -o wide command?

Vasiliy-Basov commented 9 months ago

When running kubectl get nodes -o wide, I never saw such output; the INTERNAL-IP addresses were always in the correct subnet.