kubernetes-sigs / metrics-server

Scalable and efficient source of container resource metrics for Kubernetes built-in autoscaling pipelines.
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
Apache License 2.0

metrics-server fails to get metrics from recently started nodes #1571

Open IuryAlves opened 2 months ago

IuryAlves commented 2 months ago

What happened:

metrics-server fails to scrape recently started nodes with the following error:

```
E0916 14:23:37.254021 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.34.50.99:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-50-99.eu-west-1.compute.internal"
```

What you expected to happen:

metrics-server successfully scrapes metrics from all nodes, including recently started ones.

Anything else we need to know?:

This problem happens when the autoscaler (we use Karpenter) adds or removes nodes. For a brief period, a newly started node fails to serve metrics on its /metrics/resource endpoint, causing the HPA to emit many FailedToGetResourceMetric events.
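For reference, the failure should be reproducible outside metrics-server: during that window the kubelet's TLS handshake itself fails, before any authentication happens. A minimal check (a sketch; the node IP is taken from the error above, and the command can run from anywhere with network access to the node):

```sh
# The handshake aborts with the same "internal error" TLS alert that
# metrics-server logs. Note that -k only skips client-side certificate
# verification, so a failure here is server-side, consistent with the
# kubelet not yet having a serving certificate.
curl -vk https://10.34.50.99:10250/metrics/resource
```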

Environment:

Note: This issue is not network related

Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.12-eks-2f46c53
spoiler for Metrics Server manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "6"
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-01-31T14:48:01Z"
  generation: 6
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.2
    application: none
    customer-level: none
    environment: dev
    helm.sh/chart: metrics-server-7.2.14
    owner: squad-platform
  name: metrics-server
  namespace: kube-system
  resourceVersion: "386279220"
  uid: ed6fd84e-fdb6-46e5-b25c-80a09135476f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: metrics-server
      app.kubernetes.io/name: metrics-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2023-07-17T14:25:57+02:00"
        prometheus.io/path: /metrics
        prometheus.io/port: "8443"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: metrics-server
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: metrics-server
        app.kubernetes.io/version: 0.7.2
        application: none
        customer-level: none
        environment: dev
        helm.sh/chart: metrics-server-7.2.14
        owner: squad-platform
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/instance: metrics-server
                  app.kubernetes.io/name: metrics-server
              topologyKey: kubernetes.io/hostname
            weight: 1
      automountServiceAccountToken: true
      containers:
      - args:
        - --secure-port=8443
        - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
        - --metric-resolution=20s
        command:
        - metrics-server
        image: docker.io/bitnami/metrics-server:0.7.2-debian-12-r3
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: metrics-server
        ports:
        - containerPort: 8443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 150m
            ephemeral-storage: 2Gi
            memory: 192Mi
          requests:
            cpu: 100m
            ephemeral-storage: 50Mi
            memory: 128Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsGroup: 1001
          runAsNonRoot: true
          runAsUser: 1001
          seLinuxOptions: {}
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: empty-dir
          subPath: tmp-dir
        - mountPath: /opt/bitnami/metrics-server/apiserver.local.config
          name: empty-dir
          subPath: app-tmp-dir
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        fsGroupChangePolicy: Always
      serviceAccount: metrics-server
      serviceAccountName: metrics-server
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: empty-dir
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2023-01-31T14:48:01Z"
    lastUpdateTime: "2024-09-16T18:56:36Z"
    message: ReplicaSet "metrics-server-75fb4689b7" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-09-17T06:56:46Z"
    lastUpdateTime: "2024-09-17T06:56:46Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 6
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
```
spoiler for Kubelet config:

```json
{
  "kubeletconfig": {
    "enableServer": true,
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCipherSuites": [
      "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
      "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
      "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
      "TLS_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_RSA_WITH_AES_128_GCM_SHA256"
    ],
    "serverTLSBootstrap": true,
    "authentication": {
      "x509": { "clientCAFile": "/etc/kubernetes/pki/ca.crt" },
      "webhook": { "enabled": true, "cacheTTL": "2m0s" },
      "anonymous": { "enabled": false }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": { "cacheAuthorizedTTL": "5m0s", "cacheUnauthorizedTTL": "30s" }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 50,
    "eventBurst": 100,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": ["172.31.0.10"],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupRoot": "/",
    "cgroupsPerQOS": true,
    "cgroupDriver": "systemd",
    "cpuManagerPolicy": "none",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "None",
    "topologyManagerPolicy": "none",
    "topologyManagerScope": "container",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "hairpin-veth",
    "maxPods": 58,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 50,
    "kubeAPIBurst": 100,
    "serializeImagePulls": false,
    "evictionHard": {
      "memory.available": "100Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "protectKernelDefaults": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "featureGates": { "RotateKubeletServerCertificate": true },
    "failSwapOn": true,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "kubeReserved": { "cpu": "90m", "ephemeral-storage": "1Gi", "memory": "893Mi" },
    "systemReservedCgroup": "/system",
    "kubeReservedCgroup": "/runtime",
    "enforceNodeAllocatable": ["pods"],
    "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",
    "providerID": "aws:///eu-west-1a/i-02487199a514a5c47",
    "logging": {
      "format": "text",
      "flushFrequency": "5s",
      "verbosity": 2,
      "options": { "json": { "infoBufferSize": "0" } }
    },
    "enableSystemLogHandler": true,
    "enableSystemLogQuery": false,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.9,
    "registerNode": true,
    "localStorageCapacityIsolation": true,
    "containerRuntimeEndpoint": "unix:///run/containerd/containerd.sock"
  }
}
```
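Possibly relevant: the kubelet config above sets serverTLSBootstrap: true (with the RotateKubeletServerCertificate feature gate), which means each new node requests its serving certificate through a CertificateSigningRequest and cannot complete TLS handshakes on port 10250 until that CSR is approved. A quick way to check whether freshly launched nodes are stuck in that window (a sketch; the CSR name in the approve command is hypothetical):

```sh
# List kubelet serving CSRs; a node that just joined shows up here as
# Pending until an approver controller (or a human) approves it.
kubectl get csr --field-selector spec.signerName=kubernetes.io/kubelet-serving

# One-off manual approval of a pending CSR (hypothetical name); in
# practice a CSR-approver controller is what keeps this window short.
kubectl certificate approve csr-abc123
```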
spoiler for Metrics Server logs:

```
I0917 06:56:25.141023 1 serving.go:374] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0917 06:56:29.056085 1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
I0917 06:56:29.260891 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0917 06:56:29.260914 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0917 06:56:29.260949 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0917 06:56:29.260975 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0917 06:56:29.260993 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0917 06:56:29.260998 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0917 06:56:29.261280 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key"
I0917 06:56:29.261308 1 secure_serving.go:213] Serving securely on [::]:8443
I0917 06:56:29.261348 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0917 06:56:29.541673 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0917 06:56:29.543057 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0917 06:56:29.543068 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
...remote error: tls: internal error" node="ip-10-34-40-218.eu-west-1.compute.internal"
E0913 07:11:41.371925 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.36.55:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-36-55.eu-west-1.compute.internal"
E0913 07:15:41.362052 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.37.167:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-37-167.eu-west-1.compute.internal"
E0913 07:19:41.367676 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.34.76:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-34-76.eu-west-1.compute.internal"
E0913 07:23:41.376918 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.43.160:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-43-160.eu-west-1.compute.internal"
E0913 07:26:41.376301 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.241:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-241.eu-west-1.compute.internal"
E0913 07:30:41.339715 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.44.219:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-44-219.eu-west-1.compute.internal"
E0913 07:33:41.354489 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.47.203:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-47-203.eu-west-1.compute.internal"
E0913 07:38:41.359856 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.45.209:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-45-209.eu-west-1.compute.internal"
E0913 07:40:41.350880 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.44.2:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-44-2.eu-west-1.compute.internal"
E0913 07:42:41.372912 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.99:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-99.eu-west-1.compute.internal"
E0913 07:45:41.374758 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.27:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-27.eu-west-1.compute.internal"
```
spoiler for Status of Metrics API:

```
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              app.kubernetes.io/version=0.7.2
              application=none
              customer-level=none
              environment=dev
              helm.sh/chart=metrics-server-7.2.14
              owner=squad-platform
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: kube-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2023-01-31T14:48:01Z
  Resource Version:    386279218
  UID:                 7503894f-8f1f-4f61-9df8-a663cdd0298d
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:       metrics-server
    Namespace:  kube-system
    Port:       443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2024-09-17T06:56:46Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:
```

/kind bug

dgrisonnet commented 2 months ago

/triage accepted
/kind support