kubernetes-sigs / metrics-server

Scalable and efficient source of container resource metrics for Kubernetes built-in autoscaling pipelines.
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
Apache License 2.0

EKS Fargate Metrics-server fails to scrape itself #1422

Open Paddy-CH opened 7 months ago

Paddy-CH commented 7 months ago

What happened: Logs from the metrics-server pod show this repeatedly: E0216 11:45:59.265624 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.6.194.69:10250/metrics/resource\": dial tcp 10.6.194.69:10250: connect: connection refused" node="fargate-ip-10-6-194-69.eu-west-2.compute.internal"

What you expected to happen: To be able to scrape itself.

Anything else we need to know?: The secure port and container port are set to 4443. If I change them to 10250, as the call requires, the error changes to 'forbidden'. I also get 'error: Metrics API not available' from kubectl when I try to access it.

Environment:

spoiler for Metrics Server manifest:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"
  name: system:aggregated-metrics-reader
rules:
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    k8s-app: metrics-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        image: registry.k8s.io/metrics-server/metrics-server:v0.7.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
        name: metrics-server
        ports:
        - containerPort: 4443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 100m
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /tmp
          name: tmp-dir
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      serviceAccountName: metrics-server
      volumes:
      - emptyDir: {}
        name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100

spoiler for Kubelet config:
spoiler for Metrics Server logs:
spoiler for Status of Metrics API:
```sh
kubectl describe apiservice v1beta1.metrics.k8s.io
```
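The status output was not included in the report. As a rough sketch of what to look for, the `Available` condition on the APIService object usually explains a "Metrics API not available" message from kubectl. The helper below reads that condition from the JSON form of the object (for example from `kubectl get apiservice v1beta1.metrics.k8s.io -o json`). The function name and the sample object are hypothetical and are not output from the reporter's cluster.

```python
from typing import Tuple


def apiservice_availability(apiservice: dict) -> Tuple[bool, str]:
    """Return (is_available, detail) from an APIService object's status.

    `apiservice` is the parsed JSON of an APIService resource.
    """
    conditions = apiservice.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Available":
            ok = cond.get("status") == "True"
            # Join reason and message, skipping whichever is missing.
            parts = [cond.get("reason", ""), cond.get("message", "")]
            detail = ": ".join(p for p in parts if p)
            return ok, detail
    return False, "no Available condition reported"


# Hypothetical sample object, for illustration only.
sample = {
    "status": {
        "conditions": [
            {
                "type": "Available",
                "status": "False",
                "reason": "ExampleReason",
                "message": "example message",
            }
        ]
    }
}
result = apiservice_availability(sample)
```

In practice the dict would come from `json.load(sys.stdin)` with the kubectl output piped in.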

/kind bug

yangjunmyfm192085 commented 7 months ago

"Failed to scrape node" err="Get \"https://10.6.194.69:10250/metrics/resource\"": This error means that metrics-server hit an exception while accessing the kubelet's metrics/resource endpoint. Please check whether a firewall blocks access to kubelet port 10250, or whether the kubelet is listening on a port other than 10250.

Paddy-CH commented 7 months ago

Hi, initially I had it set to 4443. When I saw the error I changed it to 10250. After that, the error changed to a 'forbidden' error when trying to scrape itself, and kubectl returned 'Metrics API not available'.

yangjunmyfm192085 commented 7 months ago

Could you use the command kubectl get node fargate-ip-10-6-194-69.eu-west-2.compute.internal -oyaml to check the value of kubeletEndpoint?

Paddy-CH commented 7 months ago

It returns:

daemonEndpoints:
  kubeletEndpoint:
    Port: 10250
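To make the port confusion concrete: `--secure-port` is the port metrics-server itself listens on, while the scrape target is built from the node object. With `--kubelet-use-node-status-port`, the port comes from the `kubeletEndpoint` value shown above. The sketch below is a simplified illustration of that lookup, not metrics-server's actual code; the function name is hypothetical, and the `addresses` entry in the sample node is an assumption inferred from the IP in the log line.

```python
def kubelet_scrape_url(
    node: dict,
    preferred_types=("InternalIP", "ExternalIP", "Hostname"),
    use_node_status_port: bool = True,
    default_port: int = 10250,
) -> str:
    """Build the kubelet resource-metrics URL for a node object (simplified)."""
    status = node.get("status", {})

    # Pick the first address whose type appears earliest in preferred_types.
    host = None
    for wanted in preferred_types:
        for addr in status.get("addresses", []):
            if addr.get("type") == wanted:
                host = addr["address"]
                break
        if host is not None:
            break
    if host is None:
        raise ValueError("node has no address of a preferred type")

    # Use the port the node reports, if asked to; otherwise the default.
    port = default_port
    if use_node_status_port:
        endpoint = status.get("daemonEndpoints", {}).get("kubeletEndpoint", {})
        port = endpoint.get("Port", default_port)

    return f"https://{host}:{port}/metrics/resource"


# Sample node: the address entry is assumed, the port is from the thread.
node = {
    "status": {
        "addresses": [{"type": "InternalIP", "address": "10.6.194.69"}],
        "daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}},
    }
}
url = kubelet_scrape_url(node)
```

Under these assumptions the result matches the URL in the error log, so changing `--secure-port` would not be expected to change where the scrape request goes.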

yangjunmyfm192085 commented 7 months ago

Hi @Paddy-CH, it looks like metrics-server cannot reach the kubelet's port 10250 in this EKS environment. This does not appear to be an issue with metrics-server itself. Could you also check the security policy of the environment?
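One way to narrow this down is a plain TCP probe from a pod in the same network. A "refused" result usually means the host answered but nothing accepted the connection on that port, while a silently dropped packet (a common firewall or security group behaviour) tends to show up as a timeout. The following is a minimal sketch using only the Python standard library; the function name is hypothetical.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and classify the outcome.

    Returns 'open', 'refused', 'timeout', or 'error: <details>'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer actively rejected the connection (no listener on the port).
        return "refused"
    except socket.timeout:
        # No answer within the timeout (packets may be dropped on the way).
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example call, using the address and port from the log in this issue:
# probe_tcp("10.6.194.69", 10250)
```

This only checks TCP reachability; it says nothing about TLS or authorization, so a 'forbidden' response would need to be investigated separately.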

yangjunmyfm192085 commented 7 months ago

/kind support

yangjunmyfm192085 commented 7 months ago

/remove-kind bug

dashpole commented 7 months ago

/assign @yangjunmyfm192085
/triage accepted

honarkhah commented 5 months ago

Related to https://github.com/aws/containers-roadmap/issues/1798