Open dkelim1 opened 7 months ago
Hi honarkhah,
So this means it is related to the above issue, and there is no fix or workaround until #1798 is fixed?
Or can we only integrate VPA or HPA with the OpenTelemetry Collector or Prometheus instead? Thanks.
/kind support
/triage accepted
Hello,
We are having this same issue. The only workaround I have found is to run metrics-server on EC2 rather than Fargate. When running metrics-server on EC2, there are no issues or errors in the logs.
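If it helps anyone trying the same workaround, you can confirm where the metrics-server pods actually land with something like this (a sketch; the label selector below assumes the Helm chart's `app.kubernetes.io/name=metrics-server` label, so use `k8s-app=metrics-server` if you deployed the upstream components.yaml):

```sh
# Which node is each metrics-server pod scheduled onto?
kubectl -n kube-system get pods -l app.kubernetes.io/name=metrics-server -o wide

# EKS labels Fargate-backed nodes with eks.amazonaws.com/compute-type=fargate,
# so this shows at a glance which nodes are Fargate and which are EC2
kubectl get nodes -L eks.amazonaws.com/compute-type
```

Keep in mind that on EKS the Fargate profile selectors decide whether a kube-system pod goes to Fargate, so "run it on EC2" in practice means making sure metrics-server does not match any Fargate profile and that an EC2 node group is available for it.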
Same question for k3s:
# kubectl logs -n kube-system metrics-server-79f66dff9d-5sflh --tail 300 -f
Error from server: Get "https://10.1.4.13:10250/containerLogs/kube-system/metrics-server-79f66dff9d-5sflh/metrics-server?follow=true&tailLines=300": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.1.4.13
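If you want to confirm what the kubelet's serving certificate is actually valid for (i.e. that it really only lists 127.0.0.1), something along these lines will dump its SANs; 10.1.4.13 is just the node IP from the error above, so substitute your own:

```sh
# Print the Subject Alternative Names of the certificate served on the kubelet port
openssl s_client -connect 10.1.4.13:10250 </dev/null 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -A1 "Subject Alternative Name"
```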
Facing the same on EC2. Any workaround?
I had the same issue while I was upgrading my cluster, and the issue practically solved itself.
Tl;dr: check if you have any other clusters using similar fargate profiles, and if they're on different versions, upgrade them to match. I had dev and prod. Even though they're completely different clusters (with their own nodes and fargate profiles), they broke each other (and fixed each other). I'm still not quite sure what caused my issue, but I'm leaving my story below in case it helps someone else.
long version:
I have two clusters, dev and prod; both were on 1.25 using Fargate, with a working metrics-server (it had been working fine since 2020), and I needed to bring both to the latest EKS. Yesterday, I initially updated my dev cluster to 1.26 and then restarted all deployments. I noticed that after the update, my nodes' kubelet remained on 1.25, even after several restarts, which was odd. I decided to update my addons, so I updated coredns to v1.9.3-eksbuild.11, and then also decided to update my metrics-server. That's when I noticed that the restarted metrics-server was unable to scrape itself. I spent two hours trying to understand why (which is how I found this issue). I thought maybe it was the version of metrics-server, so I downgraded back to 3.8 (from 3.12), but it was still broken, unable to scrape itself.
It was odd, because the issue came out of nowhere; I had never had issues with metrics-server on Fargate before, and it had always been able to scrape the Fargate nodes. This was obviously preventing me from upgrading prod, since I couldn't upgrade prod if the broken metrics-server was caused by the upgrade. I decided to check how metrics-server was doing on Fargate in prod, and was surprised to see that it was also broken there! Baffling, because the two clusters and their Fargate profiles are (should be?!) completely separate. I checked prod's version, and it was still 1.25 as expected. For some reason, all my Fargate nodes (on prod) had restarted, and that's when my metrics-server problems started. I decided to go ahead and update prod to 1.26, and voila, metrics-server suddenly started working on both clusters, dev and prod. I'm still not sure why...
I've now upgraded dev and prod to 1.29, and metrics-server is still working well.
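In case anyone wants to double-check the same thing on their side, the kubelet version each node is actually running is visible straight from the API, which is how you can confirm whether the Fargate nodes have really picked up a new release after an upgrade:

```sh
# Kubelet version per node; Fargate nodes typically only show a new version
# once their pods have been rescheduled onto fresh Fargate capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
```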
I'm using AWS EKS Fargate v1.29.
After I downgraded metrics-server to v0.6.4, it worked normally with the following manifest:
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
k8s-app: metrics-server
name: metrics-server
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
k8s-app: metrics-server
rbac.authorization.k8s.io/aggregate-to-admin: "true"
rbac.authorization.k8s.io/aggregate-to-edit: "true"
rbac.authorization.k8s.io/aggregate-to-view: "true"
name: system:aggregated-metrics-reader
rules:
- apiGroups:
- metrics.k8s.io
resources:
- pods
- nodes
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
k8s-app: metrics-server
name: system:metrics-server
rules:
- apiGroups:
- ""
resources:
- nodes/metrics
verbs:
- get
- apiGroups:
- ""
resources:
- pods
- nodes
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
labels:
k8s-app: metrics-server
name: metrics-server-auth-reader
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
name: metrics-server
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
k8s-app: metrics-server
name: metrics-server:system:auth-delegator
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:auth-delegator
subjects:
- kind: ServiceAccount
name: metrics-server
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
k8s-app: metrics-server
name: system:metrics-server
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:metrics-server
subjects:
- kind: ServiceAccount
name: metrics-server
namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: metrics-server
name: metrics-server
namespace: kube-system
spec:
ports:
- name: https
port: 443
protocol: TCP
targetPort: https
selector:
k8s-app: metrics-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
k8s-app: metrics-server
name: metrics-server
namespace: kube-system
spec:
selector:
matchLabels:
k8s-app: metrics-server
strategy:
rollingUpdate:
maxUnavailable: 0
template:
metadata:
labels:
k8s-app: metrics-server
spec:
containers:
- args:
- --cert-dir=/tmp
- --secure-port=4443
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
- --kubelet-use-node-status-port
- --metric-resolution=15s
image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /livez
port: https
scheme: HTTPS
periodSeconds: 10
name: metrics-server
ports:
- containerPort: 4443
name: https
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /readyz
port: https
scheme: HTTPS
initialDelaySeconds: 20
periodSeconds: 10
resources:
requests:
cpu: 100m
memory: 200Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
volumeMounts:
- mountPath: /tmp
name: tmp-dir
nodeSelector:
kubernetes.io/os: linux
priorityClassName: system-cluster-critical
serviceAccountName: metrics-server
volumes:
- emptyDir: {}
name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
labels:
k8s-app: metrics-server
name: v1beta1.metrics.k8s.io
spec:
group: metrics.k8s.io
groupPriorityMinimum: 100
insecureSkipTLSVerify: true
service:
name: metrics-server
namespace: kube-system
version: v1beta1
versionPriority: 100
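After applying the manifest, a quick sanity check that the downgraded metrics-server is actually serving metrics again (assuming the deployment name and namespace above):

```sh
kubectl -n kube-system rollout status deployment/metrics-server
kubectl top nodes
kubectl top pods -n kube-system
```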
What happened:
Logs from the metrics-server pod show this repeatedly:
E0410 22:04:01.247686 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.124.4.238:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.124.4.238" node="fargate-ip-10-124-4-238.ap-southeast-1.compute.internal"
E0410 22:04:16.201141 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.124.4.238:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.124.4.238" node="fargate-ip-10-124-4-238.ap-southeast-1.compute.internal"
E0410 22:04:31.201853 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.124.4.238:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.124.4.238" node="fargate-ip-10-124-4-238.ap-southeast-1.compute.internal"
E0410 22:04:46.277913 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.124.4.238:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.124.4.238" node="fargate-ip-10-124-4-238.ap-southeast-1.compute.internal"
What you expected to happen: metrics-server should be able to scrape itself.
Anything else we need to know?:
1) Initially I was using the metrics-server that came with VPA. Errors similar to the above appeared.
2) Later I switched to the metrics-server installed directly from eks_blueprints_kubernetes_addons. Errors similar to the above appeared.
3) Tried upgrading metrics-server from version 0.6.x to 0.7.x. Errors similar to the above appeared.
4) Tried to bypass the certificate check by passing '--kubelet-insecure-tls'.
However, the following errors appear instead:
E0410 22:13:28.928630 1 scraper.go:149] "Failed to scrape node" err="request failed, status: \"403 Forbidden\"" node="fargate-ip-10-124-4-186.ap-southeast-1.compute.internal"
E0410 22:13:43.827793 1 scraper.go:149] "Failed to scrape node" err="request failed, status: \"403 Forbidden\"" node="fargate-ip-10-124-4-186.ap-southeast-1.compute.internal"
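(Side note for anyone debugging the same thing: a 403 with --kubelet-insecure-tls means the request reaches the kubelet but is rejected at the authorization step rather than at TLS. One thing worth checking, as a sketch assuming the default metrics-server service account in kube-system, is whether that service account is allowed the resources the kubelet authorizes scrapes against:)

```sh
# The kubelet typically authorizes scrape requests via SubjectAccessReview;
# both of these should return "yes" for a working metrics-server setup
kubectl auth can-i get nodes/metrics \
  --as=system:serviceaccount:kube-system:metrics-server
kubectl auth can-i get nodes/stats \
  --as=system:serviceaccount:kube-system:metrics-server
```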
Environment:
spoiler for Metrics Server manifest:
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-04-10T21:48:44Z"
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.1
    helm.sh/chart: metrics-server-3.12.1
  name: metrics-server
  namespace: kube-system
  resourceVersion: "1044967"
  uid: bbd89fdf-d933-4fd3-9bfa-2c8351bc9159
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-04-10T21:48:44Z"
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.1
    helm.sh/chart: metrics-server-3.12.1
  name: metrics-server
  namespace: kube-system
  resourceVersion: "1044976"
  uid: fe68eb2b-9ecf-4c57-996e-6836955f614c
spec:
  clusterIP: 172.20.20.200
  clusterIPs:
  - 172.20.20.200
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/name: metrics-server
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-04-10T21:48:44Z"
  generation: 3
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.1
    helm.sh/chart: metrics-server-3.12.1
  name: metrics-server
  namespace: kube-system
  resourceVersion: "1048455"
  uid: 51c7e198-d10b-4ec4-b96d-69e151de778b
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: metrics-server
      app.kubernetes.io/name: metrics-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: metrics-server
        app.kubernetes.io/name: metrics-server
    spec:
      containers:
      - args:
        - --secure-port=10250
        - --cert-dir=/tmp
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        - --kubelet-insecure-tls
        image: registry.k8s.io/metrics-server/metrics-server:v0.7.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: metrics-server
        ports:
        - containerPort: 10250
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: tmp
      dnsPolicy: ClusterFirst
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: metrics-server
      serviceAccountName: metrics-server
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: tmp
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-04-10T21:50:06Z"
    lastUpdateTime: "2024-04-10T21:50:06Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-04-10T21:48:44Z"
    lastUpdateTime: "2024-04-10T22:13:47Z"
    message: ReplicaSet "metrics-server-578bc9bf64" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-04-10T21:48:44Z"
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.1
    helm.sh/chart: metrics-server-3.12.1
  name: v1beta1.metrics.k8s.io
  resourceVersion: "1048453"
  uid: 84cc08c7-27bc-4a4e-a7b8-efcd7b428ea2
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
    port: 443
  version: v1beta1
  versionPriority: 100
status:
  conditions:
  - lastTransitionTime: "2024-04-10T21:48:44Z"
    message: 'failing or missing response from https://10.124.4.186:10250/apis/metrics.k8s.io/v1beta1: bad status from https://10.124.4.186:10250/apis/metrics.k8s.io/v1beta1: 404'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
spoiler for Kubelet config:
spoiler for Metrics Server logs:
spoiler for Status of Metrics API:
```sh
kubectl describe apiservice v1beta1.metrics.k8s.io
```
/kind bug