kubernetes-sigs / metrics-server

Scalable and efficient source of container resource metrics for Kubernetes built-in autoscaling pipelines.
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
Apache License 2.0

[EKS] unable to fetch metrics from Kubelet #129

Closed sc-rz closed 6 years ago

sc-rz commented 6 years ago

Hi,

I am testing HPA on the recently released Amazon EKS, but I'm running into an issue where metrics-server fails to reach the node.

(actual IP redacted)

$ kubectl logs -l app=metrics-server -n kube-system
...
E0901 04:09:10.815694       1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-aa-bb-cc-dd.ec2.internal: unable to fetch metrics from Kubelet ip-aa-bb-cc-dd.ec2.internal (ip-aa-bb-cc-dd.ec2.internal): Get https://ip-aa-bb-cc-dd.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-aa-bb-cc-dd.ec2.internal on 10.100.0.10:53: no such host, unable to fully scrape metrics from source 
$ kubectl get nodes
NAME                             STATUS    ROLES     AGE       VERSION
ip-aa-bb-cc-dd.ec2.internal   Ready     <none>    1h        v1.10.3
$ kubectl describe node 
...
Addresses:
  InternalIP:  aa.bb.cc.dd
  Hostname:    ip-aa-bb-cc-dd.ec2.internal

I am using v0.3 after running kubectl apply -f metrics-server/deploy/1.8+/ on commit 931ef8402ac7e9545156041e4479a02b055c0ab4

Do I need to configure something?

Thanks

sc-rz commented 6 years ago

Nevermind, this was an issue with my VPC DNS resolution

dijeesh commented 6 years ago

Same here.

I manually set the image to metrics-server-amd64:v0.3.0 in metrics-server-deployment.yaml and deployed it, but the logs show:

$ kubectl logs metrics-server-754478c688-j5ckq -n kube-system
I0901 03:49:30.403514       1 serving.go:273] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
W0901 03:49:30.723508       1 authentication.go:166] cluster doesn't provide client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication to extension api-server won't work.
W0901 03:49:30.732733       1 authentication.go:210] cluster doesn't provide client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication to extension api-server won't work.
[restful] 2018/09/01 03:49:30 log.go:33: [restful/swagger] listing is available at https://:443/swaggerapi
[restful] 2018/09/01 03:49:30 log.go:33: [restful/swagger] https://:443/swaggerui/ is mapped to folder /swagger-ui/
I0901 03:49:30.778391       1 serve.go:96] Serving securely on [::]:443

And HPA is still showing:

Warning FailedGetResourceMetric 4m (x191 over 1h) horizontal-pod-autoscaler unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

sc-rz commented 6 years ago

I am also still unable to get HPA working. I ran kubectl describe apiservice v1beta1.metrics.k8s.io and I'm seeing the same errors as in https://github.com/kubernetes-incubator/metrics-server/issues/45
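As a quick sanity check (not part of the original comment, just a common way to see whether the aggregated metrics API is serving at all), you can query it directly:

  # Check the APIService registration and hit the metrics API directly
  kubectl get apiservice v1beta1.metrics.k8s.io
  kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"

If the APIService shows Available=False or the raw request returns a service-unavailable error, the aggregation path (apiserver to metrics-server) is the problem, independent of kubelet scraping.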

sc-rz commented 6 years ago

Figured out my issue -- my worker node security group was misconfigured. I had to add an inbound rule to allow HTTPS (port 443) traffic from the control plane security group.
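For reference, a minimal sketch of that security group change with the AWS CLI; the group IDs below are placeholders, not values from this thread:

  # Allow HTTPS from the control plane security group into the worker node security group
  aws ec2 authorize-security-group-ingress \
    --group-id sg-WORKER-NODES \
    --protocol tcp --port 443 \
    --source-group sg-CONTROL-PLANE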

dijeesh commented 6 years ago

I just added an inbound rule for port 443 from the control plane security group and it looks like it's working now. Thanks @sc-rz

LucasSales commented 6 years ago

The solution proposed by @MIBc works. Change the metrics-server-deployment.yaml file and add:

command:
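The exact flags @MIBc proposed aren't quoted above, but based on the rest of this thread (InternalIP as the preferred address type, optionally skipping kubelet TLS verification), the container spec would look roughly like this; treat the flag list as an assumption, not a quote:

  command:
    - /metrics-server
    - --kubelet-preferred-address-types=InternalIP
    - --kubelet-insecure-tls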

zhangzhaorui commented 6 years ago

Nevermind, this was an issue with my VPC DNS resolution

Hi! My metrics-server pod has the same error:

E1026 07:37:04.007899 1 reststorage.go:144] unable to fetch pod metrics for pod dev-java/csg-application-68584c6b66-c65k9: no metrics known for pod E1026 07:37:34.022311 1 reststorage.go:144] unable to fetch pod metrics for pod dev-java/csg-application-68584c6b66-c65k9: no metrics known for pod E1026 07:37:38.242410 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-001: unable to fetch metrics from Kubelet idc-k8snode-javaphp-001 (idc-k8snode-javaphp-001): Get https://idc-k8snode-javaphp-001:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-001 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8smaster-javaphp-001: unable to fetch metrics from Kubelet idc-k8smaster-javaphp-001 (idc-k8smaster-javaphp-001): Get https://idc-k8smaster-javaphp-001:10250/stats/summary/: dial tcp: lookup idc-k8smaster-javaphp-001 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-002: unable to fetch metrics from Kubelet idc-k8snode-javaphp-002 (idc-k8snode-javaphp-002): Get https://idc-k8snode-javaphp-002:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-002 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-003: unable to fetch metrics from Kubelet idc-k8snode-javaphp-003 (idc-k8snode-javaphp-003): Get https://idc-k8snode-javaphp-003:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-003 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8smaster-javaphp-002: unable to fetch metrics from Kubelet idc-k8smaster-javaphp-002 (idc-k8smaster-javaphp-002): Get https://idc-k8smaster-javaphp-002:10250/stats/summary/: dial tcp: lookup idc-k8smaster-javaphp-002 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-004: unable to fetch metrics from Kubelet idc-k8snode-javaphp-004 (idc-k8snode-javaphp-004): Get https://idc-k8snode-javaphp-004:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-004 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8smaster-javaphp-003: unable to fetch metrics from Kubelet idc-k8smaster-javaphp-003 (idc-k8smaster-javaphp-003): Get https://idc-k8smaster-javaphp-003:10250/stats/summary/: dial tcp: lookup idc-k8smaster-javaphp-003 on 10.96.0.10:53: no such host]

How did you solve it?!

GeekyTex commented 6 years ago

Thanks @LucasSales, this ended up fixing the issue for me as well. It looks like port 443 has since been added to the needed SGs, but I was still getting the following error in my metrics-server:

E1026 14:41:58.325491 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-0-166-28.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-166-28.ec2.internal (ip-10-0-166-28.ec2.internal): Get https://ip-10-0-166-28.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-166-28.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-135-135.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-135-135.ec2.internal (ip-10-0-135-135.ec2.internal): Get https://ip-10-0-135-135.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-135-135.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-146-30.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-146-30.ec2.internal (ip-10-0-146-30.ec2.internal): Get https://ip-10-0-146-30.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-146-30.ec2.internal on 172.20.0.10:53: no such host]

Adding the command above works. Not sure if the root issue is related to CNI or something else. Would be curious to know if anyone else hits this.

FWIW, my cluster was manually set up (still in early POC phase) and was built per the current AWS Getting Started docs.

kiahmed commented 5 years ago

I've been stuck with this issue for over a week and have tried all of the above. I tried @LucasSales's approach, but that gives a certificate error saying the cert was not created for that host IP, and the hosts in my cluster change. Port 443 is open, though, so I'm not sure why everybody is talking about that.

DirectXMan12 commented 5 years ago

@kiahmed basically, you need to tell metrics-server to connect to your nodes using a name or address that it can actually look up. So, by saying InternalIP, you're telling metrics-server not to use hostnames, but instead to use the internal IP address of the node. However, if the serving certificates on the Kubelet aren't valid for that IP, you'll get a certificate error.
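As a concrete illustration (the flag is real, but the ordering here is only an example and not quoted from the comment), the preference is set on the metrics-server container and can list fallbacks in order:

  command:
    - /metrics-server
    - --kubelet-preferred-address-types=InternalIP,InternalDNS,Hostname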

kiahmed commented 5 years ago

--kubelet-insecure-tls did the job, which is okay for now for a dev cluster. But even in prod, the metrics API is accessed through the main Kubernetes apiserver anyway, and that has its own CA and validation, so does it really matter?

DirectXMan12 commented 5 years ago

metrics-server doesn't talk to the nodes via the main API server -- it talks to them directly. Using --kubelet-insecure-tls means that someone could MITM the metrics-server <-> kubelet connection, unless you're using some sort of service mesh or what-have-you that provides its own auth.
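For production, the usual alternative to --kubelet-insecure-tls is to have the kubelets serve certificates signed by a CA that metrics-server trusts. A minimal sketch, assuming a cluster where kubelet serving-certificate bootstrap can be enabled (this is not spelled out anywhere in this thread):

  # KubeletConfiguration snippet: request a serving certificate from the cluster CA
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  serverTLSBootstrap: true

The resulting CertificateSigningRequests still have to be approved (kubectl certificate approve <csr-name>), and metrics-server can then be pointed at the cluster CA via --kubelet-certificate-authority and verify kubelet TLS normally.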

cdmurph32 commented 5 years ago

Nevermind, this was an issue with my VPC DNS resolution

I think I hit this issue as well, and it wasn't clear to me how VPC settings could break metrics server, besides NACLs. So just in case other people are broken because of their VPC configuration (not because of NACLs):

  1. The value of http://169.254.169.254/latest/meta-data/local-hostname is set from the VPC DHCP settings. https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html
  2. Kubernetes pods get their hostname from this ec2 instance metadata. This sets the node label kubernetes.io/hostname https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1244
  3. Metrics server by default uses this label as the hostname for the node (makes sense). https://github.com/kubernetes-incubator/metrics-server/blob/master/pkg/sources/summary/addrs.go#L23-L40
  4. If your DHCP settings are wrong (e.g. you override the defaults unintentionally through copy-paste errors in CloudFormation templates, or your custom domain isn't resolvable from within Kubernetes), metrics-server won't be able to get anything (a quick way to check is sketched after this list):

     unable to fully scrape metrics from source kubelet_summary:ip-10-68-234-200.us-west-2.compute.internal: unable to fetch metrics from Kubelet ip-10-68-234-200.us-west-2.compute.internal (ip-10-68-234-200.ec2.internal): Get https://ip-10-68-234-200.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-68-234-200.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-68-234-239.us-west-2.compute.internal: unable to fetch metrics from Kubelet ip-10-68-234-239.us-west-2.compute.internal (ip-10-68-234-239.ec2.internal): Get https://ip-10-68-234-239.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-68-234-239.ec2.internal on 172.20.0.10:53: no such host
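A rough way to check steps 1–4 above (the commands and the busybox image are assumptions; the node name is taken from the error message):

  # What hostname does the instance metadata report? (run on the node)
  curl -s http://169.254.169.254/latest/meta-data/local-hostname

  # Which DHCP options set is the VPC actually using? (ID is a placeholder)
  aws ec2 describe-dhcp-options --dhcp-options-ids dopt-0123456789abcdef0

  # Does that name resolve from inside the cluster?
  kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- \
    nslookup ip-10-68-234-200.us-west-2.compute.internal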

jitesh-prajapati123 commented 5 years ago

I am getting the following error.

E1214 06:23:17.408800 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-0-3-12.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-3-12.ec2.internal (ip-10-0-3-12.ec2.internal): Get https://ip-10-0-3-12.ec2.internal:10250/stats/summary/: dial tcp: i/o timeout, unable to fully scrape metrics from source kubelet_summary:ip-10-0-1-54.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-1-54.ec2.internal (ip-10-0-1-54.ec2.internal): Get https://ip-10-0-1-54.ec2.internal:10250/stats/summary/: dial tcp: i/o timeout]

When I curl https://ip-10-0-3-12.ec2.internal:10250/stats/summary/ it gives me the following:

SSL certificate problem: unable to get local issuer certificate curl: (60) SSL certificate problem: unable to get local issuer certificate
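To separate TLS trust problems from connectivity problems, a rough check (the service-account name and the kubectl create token subcommand are assumptions; the latter needs kubectl v1.24 or newer):

  # -k skips certificate verification, so a reachable-but-untrusted kubelet still answers
  TOKEN=$(kubectl -n kube-system create token metrics-server)
  curl -k -H "Authorization: Bearer ${TOKEN}" https://ip-10-0-3-12.ec2.internal:10250/stats/summary/

An i/o timeout from metrics-server combined with an immediate TLS error from curl usually points at network reachability from the metrics-server pod (security groups, NACLs) rather than certificates.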

jitesh-prajapati123 commented 5 years ago

Thanks @LucasSales, this ended up fixing the issue for me as well. It looks like port 443 has since been added to the needed SGs, but I was still getting the following error in my metrics-server:

E1026 14:41:58.325491 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-0-166-28.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-166-28.ec2.internal (ip-10-0-166-28.ec2.internal): Get https://ip-10-0-166-28.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-166-28.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-135-135.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-135-135.ec2.internal (ip-10-0-135-135.ec2.internal): Get https://ip-10-0-135-135.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-135-135.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-146-30.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-146-30.ec2.internal (ip-10-0-146-30.ec2.internal): Get https://ip-10-0-146-30.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-146-30.ec2.internal on 172.20.0.10:53: no such host]

Adding the command above works. Not sure if the root issue is related to CNI or something else. Would be curious to know if anyone else hits this.

FWIW, my cluster was manually set up (still in early POC phase) and was built per the current AWS Getting Started docs.

I have the same issue.

jairovm commented 5 years ago

Hi guys, I'm running metrics-server through a Helm chart on EKS and got all my HPAs working but one, see:

NAMESPACE       NAME                       REFERENCE                             TARGETS                        MINPODS   MAXPODS   REPLICAS   AGE
datateam        hpa1                    Deployment/hpa1                    15%/75%                        2         10        2          3h
default         hpa2                     Deployment/hpa2                     1%/75%                         2         10        2          21d
default         hpa3              Deployment/hpa3              596%/75%                       2         10        4          20d
nginx-ingress   nginx-ingress-controller   Deployment/nginx-ingress-controller   <unknown>/50%, <unknown>/50%   3         11        3          50m

The one that is not working is from another Helm chart, stable/nginx-ingress.

I have tried with --kubelet-insecure-tls and --kubelet-preferred-address-types=InternalIP without any luck.

kubectl top pods is working fine:

$ kubectl top pods -n nginx-ingress
NAME                                             CPU(cores)   MEMORY(bytes)
nginx-ingress-controller-6c54d8d8fd-hbnmf        3m           77Mi
nginx-ingress-controller-6c54d8d8fd-m8jb8        3m           76Mi
nginx-ingress-controller-6c54d8d8fd-xvm5d        4m           76Mi
nginx-ingress-default-backend-544cfb69fc-7zvnw   1m           2Mi

Let me know if you need more info, thanks.

Update:

I got the nginx-ingress-controller HPA to work by defining resources in my values.yaml file 😅

  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 128Mi

olereidar commented 5 years ago

I had the same issue. This solved my problem: https://stackoverflow.com/q/54106725/2291510

piyushkumar13 commented 5 years ago

@kiahmed and @DirectXMan12, referring to your comments https://github.com/kubernetes-incubator/metrics-server/issues/129#issuecomment-438448769 and https://github.com/kubernetes-incubator/metrics-server/issues/129#issuecomment-441808822: adding --kubelet-insecure-tls has worked for me, but is it fine to use this flag on a production cluster? If not, what needs to be done to make metrics-server work?

LucasSales commented 4 years ago

It is necessary to add the resources, for example:

  resources:
    limits:
      cpu: 500m
      memory: 254Mi
    requests:
      cpu: 1000m
      memory: 1G

lauer commented 4 years ago

Had the same problem. Solved it with this command:

helm upgrade --install metrics stable/metrics-server --namespace kube-system --set hostNetwork.enabled=true --set args={kubelet-insecure-tls}
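For reference, the equivalent chart values (assuming the stable/metrics-server chart layout; these keys are not quoted from the comment) would be roughly:

  # values.yaml for stable/metrics-server (assumed keys)
  hostNetwork:
    enabled: true
  args:
    - --kubelet-insecure-tls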

edrimon commented 5 months ago

Figured out my issue -- my worker node security group was misconfigured. I had to add an inbound rule to allow HTTPS (port 443) traffic from the control plane security group.

Thank you so much, that was it, networking/firewall issue