kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Failed to create AWS Manager: RequestError: send request failed (i/o timeout) #1860

Closed AESwrite closed 4 years ago

AESwrite commented 5 years ago

Hello, I'm currently using Kubernetes v1.12.5 and CA v1.12.3. The cluster was created with kubespray v2.8.3 (kubeadm enabled). Provider: AWS.

I'm using the standard cluster-autoscaler-one-asg.yaml example; I've modified only these lines:

containers:
  - image: k8s.gcr.io/cluster-autoscaler:v1.12.3

        - --nodes=1:4:k8s-worker-20190403143728147500000003-asg

    env:
      - name: AWS_REGION
        value: us-east-2

volumes:
  - name: ssl-certs
    hostPath:
      path: "/etc/ssl/certs/ca-bundle.crt"

I get this kind of error (the same in different versions of CA):

I0404 13:30:06.959934 1 leaderelection.go:227] successfully renewed lease kube-system/cluster-autoscaler
E0404 13:30:07.364127 1 aws_manager.go:153] Failed to regenerate ASG cache: RequestError: send request failed
caused by: Post https://autoscaling.us-east-2.amazonaws.com/: dial tcp: i/o timeout
F0404 13:30:07.364158 1 cloud_provider_builder.go:149] caused by: Post https://autoscaling.us-east-2.amazonaws.com/: dial tcp: i/o timeout

I tried CA v1.3.0 on Kubernetes v1.11.3 last week (the same YAML file, only a different CA version), and it worked. But today I get the timeout error even on that v1.11.3 configuration (I didn't change anything in it since last week).

How can I solve this issue? I would be glad of any help!

Update 1: the container with the autoscaler somehow can't reach the internet.

Jeffwan commented 5 years ago

/assign

Jeffwan commented 5 years ago

Can you run `kubectl get pods ${your_ca_pod} -o yaml` and check what the value of `dnsPolicy` is?
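
For reference, a minimal sketch of pulling out just that field; it assumes the pod runs in kube-system and carries the label `app=cluster-autoscaler`, so adjust the namespace and label to your own deployment:

```sh
# Print only the dnsPolicy of the cluster-autoscaler pod(s).
# Assumptions: namespace kube-system, label app=cluster-autoscaler.
kubectl -n kube-system get pods -l app=cluster-autoscaler \
  -o jsonpath='{.items[*].spec.dnsPolicy}'
```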

Is your cluster running in us-east-2? AWS_REGION is not required in v1.12.

The samples had some cleanup last week, but I didn't see any problems there.

AESwrite commented 5 years ago

kubectl get pods ${your_ca_pod} -o yaml

dnsPolicy: ClusterFirst

Yes, the cluster is running in us-east-2. I discovered that the pod somehow uses the default resolv.conf file:

; generated by /usr/sbin/dhclient-script
search us-east-2.compute.internal
nameserver 10.0.0.2

And when I added nameserver 8.8.8.8 to it on the master and worker, the CA started to work. I'm not sure if it's a solution or just a workaround (I don't think CA should use this file, because kubespray should write its own resolv.conf, so maybe it is a kubespray problem), but now I can google some similar cases and figure it out.
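
A minimal sketch of that node-level workaround, for the record; note that dhclient may regenerate the file on lease renewal, so this is not a durable fix:

```sh
# Append a public resolver to the node's resolv.conf (workaround only).
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf
```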

AESwrite commented 5 years ago

The problem is still relevant. Sometimes it works (in about 10% of cases), but most of the time it crashes with the timeout error. DNS settings and services look OK and work perfectly for other pods.
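
One way to sanity-check that is to compare the resolv.conf an ordinary pod receives in the same namespace; a sketch, assuming busybox:1.28 is pullable from your nodes:

```sh
# Launch a short-lived busybox pod under dnsPolicy: ClusterFirst and print
# the resolv.conf it is given, for comparison with the CA pod's behaviour.
kubectl -n kube-system run resolv-check --image=busybox:1.28 \
  --restart=Never --rm -it -- cat /etc/resolv.conf
```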

Jeffwan commented 5 years ago

Hi @AESwrite, sorry for the late response. Is there anything special about your VPC settings? If you can consistently reproduce this issue, there is probably a bug somewhere. I'd like to try to reproduce and fix it.

Siddharthk commented 5 years ago

I am also getting the same error. CA version: 1.12.3, AWS EKS version: 1.12.7.

I0529 12:05:53.849655 1 leaderelection.go:227] successfully renewed lease kube-system/cluster-autoscaler

I0529 12:05:55.942323 1 leaderelection.go:227] successfully renewed lease kube-system/cluster-autoscaler

E0529 12:05:56.036033 1 aws_manager.go:153] Failed to regenerate ASG cache: RequestError: send request failed

caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp 72.21.206.37:443: i/o timeout

F0529 12:05:56.036064 1 cloud_provider_builder.go:149] Failed to create AWS Manager: RequestError: send request failed

caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp 72.21.206.37:443: i/o timeout

@AESwrite @Jeffwan Can someone help here? It was working fine on EKS v1.11.5.

thinalai commented 5 years ago

> I am also getting the same error. CA version: 1.12.3, AWS EKS version: 1.12.7.
>
> I0529 12:05:53.849655 1 leaderelection.go:227] successfully renewed lease kube-system/cluster-autoscaler
> I0529 12:05:55.942323 1 leaderelection.go:227] successfully renewed lease kube-system/cluster-autoscaler
> E0529 12:05:56.036033 1 aws_manager.go:153] Failed to regenerate ASG cache: RequestError: send request failed
> caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp 72.21.206.37:443: i/o timeout
> F0529 12:05:56.036064 1 cloud_provider_builder.go:149] Failed to create AWS Manager: RequestError: send request failed
> caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp 72.21.206.37:443: i/o timeout
>
> @AESwrite @Jeffwan Can someone help here? It was working fine on EKS v1.11.5.

Check your CA pod's public network accessibility.
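
A quick sketch of such a check, resolving the autoscaling endpoint from inside the cluster (busybox:1.28 is assumed here because its nslookup output is well behaved; swap the hostname for your region):

```sh
# Resolve the AWS Auto Scaling endpoint from a throwaway pod to verify
# in-cluster DNS and outbound reachability.
kubectl run dns-test --image=busybox:1.28 --restart=Never --rm -it -- \
  nslookup autoscaling.us-east-1.amazonaws.com
```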

Jeffwan commented 5 years ago

/sig aws

lmansur commented 5 years ago

I was running into this problem in a cluster created by kops using a pre-existing VPC.

The route table for my subnets was the default one created by AWS. Setting the route table created by kops as main and deleting the one created by AWS solved my problem.
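
For anyone comparing route tables in a similar situation, a sketch using the AWS CLI (the VPC id is a placeholder); look for which table is the main one and whether the subnets your nodes use have a 0.0.0.0/0 route to an internet or NAT gateway:

```sh
# List the route tables in a VPC with their main-association flag and routes.
aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
  --query 'RouteTables[].{Id:RouteTableId,Main:Associations[0].Main,Routes:Routes}'
```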

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

michaelpporter commented 4 years ago

I am having the same issue. We have installed CA in 4 different VPCs: the 3 us-east-1 instances all work fine, while the ones in us-east-2 fail.

{"log":"E0109 14:11:31.735207       1 aws_manager.go:153] Failed to regenerate ASG cache: RequestError: send request failed\n","stream":"stderr","time":"2020-01-09T14:11:31.736567592Z"}
{"log":"caused by: Post https://autoscaling.us-east-2.amazonaws.com/: dial tcp: i/o timeout\n","stream":"stderr","time":"2020-01-09T14:11:31.736595585Z"}
{"log":"F0109 14:11:31.735237       1 cloud_provider_builder.go:149] Failed to create AWS Manager: RequestError: send request failed\n","stream":"stderr","time":"2020-01-09T14:11:31.736600881Z"}
{"log":"caused by: Post https://autoscaling.us-east-2.amazonaws.com/: dial tcp: i/o timeout\n","stream":"stderr","time":"2020-01-09T14:11:31.736605602Z"}

We have used the Helm version and the standalone multi-ASG example.

The EKS clusters are built with Terraform, so each one uses the same settings apart from region and VPC.

We have used 2 different accounts: the working VPCs are in one account, while the failing one is in another. We are adding an EKS cluster in us-east-1 to the currently failing account to test whether the region matters. I will report back our findings.

michaelpporter commented 4 years ago

Summary:

| account-vpc | region | status |
| --- | --- | --- |
| data-qa2 | us-east-2 | inconsistent |
| data-qa1 | us-east-1 | inconsistent |
| nonprod-qa1 | us-east-1 | success |
| nonprod-test | us-east-1 | success |
| nonprod-stage | us-east-1 | success |

I will report if we find anything new.

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1860#issuecomment-583773405):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

biswarup1290dass commented 4 years ago

@Jeffwan From the above discussion I was not able to figure out what the solution to this issue was. Would it be possible for you to help me with a similar issue, shown below?

E0323 15:19:16.485010 1 aws_manager.go:259] Failed to regenerate ASG cache: RequestError: send request failed
caused by: Post https://autoscaling.us-east-2.amazonaws.com/: dial tcp: i/o timeout
F0323 15:19:16.485057 1 aws_cloud_provider.go:330] Failed to create AWS Manager: RequestError: send request failed
caused by: Post https://autoscaling.us-east-2.amazonaws.com/: dial tcp: i/o timeout

I am using the multi-ASG deployment for AWS CA. Versions:

- CA version: k8s.gcr.io/cluster-autoscaler:v1.14.7
- EKS version: 1.14 (platform version eks.9)
- coredns: v1.6.6
- aws-node: amazon-k8s-cni:v1.5.5

Jeffwan commented 4 years ago

@biswarup1290dass Hmm... Can you share the dnsPolicy of your pod and check whether your CoreDNS pods are running well (check the logs, probably)?
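
A sketch of that check, assuming CoreDNS carries the usual `k8s-app=kube-dns` label (the default on EKS; adjust if yours differs):

```sh
# Confirm the CoreDNS pods are Running, then skim their recent logs
# for resolution errors or upstream timeouts.
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50
```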

yiyan-wish commented 3 years ago

Same issue:

E1210 03:15:12.303121 1 aws_manager.go:265] Failed to regenerate ASG cache: cannot autodiscover ASGs: RequestError: send request failed
caused by: Post "https://autoscaling.cn-northwest-1.amazonaws.com.cn/": dial tcp 52.82.209.176:443: i/o timeout

and the cluster-autoscaler pod's dnsPolicy is ClusterFirst.