hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

Intermittent network issues (read: connection reset Errors) #14163

Closed lijok closed 4 years ago

lijok commented 4 years ago

Terraform Version

Terraform v0.12.23

We're running a drift-detection workflow on GitHub-hosted GitHub Actions runners; it simply runs terraform plan and fails if the plan outputs any changes. This runs on a schedule every hour. We're getting request errors that cause terraform plan to fail around 2-3 times a day.

Some of the request errors we've encountered so far:

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/E26H********: read tcp 10.1.0.4:52046->54.239.29.26:443: read: connection reset by peer

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/E1B3D********: read tcp 10.1.0.4:33408->54.239.29.51:443: read: connection reset by peer

Error: error listing tags for CloudFront Distribution (E24R********): RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/tagging?Resource=arn%3Aaws%3Acloudfront%3A%3A*********%3Adistribution%2FE24********: read tcp 10.1.0.4:56918->54.239.29.65:443: read: connection reset by peer

Error: error getting S3 Bucket website configuration: RequestError: send request failed
caused by: Get https://******.s3.amazonaws.com/?website=: read tcp 10.1.0.4:59070->52.216.20.56:443: read: connection reset by peer

Error: error getting S3 Bucket replication: RequestError: send request failed
caused by: Get https://*******.s3.amazonaws.com/?replication=: read tcp 10.1.0.4:60534->52.216.138.67:443: read: connection reset by peer

Most of these seem to involve CloudFront and S3.

Thanks

unfor19 commented 4 years ago

Same here, using v0.12.28. I'm using drone.io's drone-terraform plugin; output log below.

The weird thing is that after a couple of restarts it works without any issues, so it's very inconsistent.

...
$ terraform version
Terraform v0.12.28
$ rm -rf .terraform
$ terraform init -input=false
Initializing modules...
...
Initializing the backend...
...
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
...
Initializing provider plugins...
- Checking for available provider plugins...
- Downloading plugin for provider "template" (hashicorp/template) 2.1.2...
- Downloading plugin for provider "random" (hashicorp/random) 2.3.0...
- Downloading plugin for provider "aws" (hashicorp/aws) 2.70.0...
...
* provider.aws: version = "~> 2.70"
* provider.random: version = "~> 2.3"
* provider.template: version = "~> 2.1"

Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
...
$ terraform get
$ terraform validate
Success! The configuration is valid.
$ terraform plan -out=plan.tfout -var image_tag=drone-latest -var sha=1a2b3c4d
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.
...
TONS OF Refreshing state messages...
...
Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/truncated: read tcp 192.168.0.1:33534->53.229.31.61:443: read: connection reset by peer

time="2020-07-16T15:43:20Z" level=fatal msg="Failed to execute a command" error="exit status 1" 
mo-hit commented 4 years ago

Getting the same issue when running plan or apply with CloudFront:

Error: error listing tags for CloudFront Distribution <redacted>: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/tagging?Resource=arn%3Aaws%3Acloudfront%3A%3<redacted>%3Adistribution%2<redacted>: read tcp 192.168.1.94:51422->54.239.29.65:443: read: connection
 reset by peer

Started happening intermittently about 3 days ago, on Terraform 0.12.28.

lcaproni-pp commented 4 years ago

Also seeing the same issue with CloudFront:

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/read: connection reset by peer

TF Version - 0.12.28

tbugfinder commented 4 years ago

Hi, today I also ran into such an error:

Error: error listing tags for ACM Certificate (arn:aws:acm:eu-west-1:111111111:certificate/800000f-1111-2222-bedb-9096d4c8a692): RequestError: send request failed
caused by: Post https://acm.eu-west-1.amazonaws.com/: read tcp 10.10.10.10:43720->123.1.1.1:443: read: connection reset by peer

(IPs changed) :-)

I have to use a proxy server in between (IP 123.x.x.x); however, I'd expect Terraform or the provider to retry.

$ terraform version
Terraform v0.12.25

Provider version: 2.66

tristanhoskenjs commented 4 years ago

Having the same issues in our CI/CD pipeline

acburdine commented 4 years ago

Seeing this same issue in Terraform Cloud, specifically with the cloudfront_distribution and cloudfront_origin_access_identity resources; it's happening almost daily at this point.

bflad commented 4 years ago

It would be great if we could get a Gist with debug logging enabled so we can troubleshoot further. If you are worried about any sensitive data, it can be encrypted with the HashiCorp GPG Key or redacted as necessary.

The maintainers will need this information to be able to see and triage the current provider and AWS Go SDK behavior while these errors occur.

lijok commented 4 years ago

> It would be great if we could get a Gist with debug logging enabled so we can troubleshoot further. If you are worried about any sensitive data, it can be encrypted with the HashiCorp GPG Key or redacted as necessary.
>
> The maintainers will need this information to be able to see and triage the current provider and AWS Go SDK behavior while these errors occur.

Cool, I'll enable debug on the workflow and post back once we catch it happening

mattburgess commented 4 years ago

We're hitting this too, and have debug logs enabled. We'll clear them with security and get back to you. In the meantime, though, we're seeing two slightly different behaviours.

Some calls cause the run to fail immediately, while others cause pauses of up to 15 minutes before a retry is attempted, at which point the plan succeeds and the CI job continues.

Our calls go through VPC endpoints wherever possible; where that's not an option, they go through an internet proxy (Squid). So far we've only seen the proxy-routed calls cause the 15-minute pause and the VPC-endpoint-routed calls cause an immediate failure, but a) there's too little data to extract any kind of pattern, and b) given they're different services, the retry logic might differ between them.

mattburgess commented 4 years ago

GPG-encrypted logs available at https://gist.github.com/mattburgess/2a00b1e77b00368781360ac8581383b9

analytical-dataset-generation_analytical-dataset-generation-qa_154.log.gpg - this one failed after seeing a single connection reset by peer error; no retries were attempted.

analytical-dataset-generation_analytical-dataset-generation-preprod_136.log.gpg - this one hung/paused/waited for 15 minutes having seen a connection reset by peer error, then retried and succeeded on its first retry.

awsiv commented 4 years ago

Seeing this on v0.12.29 as well.

blakemorgan commented 4 years ago

Just got this issue on v0.13.0. The first two times it failed and the third time worked as expected. All three times it was running in a GitHub Action.

ivorcheung commented 4 years ago

I had the same issue last night. Ran it again in the morning and it was fine. This is a rather intermittent issue.

Got this on v0.12.28.

ZsoltPath commented 4 years ago

Same here on TF v0.13.0 and AWS provider v3.3.0. As someone mentioned above, it mainly happens when running in GitHub Actions (CI/CD).

edwardofclt commented 4 years ago

We're also experiencing the issue in Terraform Cloud, using v0.12.28 and 0.12.29 with the AWS provider pinned to ~> 2.0.

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/ABCD1234567: read tcp 10.181.43.96:56350->54.239.29.51:443: read: connection reset by peer

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/ABCD1234567: read tcp 10.181.43.96:57570->54.239.29.51:443: read: connection reset by peer

cpuspellcaster commented 4 years ago

Same issue. Terraform v0.12.29, AWS provider 3.3.0, running in CircleCI. It's intermittent, but it occurs in roughly 10% of TF executions per day.

chrusty commented 4 years ago

I've had this issue with v0.12.26 and v0.12.28. It's so persistent that we've had to wrap every Terraform execution in multiple layers of retries; see the sketch below.
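
The wrapper boils down to something like this minimal sketch (a hypothetical reduction, not our actual script; the command flags and backoff are assumptions):

```go
// Rerun `terraform plan` a few times before giving up, since the
// connection resets are transient.
package main

import (
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	const attempts = 3
	for i := 1; i <= attempts; i++ {
		cmd := exec.Command("terraform", "plan", "-input=false")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err == nil {
			return // plan succeeded
		}
		log.Printf("terraform plan failed (attempt %d/%d); retrying", i, attempts)
		time.Sleep(time.Duration(i*10) * time.Second) // simple linear backoff
	}
	os.Exit(1) // still failing after all retries
}
```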

bflad commented 4 years ago

Hi folks 👋 It's not entirely clear why this is suddenly more of an issue in a lot more environments, except that maybe AWS' service APIs are resetting connections more aggressively. Understandably, though, this error is very problematic.

The challenge here is that the AWS Go SDK request handlers explicitly catch this specific condition, an ECONNRESET-type error during the read operation of an API call, and disable the retry logic. This logic has been present since AWS Go SDK version 1.20.2 and Terraform AWS Provider version 2.16.0. The code can be seen here:

https://github.com/aws/aws-sdk-go/blob/fde575c64841b291899bc112dfcdc206f609a305/aws/request/connection_reset_error.go#L8-L10

Which is eventually handled here:

https://github.com/aws/aws-sdk-go/blob/fde575c64841b291899bc112dfcdc206f609a305/aws/request/retryer.go#L168-L185
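
Condensed into a sketch (paraphrased from the two files linked above, not the verbatim SDK source), the behavior is roughly:

```go
package sketch

import "strings"

// isErrConnectionReset paraphrases the SDK check: a reset observed while
// *reading* the response is deliberately treated as non-retryable, because
// the SDK cannot know whether the server already processed the request.
// Resets seen before the request was fully written remain retryable.
func isErrConnectionReset(err error) bool {
	if strings.Contains(err.Error(), "read: connection reset") {
		return false // response read failed: retries are disabled
	}
	return strings.Contains(err.Error(), "connection reset")
}
```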

Some of the upstream decision process for this can be seen here:

Essentially boiling down to this:

> The logic behind this change is that the SDK is not able to sufficiently determine the state of an API request after successfully writing the request to the transport layer, but then failing to read the corresponding response due to a connection reset occurring. This is due to the fact that the SDK has no knowledge about whether the given operation is idempotent or whether it would be safe to retry.

I would personally agree with their assessment on the surface and say that the Terraform AWS Provider would not want to always retry in this case, since without some very careful investigation the potential effects would be broadly unknown. While we might be in a slightly better situation than the SDK as a whole, since we are mainly dealing with management API calls (Create/Read/Update/Delete/List) rather than event/data API calls, we would still have issues with this type of retry logic, including:

This leaves us in a little bit of a bind in this project. 😖 We have been purposefully avoiding implementing any custom retryer logic, to reduce maintenance and testing in that considerably harder area. Outside of that, we could implement this logic per AWS Go SDK service client, as we do today for some other retryable conditions (see aws/config.go); a rough sketch of that approach follows below. However, attempting to enumerate all safely idempotent API calls is a massive undertaking, even using loose heuristics such as treating all "read-only" calls (Describe*/Get*/List*) as retryable, and potentially Create*/Put*/Set* calls where we include a ClientToken/IdempotencyToken.
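
A minimal sketch of that per-client handler, modeled on the existing retryable-condition handlers in aws/config.go (an illustration, not actual provider code; which operations are truly safe to repeat is exactly the open question):

```go
package sketch

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/service/cloudfront"
)

// addReadRetries opts read-only CloudFront operations back into retries
// when the response read is reset, on the heuristic that repeating a
// Get*/List* call is safe. Illustrative only.
func addReadRetries(conn *cloudfront.CloudFront) {
	conn.Handlers.Retry.PushBack(func(r *request.Request) {
		readOnly := strings.HasPrefix(r.Operation.Name, "Get") ||
			strings.HasPrefix(r.Operation.Name, "List")
		if readOnly && r.Error != nil &&
			strings.Contains(r.Error.Error(), "read: connection reset") {
			r.Retryable = aws.Bool(true)
		}
	})
}
```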

Another option may be to propose this type of enhancement (or, some may say, bug fix) upstream in the AWS Go SDK codebase itself, but I'm not sure the upstream maintainers would want to get into this space either.

I'm out of time to ponder on this more for tonight, but hopefully this initial insight can kickstart some discussions.

chrusty commented 4 years ago

@bflad does it seem to you that this is particularly prevalent with the CloudFront API? I know that in my case it certainly is, and I can see from the rest of the comments in this issue that CloudFront is involved very often.

lifeofguenter commented 4 years ago

@bflad probably a long shot, but would there be any connection with https://github.com/terraform-providers/terraform-provider-aws/issues/14797 and https://github.com/hashicorp/terraform/issues/25835#issuecomment-674299327?

Since upgrading from 0.12.24 (though it could be a coincidence if AWS changed their rate limiting at the same time), we have been getting both the issues described in this thread and more intermittent "No valid credential sources found for AWS Provider" errors.

ZsoltPath commented 4 years ago

> This leaves us in a little bit of a bind in this project. 😖 We have been purposefully avoiding implementing any custom retryer logic, to reduce maintenance and testing in that considerably harder area. Outside of that, we could implement this logic per AWS Go SDK service client, as we do today for some other retryable conditions (see aws/config.go). However, attempting to enumerate all safely idempotent API calls is a massive undertaking, even using loose heuristics such as treating all "read-only" calls (Describe*/Get*/List*) as retryable, and potentially Create*/Put*/Set* calls where we include a ClientToken/IdempotencyToken.

@bflad I'd say retrying Describe*/Get*/List* would be harmless and would probably help a lot. I haven't looked into the debug logs, but on the surface it happens most of the time on a Describe operation: either when collecting state at the beginning, or when Terraform is periodically checking the status after creating a CloudFront distribution. Both would be solved with a retry.

Regarding write operations, would it be possible to add this as a switch, either on the apply command or as a lifecycle option on the actual resources? Then users could decide whether to risk it, based on their actual use case.

spouzols commented 4 years ago

Hello. Hitting the same kind of behaviour, more frequently lately. Terraform 0.12.28, AWS provider 2.70.0, running on Concourse CI on AWS. Almost always connection resets while waiting for a CloudFront distribution creation / update.

acburdine commented 4 years ago

For what it's worth: every time I've seen this issue, it's been on read calls to either CloudFront distribution configs or CloudFront origin access identities.

It may not be the best way to approach solving the issue, but given that the majority of the connection reset errors seem to involve specific CloudFront read calls plus a few others, it might be worth adding retries to individual API calls (CloudFront or otherwise) as they become problematic.

tbugfinder commented 4 years ago

I don't use any cloudfront resources.

bflad commented 4 years ago

As mentioned above, the most pragmatic approach may be to implement temporary quick fixes for the most problematic cases until we can determine root causes and work on more permanent solutions. In that effort, it would be great if we could rally around the most problematic API calls and figure out some additional debugging details along the way.

If you haven't already, we would strongly encourage filing an AWS Support technical support case to alert the AWS service teams of the increased API connection reset errors. Please feel free to link back to this GitHub issue. We are happy to introduce additional changes (e.g. extra logging in addition to our available debug logging) to support AWS troubleshooting efforts.

Can folks please comment with the below details:

For example:

Error:

Error: RequestError: send request failed caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/ABCD1234567: read tcp 10.181.43.96:57570->54.239.29.51:443: read: connection reset by peer


| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | Corporate network |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | Yes (Squid X.Y.Z) |
| How Many Resources | 50 in same configuration |
| How Often | 10% of runs |

Any other relevant information.

I have an initial hunch that this could be related to the recent announcement that Application and Classic Load Balancers are adding defense in depth with the introduction of Desync Mitigation Mode. Many production service APIs run on the same AWS infrastructure components that are publicly available. The underlying HTTP Desync Guardian project includes documentation and diagrams showing its behaviors; the mitigations section is particularly helpful in describing the conceptual behaviors.

What we may be seeing could be two-fold if it is related to the above:

Gathering the above details may help tease this out.


We may also want to create some additional AWS Go SDK tracking issues. For example, we may need the AWS Go SDK to always debug log the request of an API call, even if the request fails in this state. Currently, the debug logging seems to give just the error, and not the request payload like:

---[ REQUEST POST-SIGN ]-----------------------------
POST / HTTP/1.1
Host: ec2.eu-west-2.amazonaws.com
User-Agent: aws-sdk-go/1.33.21 (go1.14.5; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.12.19 (+https://www.terraform.io)
Content-Length: 79
Content-Type: application/x-www-form-urlencoded; charset=utf-8
X-Amz-Date: 20200814T100330Z
Accept-Encoding: gzip

Action=DescribeSecurityGroups&GroupId.1=sg-12345678&Version=2016-11-15
-----------------------------------------------------
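
(For reference, this HTTP-body debug logging is enabled in the AWS Go SDK v1 through the client configuration, roughly as in the sketch below; the region and service here are placeholders:)

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// LogDebugWithHTTPBody makes the SDK print full request/response
	// bodies, which is what produces REQUEST POST-SIGN blocks like the
	// one above.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String("eu-west-2"),
		LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody),
	}))
	_ = ec2.New(sess) // calls made through this client get logged
}
```
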
spouzols commented 4 years ago

Error:

Error: error waiting until CloudFront Distribution (XXXXX) is deployed: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/XXXXX: read tcp 10.x.x.x:35832->54.x.x.x:443: read: connection reset by peer
| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | AWS VPC (EC2, Concourse CI) |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | No |
| How Many Resources | 1 in same configuration |
| How Often | 80% of runs, 4/5 in 24h |

Terraform 0.12.28, AWS provider 2.70.0

lifeofguenter commented 4 years ago

We have a support ticket open with AWS for both this issue and https://github.com/terraform-providers/terraform-provider-aws/issues/14797; especially in the latter case, it would greatly help if TRACE showed complete requests + responses, so that we and AWS could understand what is going on.

Or maybe even something separate, like an HTTP_TRACE level that shows only requests + responses, which in most cases is the more interesting part when debugging these types of issues.
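
Something like that could be built from a logging http.RoundTripper; a rough sketch of the idea (illustrative only, HTTP_TRACE is not an existing Terraform log level):

```go
package sketch

import (
	"log"
	"net/http"
	"net/http/httputil"
)

// traceTransport dumps each outgoing request and its response, which is
// roughly the behaviour described above.
type traceTransport struct{ next http.RoundTripper }

func (t traceTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if dump, err := httputil.DumpRequestOut(req, true); err == nil {
		log.Printf("HTTP_TRACE request:\n%s", dump)
	}
	resp, err := t.next.RoundTrip(req)
	if err != nil {
		log.Printf("HTTP_TRACE transport error: %v", err) // e.g. connection reset
		return nil, err
	}
	if dump, derr := httputil.DumpResponse(resp, true); derr == nil {
		log.Printf("HTTP_TRACE response:\n%s", dump)
	}
	return resp, nil
}
```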

We are experiencing this issue on our Jenkins instance hosted on EC2; we run multiple nodes behind a NAT gateway (so a shared IP for outgoing connections).

lijok commented 4 years ago

There is definitely a problem on the AWS side. If you go to the CloudFront console and hit refresh a few times, you're now very likely to encounter an error (see attached screenshot).

encron commented 4 years ago

Error:

Error: RequestError: send request failed
       caused by: Get "https://cloudfront.amazonaws.com/2020-05-31/origin-access-identity/cloudfront/E33T16DJ8BRX2": read tcp 10.170.3.101:33268->54.239.29.51:443: read: connection reset by peer
| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read (refreshing state or waiting for the distribution to be deployed/destroyed) |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | AWS VPC |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | No |
| How Many Resources | 2 |
| How Often | 90% of runs |

At first I assumed this was due to Terraform polling while waiting for the distribution to be deployed, which is why I added wait_for_deployment = false, yet that seems to have worsened the behaviour, and it's now failing even when refreshing state. I saw the bulk of the errors yesterday, when disabling CloudFront distributions also seemed to take a very long time. This morning, upon retrying, the error rate is much lower.

lijok commented 4 years ago

We haven't had this happen for more than a week now. Could it have been fixed on the AWS side?

bflad commented 4 years ago

Hi again 👋 Since it appears that this was handled on the AWS side (judging both by this issue and by the lack of Terraform support tickets), our preference is to leave things as they are for now. If this comes up again, especially since CloudFront seems to have this issue very prominently when it occurs, we can definitely think more about this network connection handling. 👍

tibbon commented 3 years ago

I started seeing these today.

Error: Error reading IAM policy version arn:aws:iam::XXXX:policy/OktaChildAccountPolicy: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52180->52.94.225.3:443: read: connection reset by peer

Error: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52172->52.94.225.3:443: read: connection reset by peer

Error: Error reading IAM Role Okta-Idp-cross-account-role: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52171->52.94.225.3:443: read: connection reset by peer

tarazena commented 4 years ago

@tibbon I was seeing it a few minutes ago and now it's gone.

isikdos commented 3 years ago

I'm seeing it and the issue persists. I've been restarting my CI pipeline for about half an hour hoping it's transient, but it's sticking around. Likewise, mine is with iam.amazonaws.com.

Edit: the 40th minute was the charm. You can force through it with enough retries. As far as I could tell, I only had 2 or 3 items that were failing; if you have many more, you might just be probabilistically stuck until the broader problem is resolved.

dchernivetsky commented 3 years ago

Same here. Started half an hour ago.

azemon commented 3 years ago

I just started hitting this issue, too. It's an old Terraform project, which we run several times per week; all of a sudden, it's causing problems.

Error: Error reading IAM Role ABCDEF: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.0.129:44552->52.94.225.3:443: read: connection reset by peer

Error: error finding IAM Role (GHIJKL) Policy Attachment (arn:aws:iam::aws:policy/AmazonInspectorFullAccess): RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.0.129:44872->52.94.225.3:443: read: connection reset by peer
claco commented 3 years ago

https://status.aws.amazon.com/

> 1:50 PM PDT We are investigating increased error rates and latencies affecting IAM. IAM related requests to other AWS services may also be impacted.

ghost commented 3 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!