Closed lijok closed 4 years ago
Same here, using v0.12.28. I'm using drone.io's drone-terraform plugin; output log below.
The weird thing - after a couple of restarts, it works without any issues, so it's very inconsistent
...
$ terraform version
Terraform v0.12.28
$ rm -rf .terraform
$ terraform init -input=false
Initializing modules...
...
Initializing the backend...
...
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
...
Initializing provider plugins...
- Checking for available provider plugins...
- Downloading plugin for provider "template" (hashicorp/template) 2.1.2...
- Downloading plugin for provider "random" (hashicorp/random) 2.3.0...
- Downloading plugin for provider "aws" (hashicorp/aws) 2.70.0...
...
* provider.aws: version = "~> 2.70"
* provider.random: version = "~> 2.3"
* provider.template: version = "~> 2.1"
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
...
$ terraform get
$ terraform validate
Success! The configuration is valid.
$ terraform plan -out=plan.tfout -var image_tag=drone-latest -var sha=1a2b3c4d
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.
...
TONS OF Refreshing state messages...
...
Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/truncated: read tcp 192.168.0.1:33534->53.229.31.61:443: read: connection reset by peer
time="2020-07-16T15:43:20Z" level=fatal msg="Failed to execute a command" error="exit status 1"
Getting the same issue when running plan or apply, with cloudfront
Error: error listing tags for CloudFront Distribution <redacted>: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/tagging?Resource=arn%3Aaws%3Acloudfront%3A%3<redacted>%3Adistribution%2<redacted>: read tcp 192.168.1.94:51422->54.239.29.65:443: read: connection
reset by peer
Started happening intermittently about 3 days ago, on TF 0.12.28.
Also seeing the same issue with CloudFront:
Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/read: connection reset by peer
TF Version - 0.12.28
Hi, today I also ran into this error:
Error: error listing tags for ACM Certificate (arn:aws:acm:eu-west-1:111111111:certificate/800000f-1111-2222-bedb-9096d4c8a692): RequestError: send request failed
caused by: Post https://acm.eu-west-1.amazonaws.com/: read tcp 10.10.10.10:43720->123.1.1.1:443: read: connection reset by peer
(IPs changed) :-)
I have to use a proxy server in between (IP 123.x.x.x); however, I'd expect Terraform or the provider to retry.
$ terraform version
Terraform v0.12.25
Provider version: 2.66
Having the same issues in our CI/CD pipeline
Seeing this same issue in Terraform Cloud, specifically with the cloudfront_distribution and cloudfront_origin_access_identity resources - it's happening almost daily at this point.
It would be great if we could get a Gist with debug logging enabled so we can further troubleshoot. If you are worried about any sensitive data, it can be encrypted with the HashiCorp GPG Key or redacted as necessary.
The maintainers will need this information to see and triage the current provider and AWS Go SDK behavior during these failures.
Cool, I'll enable debug on the workflow and post back once we catch it happening
We're hitting this too, and have debug logs enabled. Will clear this with security and get back to you. In the mean time though, we're seeing two slightly different behaviours.
Some calls cause the run to fail immediately, while others cause pauses of up to 15 minutes before a retry is attempted, at which point the plan succeeds and the CI job continues.
Some of our calls go through VPC endpoints wherever possible, but where that's not possible they go through an internet proxy (Squid). So far, we've only seen the proxy-routed calls cause the 15-minute pause and the VPC-endpoint-routed calls cause an immediate failure, but (a) there's too little data to extract any kind of pattern, and (b) given they're different services, the retry logic might differ between services.
GPG-encrypted logs available at https://gist.github.com/mattburgess/2a00b1e77b00368781360ac8581383b9
analytical-dataset-generation_analytical-dataset-generation-qa_154.log.gpg - this one failed after seeing a single connection reset by peer error; no retries were attempted.
analytical-dataset-generation_analytical-dataset-generation-preprod_136.log.gpg - this one hung/paused/waited for 15 minutes after seeing a connection reset by peer error, then retried and succeeded on its first retry.
Seeing this on v0.12.29 as well.
Just got this issue on v0.13.0. The first two times it failed and the third time it worked as expected. All three times it was running in a GitHub Action.
I had the same issue last night. Ran it again in the morning and it was fine. This is a rather intermittent issue.
Got this on v0.12.28.
Same here on TF v0.13.0 and AWS provider v3.3.0. And as someone mentioned above, it mainly happens when running in GitHub Actions (CI/CD).
We're experiencing the issue also in Terraform Cloud, using v0.12.28 and v0.12.29, with the AWS provider pinned to ~> 2.0.
Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/ABCD1234567: read tcp 10.181.43.96:56350->54.239.29.51:443: read: connection reset by peer
Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/ABCD1234567: read tcp 10.181.43.96:57570->54.239.29.51:443: read: connection reset by peer
Same issue. Terraform v0.12.29, AWS provider 3.3.0, running in CircleCI. It's intermittent but occurs in roughly 10% of our Terraform executions per day.
I've had this issue with v0.12.26 and v0.12.28. It's so persistent that we've had to wrap every Terraform execution in multiple layers of retry.
Hi folks 👋 It's not entirely clear why this is suddenly more of an issue for a lot more environments, except that perhaps AWS' service APIs are resetting connections more aggressively. Understandably, this error is very problematic.
The challenge here is that the AWS Go SDK request handlers explicitly catch this specific condition, an ECONNRESET-type error during the read operation of an API call, and disable the retry logic. This logic has been present since AWS Go SDK version 1.20.2 and Terraform AWS Provider version 2.16.0. The code can be seen here:
Which is eventually handled here:
Some of the upstream decision process for this can be seen here:
Essentially boiling down to this:
The logic behind this change is that the SDK is not able to sufficiently determine the state of an API request after successfully writing the request to the transport layer, but then failing to read the corresponding response due to a connection reset occurring. This is due to the fact that the SDK has no knowledge about whether the given operation is idempotent or whether it would be safe to retry.
I would personally agree with their assessment on the surface and say that the Terraform AWS Provider would not want to always retry in this case, since without some very careful investigation the potential effects would broadly be unknown. While we might be in a slightly better situation than the whole SDK since we are mainly dealing with management API calls (Create/Read/Update/Delete/List) rather than event/data API calls, we would still have issues with this type of retry logic including:
This leaves us in a little bit of a bind in this project. 😖 We have been purposefully avoiding implementing any custom retryer logic to decrease maintenance and testing in that considerably harder area. Outside of that, we could implement this logic per AWS Go SDK service client, as we do today for some other retryable conditions (see aws/config.go); however, attempting to enumerate all safely idempotent API calls is a massive undertaking, even using loose heuristics such as saying all "read-only" calls (Describe*/Get*/List*) are retryable (and potentially Create*/Put*/Set* where we include a ClientToken/IdempotencyToken) for this specific handling.
Another option may be to suggest this type of enhancement (or some may say bug fix) upstream into the AWS Go SDK codebase itself, but I'm not sure if the upstream maintainers would want to get into this space either.
I'm out of time to ponder on this more for tonight, but hopefully this initial insight can kickstart some discussions.
@bflad does it seem to you that this is particularly prevalent with the CloudFront API? I know that in my case it certainly is, and I can see from the rest of the comments in this issue that CloudFront is involved very often.
@bflad most probably a long shot, but would there be any connection with https://github.com/terraform-providers/terraform-provider-aws/issues/14797 + https://github.com/hashicorp/terraform/issues/25835#issuecomment-674299327 ?
It seems that after upgrading from 0.12.24 (though it could be coincidence that AWS changed their rate limiting at the same time) we have been getting both the issues described in this thread and more intermittent "No valid credential sources found for AWS Provider" errors.
@bflad I'd say retrying Describe*/Get*/List* would be harmless and would probably help a lot.
I haven't looked into the debug log, but on the surface it happens most of the time on a Describe operation: either when collecting state at the beginning, or when Terraform is periodically polling the status after creating a CloudFront distribution. Both would be solved with a retry.
Regarding write operations, would it be possible to add this as a switch, either on the apply command or as a lifecycle option on the affected resources? Then users can decide whether to risk it based on their actual use case.
Hello. Hitting the same kind of behaviour, more frequently lately. Terraform 0.12.28, AWS provider 2.70.0, running on Concourse CI on AWS. Almost always connection resets while waiting for a CloudFront distribution creation / update.
for what it's worth - every time I've seen this issue it's been on read calls to either Cloudfront distribution configs or Cloudfront origin access identities.
It may not be the best way to approach solving the issue, but given that the majority of the connection reset issues seem to be with specific Cloudfront read calls + a few others, it might be worth just adding retries to individual API calls (Cloudfront or otherwise) as they become problematic?
I don't use any cloudfront resources.
As mentioned above, the most pragmatic approach for this may be to try and implement temporary quick fixes for the most problematic cases until we can determine root causes and work on more permanent solutions. In an effort to accomplish that, it would be great if we can rally around the most problematic API calls and see if we cannot figure out some additional debugging details along this journey.
If you haven't already, we would strongly encourage filing an AWS Support technical support case to alert the AWS service teams of the increased API connection reset errors. Please feel free to link back to this GitHub issue. We are happy to introduce additional changes (e.g. extra logging in addition to our available debug logging) to support AWS troubleshooting efforts.
Can folks please comment with the below details:
- The full error output, including the RequestError line and the caused by: line (redacting any sensitive resource identifiers and IP addresses if necessary)
- Whether you have the -parallelism flag configured above 10
For example:
Error:
Error: RequestError: send request failed caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/ABCD1234567: read tcp 10.181.43.96:57570->54.239.29.51:443: read: connection reset by peer
| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | Corporate network |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | Yes (Squid X.Y.Z) |
| How Many Resources | 50 in same configuration |
| How Often | 10% of runs |
Any other relevant information.
I have an initial hunch that this could be related to the recent announcement that Application and Classic Load Balancers are adding defense in depth with the introduction of Desync Mitigation Mode. Many production service APIs run on the same AWS infrastructure components that are publicly available. The underlying HTTP Desync Guardian project includes some documentation and diagrams showing its behaviors. The mitigations section is particularly helpful in describing the conceptual behaviors.
What we may be seeing could be two-fold if it is related to the above:
Gathering the above details may help tease this out.
We may also want to create some additional AWS Go SDK tracking issues as well. For example, we may need the AWS Go SDK to always debug log the request of API calls, even if the request fails in this state. Currently, the debug logging seems to just give the error and not the request payload like:
---[ REQUEST POST-SIGN ]-----------------------------
POST / HTTP/1.1
Host: ec2.eu-west-2.amazonaws.com
User-Agent: aws-sdk-go/1.33.21 (go1.14.5; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.12.19 (+https://www.terraform.io)
Content-Length: 79
Content-Type: application/x-www-form-urlencoded; charset=utf-8
X-Amz-Date: 20200814T100330Z
Accept-Encoding: gzip
Action=DescribeSecurityGroups&GroupId.1=sg-12345678&Version=2016-11-15
-----------------------------------------------------
Error:
Error: error waiting until CloudFront Distribution (XXXXX) is deployed: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/XXXXX: read tcp 10.x.x.x:35832->54.x.x.x:443: read: connection reset by peer
| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | AWS VPC (EC2, Concourse CI) |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | No |
| How Many Resources | 1 in same configuration |
| How Often | 80% of runs, 4/5 in 24h |
Terraform 0.12.28, AWS provider 2.70.0
We have a support ticket open with AWS for both this issue and https://github.com/terraform-providers/terraform-provider-aws/issues/14797 - especially in the latter case, it would greatly help if TRACE showed complete requests + responses, for both us and AWS to understand what is going on.
Or maybe even something separate like HTTP_TRACE that only shows requests + responses, which in most cases is the more interesting part when debugging these types of issues.
We are experiencing this issue on our Jenkins hosted on EC2 - we run multiple nodes behind a NAT gateway (so a shared IP for outgoing connections).
There is definitely a problem on the AWS side. If you go to the CloudFront console and hit refresh a few times, you're now very likely to encounter this.
Error:
Error: RequestError: send request failed
caused by: Get "https://cloudfront.amazonaws.com/2020-05-31/origin-access-identity/cloudfront/E33T16DJ8BRX2": read tcp 10.170.3.101:33268->54.239.29.51:443: read: connection reset by peer
| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read (refreshing state or waiting for the distribution to be deployed/destroyed) |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | AWS VPC |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | No |
| How Many Resources | 2 |
| How Often | 90% of runs |
At first I assumed this was due to Terraform polling and waiting for the distribution to be deployed, which is why I added wait_for_deployment = false, yet it seems to have worsened the behaviour; it's even failing while refreshing the state. I saw the bulk of the errors yesterday, when disabling CloudFront distributions also seemed to take a very long time. This morning, upon retrying again, the error rate is much lower.
We haven't had this happen for more than a week now. Could it have been fixed on the AWS side?
Hi again 👋 Since it appears that this was handled on the AWS side (both in this issue and lack of Terraform support tickets), our preference will be to leave things as they are for now. If this comes up again, especially since CloudFront seems to very prominently have this issue when it occurs, we can definitely think more about this network connection handling. 👍
I started seeing these today.
Error: Error reading IAM policy version arn:aws:iam::XXXX:policy/OktaChildAccountPolicy: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52180->52.94.225.3:443: read: connection reset by peer
Error: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52172->52.94.225.3:443: read: connection reset by peer
Error: Error reading IAM Role Okta-Idp-cross-account-role: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52171->52.94.225.3:443: read: connection reset by peer
@tibbon I was seeing it a few minutes ago and now it's gone.
I'm seeing it and the issue persists. I've been restarting my CI pipeline for about half an hour hoping it's transient, but it's sticking around. Likewise, mine is with the iam.amazonaws.com endpoint.
Edit: the 40th minute was the charm. You can force through it with enough retries. As far as I could tell, I only had 2 or 3 items that were failing. If you have many more, you might just be probabilistically stuck until the broader problem is resolved.
Same here. Started half an hour ago.
I just starting hitting this issue, too. It's an old Terraform project, which we run several times per week. All of a sudden, it's causing problems.
Error: Error reading IAM Role ABCDEF: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.0.129:44552->52.94.225.3:443: read: connection reset by peer
Error: error finding IAM Role (GHIJKL) Policy Attachment (arn:aws:iam::aws:policy/AmazonInspectorFullAccess): RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.0.129:44872->52.94.225.3:443: read: connection reset by peer
https://status.aws.amazon.com/
1:50 PM PDT We are investigating increased error rates and latencies affecting IAM. IAM related requests to other AWS services may also be impacted.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!
Terraform Version
We're running a drift detection workflow using GitHub-hosted GitHub Actions runners, which simply runs terraform plan and fails if it outputs anything. This runs on a schedule every hour. We're getting request errors, causing terraform plan to fail, around 2-3 times a day.
Some of the request errors we've so far encountered:
Most of these seem to be CloudFront and S3
Thanks