hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

[Bug]: AWS STS TLS Timeouts when running `terraform init` #39125

Open neogibson opened 1 week ago

neogibson commented 1 week ago

Terraform Core Version

1.9.2

AWS Provider Version

5.65.0

Affected Resource(s)

This affects provider blocks where we define an assume_role block to assume roles in other accounts so that Terraform can perform its plan/apply.

Expected Behavior

In AWS provider version 5.64.0, terraform init performs as expected and properly assumes all roles for the provider blocks in our various accounts.

Actual Behavior

In AWS provider version 5.65.0, terraform init fails to assume roles correctly and errors with STS TLS timeout errors.

Relevant Error/Panic Output Snippet

Planning failed. Terraform encountered an error while generating this plan.
Error: Cannot assume IAM Role
  with provider["registry.terraform.io/hashicorp/aws"].<alias>,
  on initialize.tf line 17, in provider "aws":
  17: provider "aws" {
IAM Role (arn:aws:iam::<account-id>:role/<role-name>) cannot be
assumed.
There are a number of possible causes of this - the most common are:
  * The credentials used in order to assume the role are invalid
  * The credentials do not have appropriate permission to assume the role
  * The role ARN is not valid
Error: operation error STS: AssumeRole, exceeded maximum number of attempts,
3, https response error StatusCode: 0, RequestID: , request send failed, Post
"https://sts.ca-central-1.amazonaws.com/": net/http: TLS handshake timeout
Error: Cannot assume IAM Role
  with provider["registry.terraform.io/hashicorp/aws"].<alias>,
  on initialize.tf line 31, in provider "aws":
  31: provider "aws" {
IAM Role (arn:aws:iam::<account-id>:role/<role-name>) cannot be
assumed.
There are a number of possible causes of this - the most common are:
  * The credentials used in order to assume the role are invalid
  * The credentials do not have appropriate permission to assume the role
  * The role ARN is not valid
Error: operation error STS: AssumeRole, exceeded maximum number of attempts,
3, https response error StatusCode: 0, RequestID: , request send failed, Post
"https://sts.ca-central-1.amazonaws.com/": net/http: TLS handshake timeout

Terraform Configuration Files

provider "aws" {
  alias       = "<alias>"
  region      = "<region>"
  max_retries = 20

  assume_role {
    role_arn = "arn:aws:iam::<account-id>:role/<role-name>"
  }

  default_tags {
    tags = local.provider_tags
  }
}

Steps to Reproduce

  1. We run our Terraform plans on AWS CodeBuild in VPCs behind a domain allowlist firewall.
  2. We have sts.ca-central-1.amazonaws.com allowlisted in our firewall. Before provider version 5.65.0, this worked as expected and the CodeBuild jobs in the VPCs authenticated to STS normally.
  3. Try to run terraform init; it fails with STS TLS timeout errors.

Debug Output

No response

Panic Output

No response

Important Factoids

This is almost certainly related to the fact that we run some Terraform plans in AWS CodeBuild inside AWS VPCs behind an allowlist firewall, which restricts the domain names and IPs that these CodeBuild jobs can reach.

We have the STS endpoint sts.ca-central-1.amazonaws.com allowlisted in our firewall, and before provider version 5.65.0 this allowlist was working as expected.

We do see blocks in the firewall logs when these CodeBuild jobs reach out to the AWS STS endpoint. It seems Terraform is sending requests directly to the STS IP addresses rather than to the domain name. Usually, when a request to a domain name is blocked, we see the domain name in our firewall logs, but for these requests we only see the IP addresses of the STS endpoints. Since we allowlist only the domain name sts.ca-central-1.amazonaws.com, and not the AWS IPs for the STS service, I believe this is the basic problem.

It's not feasible to allowlist all AWS IPs related to the STS service, so any insight into why this provider version change could be causing this would be greatly appreciated!

References

No response

Would you like to implement a fix?

None


justinretzolk commented 1 week ago

Similar #39115

Edit: 🤦‍♂️ sorry about the closure there

kevanslumin commented 1 week ago

We're facing this issue too, and I wanted to add more specifics. Versions 5.64.0 and below would set the ServerName in the tls.Config of the HTTP transport, which causes it to be sent in the Server Name Indication (SNI) extension of the TLS ClientHello. Domain-based firewalls need SNI to be set on HTTPS connections to work correctly.
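To illustrate the relationship between tls.Config.ServerName and SNI, here is a minimal sketch that runs an in-memory handshake attempt and reports the server name the client advertised. It uses only the Go standard library; the helper name observedSNI is hypothetical, and this is not the provider's or aws-sdk-go-base's actual code.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
)

// observedSNI (hypothetical helper) starts a TLS handshake over an in-memory
// pipe and returns the server name the client put in its ClientHello.
func observedSNI(serverName string) string {
	clientConn, serverConn := net.Pipe()
	defer clientConn.Close()

	seen := make(chan string, 1)
	server := tls.Server(serverConn, &tls.Config{
		// GetConfigForClient is called once the ClientHello has been parsed,
		// so no certificate is needed just to inspect the SNI value.
		GetConfigForClient: func(hello *tls.ClientHelloInfo) (*tls.Config, error) {
			seen <- hello.ServerName
			return nil, errors.New("stopping after ClientHello")
		},
	})
	go func() {
		_ = server.Handshake() // fails by design after the callback runs
		serverConn.Close()
	}()

	// The client copies ServerName into the SNI extension of its ClientHello.
	client := tls.Client(clientConn, &tls.Config{ServerName: serverName})
	_ = client.Handshake() // fails by design: the server aborts the handshake

	select {
	case name := <-seen:
		return name
	default:
		return "" // the ClientHello never reached the callback
	}
}

func main() {
	fmt.Println("SNI seen by server:", observedSNI("sts.ca-central-1.amazonaws.com"))
}
```

A domain allowlist firewall performs roughly the same inspection on the wire, so anything that stops it from parsing the ClientHello also hides the domain name from it.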

I noticed that 5.65.0 upgrades to Go 1.23, which supports Encrypted Client Hello. So it might be related to Encrypted Client Hello now being used, or to some other change in Go 1.23 that leaves ServerName unset. It's also possible that the change is in the AWS SDK and not in Terraform directly. It's hard to pinpoint exactly where the HTTP transport is set up.

dejongm commented 1 week ago

We are facing this issue as well. It does not affect only STS; it seemingly affects many, if not all, AWS service calls, including EC2, SSM, Organizations, and Route 53, to name a few. @kevanslumin's statement above is consistent with what we are experiencing. Pinning the provider to 5.64.0 is our current workaround.

ewbankkit commented 1 week ago

The HTTP client used is set up here: https://github.com/hashicorp/aws-sdk-go-base/blob/main/http_client.go.

kevanslumin commented 3 days ago

I ran a small test program to use the Hashicorp http client linked above and did a packet capture. It turns out that it is sending SNI in the Client Hello after all. So after that I started to suspect that the bug might actually be in Suricata, which is what the AWS Firewall is based on, and I found this: https://forum.suricata.io/t/suricata-cannot-detect-tls-sni-from-overlongtls-packets/4640/2. The AWS version of Suricata is 6.0.9 from their docs.

@nathwill built a program with Go 1.22 and Go 1.23, and the packet capture showed that the Client Hello for 1.22 is around 300 bytes, whereas for 1.23 it is around 1500 bytes. We found that an additional key exchange algorithm was used in 1.23. He found the following line in the Go 1.23 release notes (https://tip.golang.org/doc/go1.23#:~:text=The%20experimental%20post%2Dquantum%20key%20exchange%20mechanism%20X25519Kyber768Draft00%20is%20now%20enabled%20by%20default%20when%20Config.CurvePreferences%20is%20nil.%20The%20default%20can%20be%20reverted%20by%20adding%20tlskyber%3D0%20to%20the%20GODEBUG%20environment%20variable):

The experimental post-quantum key exchange mechanism X25519Kyber768Draft00 is now enabled by default when [Config.CurvePreferences](https://tip.golang.org/pkg/crypto/tls#Config.CurvePreferences) is nil. The default can be reverted by adding tlskyber=0 to the GODEBUG environment variable.

We tried setting GODEBUG with export GODEBUG=tlskyber=0 and running terraform, and it seems to work properly with SNI being detected at the AWS Firewall and being allowed through.

We plan on opening an AWS case about it, so we'll see how that goes. In the meantime, hopefully this workaround will be effective for everyone.

ewbankkit commented 3 days ago

@kevanslumin Many thanks for your persistence on this issue and the detailed explanation 👍. For this week's release (v5.67.0, likely available later today) we reverted to go1.22.6 (https://github.com/hashicorp/terraform-provider-aws/pull/39256) but are planning on re-upgrading to go1.23.x ASAP.

kevanslumin commented 3 days ago

@ewbankkit That sounds good, thanks. Hypothetically, you might be able to set tls.Config.CurvePreferences to make the workaround unneeded. It doesn't seem like it's really the aws terraform provider's problem, but it could stem the tide of issues about it.

ewbankkit commented 3 days ago

Yes, we will need to discuss a solution/workaround.

ewbankkit commented 2 days ago

We have opened https://github.com/hashicorp/terraform-provider-aws/issues/39311 to capture the longer-term work.