Open neogibson opened 1 week ago
Similar #39115
Edit: 🤦‍♂️ sorry about the closure there
We're facing this issue too, but I wanted to add more specifics. 5.64.0 and below would set the ServerName in the tls.Config of the HTTP Transport. This would cause it to be sent in the Server Name Indication (SNI) extension of the ClientHello of TLS connections. Domain-based firewalls need SNI to be set on HTTPS connections to work correctly.
I noticed that 5.65.0 upgrades to Go 1.23, which does support Encrypted Client Hello. So it might be related to something like Encrypted Client Hello being used now, or some other change with Go 1.23 where ServerName isn't being set. It's also possible it's a change to the aws-sdk, and not terraform directly. It's hard to pinpoint exactly where the HTTP Transport is set up.
We are facing this issue as well. It does not only affect STS. It seems to affect many, if not all, AWS service calls, including EC2, SSM, Organizations, and Route53, to name a few. @kevanslumin's statement above is consistent with what we are experiencing. Pinning the provider to 5.64.0 is our current workaround.
The HTTP client used is set up here: https://github.com/hashicorp/aws-sdk-go-base/blob/main/http_client.go.
I ran a small test program to use the Hashicorp http client linked above and did a packet capture. It turns out that it is sending SNI in the Client Hello after all. So after that I started to suspect that the bug might actually be in Suricata, which is what the AWS Firewall is based on, and I found this: https://forum.suricata.io/t/suricata-cannot-detect-tls-sni-from-overlongtls-packets/4640/2. The AWS version of Suricata is 6.0.9 from their docs.
@nathwill built a program with Go 1.22 and Go 1.23, and the packet capture showed that the Client Hello is around 300 bytes for 1.22, whereas for 1.23 it is around 1500 bytes. We found that an additional key exchange algorithm was offered in 1.23. He found the following line in the Go 1.23 release notes (https://tip.golang.org/doc/go1.23#:~:text=The%20experimental%20post%2Dquantum%20key%20exchange%20mechanism%20X25519Kyber768Draft00%20is%20now%20enabled%20by%20default%20when%20Config.CurvePreferences%20is%20nil.%20The%20default%20can%20be%20reverted%20by%20adding%20tlskyber%3D0%20to%20the%20GODEBUG%20environment%20variable):
The experimental post-quantum key exchange mechanism X25519Kyber768Draft00 is now enabled by default when [Config.CurvePreferences](https://tip.golang.org/pkg/crypto/tls#Config.CurvePreferences) is nil. The default can be reverted by adding tlskyber=0 to the GODEBUG environment variable.
We tried setting GODEBUG with export GODEBUG=tlskyber=0 and running terraform, and it seems to work properly with SNI being detected at the AWS Firewall and being allowed through.
We plan on opening an AWS case about it, so we'll see how that goes. In the meantime, hopefully this workaround will be effective for everyone.
@kevanslumin Many thanks for your persistence on this issue and the detailed explanation.
For this week's release (v5.67.0, likely available later today) we reverted to go1.22.6 (https://github.com/hashicorp/terraform-provider-aws/pull/39256) but are planning on re-upgrading to go1.23.x ASAP.
@ewbankkit That sounds good, thanks. Hypothetically, you might be able to set tls.Config.CurvePreferences to make the workaround unneeded. It doesn't seem like it's really the aws terraform provider's problem, but it could stem the tide of issues about it.
Yes, we will need to discuss a solution/workaround.
We have opened https://github.com/hashicorp/terraform-provider-aws/issues/39311 to capture the longer-term work.
Terraform Core Version
1.9.2
AWS Provider Version
5.65.0
Affected Resource(s)
This is affecting provider blocks where we define an assume_role to assume roles into accounts for terraform to perform its plan/apply.
Expected Behavior
In AWS provider version 5.64.0, terraform init performs as expected and properly assumes all roles for the provider blocks in our various accounts.
Actual Behavior
In AWS provider version 5.65.0, terraform init fails to assume roles correctly and errors with STS TLS timeout errors.
Relevant Error/Panic Output Snippet
Terraform Configuration Files
Steps to Reproduce
sts.ca-central-1.amazonaws.com allowlisted in our firewall, before provider version 5.65.0 this was working as expected and the codebuilds in the VPCs were authing to STS normally.
terraform init, fails with STS TLS timeout errors.
Debug Output
No response
Panic Output
No response
Important Factoids
This is almost certainly related to the fact that we run some terraform plans in AWS Codebuild placed in AWS VPCs, which are behind an allow-list firewall, restricting the domain names and IPs that these codebuilds can reach out to.
We have the STS endpoint sts.ca-central-1.amazonaws.com allowlisted in our firewall, and before provider version 5.65.0 this allowlist was working as expected.
We do see blocks in the firewall logs when these codebuilds reach out to the AWS STS endpoint. It seems terraform is sending requests directly to the IPs of STS, and not the domain name. Usually, when a request to a domain name is blocked, we see the domain name in our firewall logs, but with these we only get the IP addresses of the STS endpoints. Since we don't allowlist the AWS IPs for the STS service, only the domain name sts.ca-central-1.amazonaws.com, I believe this is the basic problem.
It's not feasible to allowlist all AWS IPs related to the STS service, so any insight into why this provider version change could be causing this would be greatly appreciated!
References
No response
Would you like to implement a fix?
None