hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0
9.83k stars 9.18k forks source link

[Bug]: Intermittent error on apply, refresh, or destroy: failed to refresh cached credentials #28364

Open dedeen opened 1 year ago

dedeen commented 1 year ago

Terraform Core Version

v1.3.6

AWS Provider Version

v4.46.0

Affected Resource(s)

When using for_each over variable.tf file, the aws_vpc module fails intermittently to either apply or destroy a VPC. The error indicates: Sign Request ec2/CreateSubnet failed, not retrying, error failed to refresh cached credentials, failed to load credentials, deserialization failed, failed to decode error message, EOF.

This doesn't happen if not using for_each.

The failure is intermittent, and rerunning either the apply or destroy usually works correctly.

Expected Behavior

VPC should be created, with private and public subnet, IGW, NATGW.

Actual Behavior

If building (applying), some parts of VPC are built. Rerunning typically builds those parts that failed. On destroy, VPC deletion fails, and rerunning typically works.

Relevant Error/Panic Output Snippet

022-12-15T00:16:20.800Z [DEBUG] provider.terraform-provider-aws_v4.46.0_x5: [DEBUG] [aws-sdk-go] DEBUG: Sign Request ec2/CreateSubnet failed, not retrying, error failed to refresh cached credentials, failed to load credentials, deserialization failed, failed to decode error message, EOF
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Called downstream: @module=sdk.helper_schema tf_mux_provider=*schema.GRPCProviderServer tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_id=2634bc46-bb66-3d22-528d-d2eaf8165f52 tf_resource_type=aws_subnet tf_rpc=ApplyResourceChange @caller=github.com/hashicorp/terraform-plugin-sdk/v2@v2.24.1/helper/schema/resource.go:838 timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] maybeTainted: module.vpc["datacenter1"].aws_internet_gateway.this[0] encountered an error during creation, so it is now marked as tainted
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Received downstream response: tf_req_id=2634bc46-bb66-3d22-528d-d2eaf8165f52 tf_resource_type=aws_subnet tf_rpc=ApplyResourceChange diagnostic_error_count=1 diagnostic_warning_count=0 tf_proto_version=5.3 tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_duration_ms=5 @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/internal/tf5serverlogging/downstream_request.go:37 @module=sdk.proto timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [ERROR] provider.terraform-provider-aws_v4.46.0_x5: Response contains error diagnostic: tf_provider_addr=registry.terraform.io/hashicorp/aws tf_resource_type=aws_subnet @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/internal/diag/diagnostics.go:55 diagnostic_detail= diagnostic_severity=ERROR diagnostic_summary="error creating EC2 Subnet: failed to refresh cached credentials, failed to load credentials, deserialization failed, failed to decode error message, EOF" tf_proto_version=5.3 @module=sdk.proto tf_req_id=2634bc46-bb66-3d22-528d-d2eaf8165f52 tf_rpc=ApplyResourceChange timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Served request: tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_id=2634bc46-bb66-3d22-528d-d2eaf8165f52 @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/tf5server/server.go:831 @module=sdk.proto tf_proto_version=5.3 tf_resource_type=aws_subnet tf_rpc=ApplyResourceChange timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [DEBUG] provider.terraform-provider-aws_v4.46.0_x5: [DEBUG] [aws-sdk-go] DEBUG: Sign Request ec2/CreateInternetGateway failed, not retrying, error failed to refresh cached credentials, failed to load credentials, deserialization failed, failed to decode error message, EOF
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Called downstream: @caller=github.com/hashicorp/terraform-plugin-sdk/v2@v2.24.1/helper/schema/resource.go:838 tf_mux_provider=*schema.GRPCProviderServer tf_provider_addr=registry.terraform.io/hashicorp/aws tf_rpc=ApplyResourceChange @module=sdk.helper_schema tf_req_id=6bf5c123-55fb-d840-bbf4-d84f472bd996 tf_resource_type=aws_internet_gateway timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Received downstream response: tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_duration_ms=1 tf_req_id=6bf5c123-55fb-d840-bbf4-d84f472bd996 tf_resource_type=aws_internet_gateway tf_rpc=ApplyResourceChange diagnostic_error_count=1 @module=sdk.proto diagnostic_warning_count=0 tf_proto_version=5.3 @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/internal/tf5serverlogging/downstream_request.go:37 timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [ERROR] provider.terraform-provider-aws_v4.46.0_x5: Response contains error diagnostic: diagnostic_detail= diagnostic_severity=ERROR diagnostic_summary="error creating EC2 Internet Gateway: failed to refresh cached credentials, failed to load credentials, deserialization failed, failed to decode error message, EOF" tf_proto_version=5.3 tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_id=6bf5c123-55fb-d840-bbf4-d84f472bd996 @module=sdk.proto tf_resource_type=aws_internet_gateway tf_rpc=ApplyResourceChange @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/internal/diag/diagnostics.go:55 timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Served request: tf_resource_type=aws_internet_gateway tf_proto_version=5.3 tf_rpc=ApplyResourceChange @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/tf5server/server.go:831 @module=sdk.proto tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_id=6bf5c123-55fb-d840-bbf4-d84f472bd996 timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState to workingState for module.vpc["datacenter1"].aws_internet_gateway.this[0]
2022-12-15T00:16:20.800Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState: removing state object for module.vpc["datacenter1"].aws_internet_gateway.this[0]
2022-12-15T00:16:20.800Z [TRACE] evalApplyProvisioners: module.vpc["datacenter1"].aws_internet_gateway.this[0] is tainted, so skipping provisioning
2022-12-15T00:16:20.800Z [TRACE] maybeTainted: module.vpc["datacenter1"].aws_internet_gateway.this[0] was already tainted, so nothing to do
2022-12-15T00:16:20.800Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState to workingState for module.vpc["datacenter1"].aws_internet_gateway.this[0]
2022-12-15T00:16:20.800Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState: removing state object for module.vpc["datacenter1"].aws_internet_gateway.this[0]
2022-12-15T00:16:20.801Z [TRACE] maybeTainted: module.vpc["datacenter2"].aws_subnet.public[0] encountered an error during creation, so it is now marked as tainted
2022-12-15T00:16:20.801Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState to workingState for module.vpc["datacenter2"].aws_subnet.public[0]
2022-12-15T00:16:20.801Z [TRACE] statemgr.Filesystem: have already backed up original terraform.tfstate to terraform.tfstate.backup on a previous write
2022-12-15T00:16:20.801Z [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 318
2022-12-15T00:16:20.801Z [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate

Terraform Configuration Files

main.tf

module "vpc" {
  source          = "terraform-aws-modules/vpc/aws"

  for_each = var.oregon_dcs
    providers = {
      aws = aws.usw2  # Set region via provider alias
    }
    name              = each.value.region_dc
    cidr              = each.value.cidr
    azs               = each.value.az_list
    private_subnets   = [each.value.priv_subnet]
    public_subnets    = [each.value.publ_subnet]
    enable_ipv6             = false
    enable_nat_gateway      = true
    one_nat_gateway_per_az  = false # one_nat=false&single_nat=true =>single NATGW
    single_nat_gateway      = true  # one_nat=true&single_nat=false => one NATGW per AZ
}

vars.tf

variable "oregon_dcs" {
    description = "DC parms for Oregon VPCs"
    type        = map(any)

    default = {
        datacenter1 = {
            #aws_region = "us-west-2"
            region_dc   = "oregon-dc1"
            cidr        = "192.168.0.0/16"
            az_list     = ["us-west-2a","us-west-2b"]
            publ_subnet = "192.168.1.0/24"
            priv_subnet = "192.168.2.0/24"       
        },   
        datacenter2 = {
            #aws_region = "us-west-2"
            region_dc   = "oregon-dc2"
            cidr        = "192.168.0.0/16"
            az_list     = ["us-west-2a","us-west-2b"]
            publ_subnet = "192.168.3.0/24"
            priv_subnet = "192.168.4.0/24"       
        }   
    }   
}

provider.tf

provider "aws" {
  alias = "usw2"
  region = "us-west-2"

  default_tags {
    tags = {
      Environment = "via-terraform"
      Owner = "via-terraform"
    }
  }
}

Steps to Reproduce

  1. I am running terraform from AWS Cloudshell (linux_amd64).
  2. Run the attached .tf files, the failure doesn't always occur but usually fails within 3 or 4 runs.

Debug Output

022-12-15T00:16:20.800Z [DEBUG] provider.terraform-provider-aws_v4.46.0_x5: [DEBUG] [aws-sdk-go] DEBUG: Sign Request ec2/CreateSubnet failed, not retrying, error failed to refresh cached credentials, failed to load credentials, deserialization failed, failed to decode error message, EOF
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Called downstream: @module=sdk.helper_schema tf_mux_provider=*schema.GRPCProviderServer tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_id=2634bc46-bb66-3d22-528d-d2eaf8165f52 tf_resource_type=aws_subnet tf_rpc=ApplyResourceChange @caller=github.com/hashicorp/terraform-plugin-sdk/v2@v2.24.1/helper/schema/resource.go:838 timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] maybeTainted: module.vpc["datacenter1"].aws_internet_gateway.this[0] encountered an error during creation, so it is now marked as tainted
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Received downstream response: tf_req_id=2634bc46-bb66-3d22-528d-d2eaf8165f52 tf_resource_type=aws_subnet tf_rpc=ApplyResourceChange diagnostic_error_count=1 diagnostic_warning_count=0 tf_proto_version=5.3 tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_duration_ms=5 @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/internal/tf5serverlogging/downstream_request.go:37 @module=sdk.proto timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [ERROR] provider.terraform-provider-aws_v4.46.0_x5: Response contains error diagnostic: tf_provider_addr=registry.terraform.io/hashicorp/aws tf_resource_type=aws_subnet @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/internal/diag/diagnostics.go:55 diagnostic_detail= diagnostic_severity=ERROR diagnostic_summary="error creating EC2 Subnet: failed to refresh cached credentials, failed to load credentials, deserialization failed, failed to decode error message, EOF" tf_proto_version=5.3 @module=sdk.proto tf_req_id=2634bc46-bb66-3d22-528d-d2eaf8165f52 tf_rpc=ApplyResourceChange timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Served request: tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_id=2634bc46-bb66-3d22-528d-d2eaf8165f52 @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/tf5server/server.go:831 @module=sdk.proto tf_proto_version=5.3 tf_resource_type=aws_subnet tf_rpc=ApplyResourceChange timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [DEBUG] provider.terraform-provider-aws_v4.46.0_x5: [DEBUG] [aws-sdk-go] DEBUG: Sign Request ec2/CreateInternetGateway failed, not retrying, error failed to refresh cached credentials, failed to load credentials, deserialization failed, failed to decode error message, EOF
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Called downstream: @caller=github.com/hashicorp/terraform-plugin-sdk/v2@v2.24.1/helper/schema/resource.go:838 tf_mux_provider=*schema.GRPCProviderServer tf_provider_addr=registry.terraform.io/hashicorp/aws tf_rpc=ApplyResourceChange @module=sdk.helper_schema tf_req_id=6bf5c123-55fb-d840-bbf4-d84f472bd996 tf_resource_type=aws_internet_gateway timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Received downstream response: tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_duration_ms=1 tf_req_id=6bf5c123-55fb-d840-bbf4-d84f472bd996 tf_resource_type=aws_internet_gateway tf_rpc=ApplyResourceChange diagnostic_error_count=1 @module=sdk.proto diagnostic_warning_count=0 tf_proto_version=5.3 @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/internal/tf5serverlogging/downstream_request.go:37 timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [ERROR] provider.terraform-provider-aws_v4.46.0_x5: Response contains error diagnostic: diagnostic_detail= diagnostic_severity=ERROR diagnostic_summary="error creating EC2 Internet Gateway: failed to refresh cached credentials, failed to load credentials, deserialization failed, failed to decode error message, EOF" tf_proto_version=5.3 tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_id=6bf5c123-55fb-d840-bbf4-d84f472bd996 @module=sdk.proto tf_resource_type=aws_internet_gateway tf_rpc=ApplyResourceChange @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/internal/diag/diagnostics.go:55 timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] provider.terraform-provider-aws_v4.46.0_x5: Served request: tf_resource_type=aws_internet_gateway tf_proto_version=5.3 tf_rpc=ApplyResourceChange @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/tf5server/server.go:831 @module=sdk.proto tf_provider_addr=registry.terraform.io/hashicorp/aws tf_req_id=6bf5c123-55fb-d840-bbf4-d84f472bd996 timestamp=2022-12-15T00:16:20.799Z
2022-12-15T00:16:20.800Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState to workingState for module.vpc["datacenter1"].aws_internet_gateway.this[0]
2022-12-15T00:16:20.800Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState: removing state object for module.vpc["datacenter1"].aws_internet_gateway.this[0]
2022-12-15T00:16:20.800Z [TRACE] evalApplyProvisioners: module.vpc["datacenter1"].aws_internet_gateway.this[0] is tainted, so skipping provisioning
2022-12-15T00:16:20.800Z [TRACE] maybeTainted: module.vpc["datacenter1"].aws_internet_gateway.this[0] was already tainted, so nothing to do
2022-12-15T00:16:20.800Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState to workingState for module.vpc["datacenter1"].aws_internet_gateway.this[0]
2022-12-15T00:16:20.800Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState: removing state object for module.vpc["datacenter1"].aws_internet_gateway.this[0]
2022-12-15T00:16:20.801Z [TRACE] maybeTainted: module.vpc["datacenter2"].aws_subnet.public[0] encountered an error during creation, so it is now marked as tainted
2022-12-15T00:16:20.801Z [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState to workingState for module.vpc["datacenter2"].aws_subnet.public[0]
2022-12-15T00:16:20.801Z [TRACE] statemgr.Filesystem: have already backed up original terraform.tfstate to terraform.tfstate.backup on a previous write
2022-12-15T00:16:20.801Z [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 318
2022-12-15T00:16:20.801Z [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate

Panic Output

No response

Important Factoids

The problem only occurs for me running for_each, as shown in the attached scripts.

References

https://gist.github.com/dedeen/bf1c23cd2cd8d29ecd0a1fa0ff6d8e0c

Would you like to implement a fix?

None

github-actions[bot] commented 1 year ago

Community Note

Voting for Prioritization

Volunteering to Work on This Issue

dedeen commented 1 year ago

I've found that this also happens on refresh occasionally, and retrying the refresh typically works around the issue.

justinretzolk commented 1 year ago

Hey @dedeen πŸ‘‹ Thank you for taking the time to raise this! Based on the error, it seems like the credentials being used to authenticate the AWS provider are becoming invalid. With that and that you mentioned you're running in Cloudshell, I'm wondering if your Cloudshell session is timing out due to user inactivity (running processes like Terraform won't count as user activity). That would also explain why this seems to occur when using for_each or refreshes -- things that can wind up making plans and applies run a little longer at times.

dedeen commented 1 year ago

Hi Justin, Thanks for taking a look. I'm pretty sure that my session isn't timing out as I am watching the output of the apply in real time and the build only takes a few minutes. I can up-arrow and rerun the script, sometimes it completes without issue the second or third time.

justinretzolk commented 1 year ago

Hey @dedeen πŸ‘‹ Thank you for the update. If possible, can you supply full debug logs (redacted as needed) as well?

dedeen commented 1 year ago

Hi, The debug logs are at this link (noted on the original bug). Please let me know if you need additional information. https://gist.github.com/dedeen/bf1c23cd2cd8d29ecd0a1fa0ff6d8e0c

gdavison commented 1 year ago

Hi @dedeen, the debug log is cut off. Could you please include the full log from the beginning?

Also, what happens if you don't use for_each, but call the module twice, once for each item in oregon_dcs?