private_nat_gateway fails to create in time, causing an unrecoverable Terraform failure

ccourtoyaxis commented 3 years ago

Description

We regularly deploy and destroy our VPCs to test the deployment. The deployment fails randomly but frequently (I would estimate more than 50% of the time). The problem is that when I re-run terraform apply to continue where it left off, it complains that the resource already exists! I have to destroy the VPC and start from scratch again (praying that it won't fail this time).

Versions

Terraform: Terraform v1.0.8 on linux_amd64
Provider(s):
- provider registry.terraform.io/hashicorp/aws v3.62.0
Module: 3.7.0

Reproduction

All you need to do is deploy and destroy regularly.

Notes:

This VPC is the source for creating the following resources in the same run:
- AWS Security Groups
- AWS RDS Database
- AWS Route 53 Zone
- AWS ACM Certificate request
- AWS S3 buckets
We use an AWS S3 bucket to store state (created separately).
This task fails on our CI pipeline which starts with no existing .terraform folder. (clean slate at all times)

Code Snippet to Reproduce

VPC Terraform File

module "dev_vpc" {
  source             = "terraform-aws-modules/vpc/aws"
  version            = "3.7.0"
  name               = local.vpc_name
  cidr               = "10.0.0.0/16"
  azs                = local.azs
  private_subnets    = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets     = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  enable_nat_gateway = true

  tags = {
    // This is so kops knows that the VPC resources can be used for k8s
    "kubernetes.io/cluster/${aws_route53_zone.cluster_zone.name}" = "shared"
    "terraform"                                              = true
    "environment"                                            = local.environment
  }

  // Tags required by k8s to launch services on the right subnets
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = true
  }

  public_subnet_tags = {
    "kubernetes.io/role/elb" = true
  }
}

The full Terraform definition is attached:

[vpc_terraform.tar.gz](https://github.com/terraform-aws-modules/terraform-aws-vpc/files/7338289/vpc_terraform.tar.gz)

Expected behavior

I expect to be able to deploy without a glitch. Failures can happen, but when they do, the run should be idempotent: I should be able to continue the execution simply by re-running terraform apply.

Actual behavior

Deployment fails

Terminal Output Screenshot(s)

aws_db_instance.database: Creation complete after 9m49s [id=staging-acs-database]
╷
│ Error: error waiting for Route in Route Table (rtb-000930092d9302f37) with destination (0.0.0.0/0) to become available: couldn't find resource (21 retries)
│ 
│   with module.dev_vpc.aws_route.private_nat_gateway[0],
│   on .terraform/modules/dev_vpc/main.tf line 1118, in resource "aws_route" "private_nat_gateway":
│ 1118: resource "aws_route" "private_nat_gateway" {
│ 
╵
╷
│ Error: error waiting for Route in Route Table (rtb-07c024d01f103ab59) with destination (0.0.0.0/0) to become available: couldn't find resource (21 retries)
│ 
│   with module.dev_vpc.aws_route.private_nat_gateway[2],
│   on .terraform/modules/dev_vpc/main.tf line 1118, in resource "aws_route" "private_nat_gateway":
│ 1118: resource "aws_route" "private_nat_gateway" {
│ 
╵
╷
│ Error: error waiting for Route in Route Table (rtb-0fec27fb9a061c8e8) with destination (0.0.0.0/0) to become available: couldn't find resource (21 retries)
│ 
│   with module.dev_vpc.aws_route.private_nat_gateway[1],
│   on .terraform/modules/dev_vpc/main.tf line 1118, in resource "aws_route" "private_nat_gateway":
│ 1118: resource "aws_route" "private_nat_gateway" {
│ 
╵
╷
│ Error: error reading Route Table Association (rtbassoc-0016d41124c33c00c): empty result
│ 
│   with module.dev_vpc.aws_route_table_association.public[2],
│   on .terraform/modules/dev_vpc/main.tf line 1212, in resource "aws_route_table_association" "public":
│ 1212: resource "aws_route_table_association" "public" {
│ 
╵

Additional context

We use the basic AWS support plan (i.e. no support). Could this mean that the Service Level Agreement allows latencies that exceed what the module tolerates? Either way, it takes a very long time to create Route Tables.

antonbabenko commented 3 years ago

I don't think it is related to the AWS support plan but it sometimes helps to ask AWS support directly if they can see something on their end.

From the code point of view, I don't see anything unusual. I recommend reducing the number of resources to make the issue easier to find, and checking AWS service limits (the EIP limit is often the reason).

Try specifying single_nat_gateway = true to avoid hitting the EIP limit.

If the problem is reproducible and is related to this module, please provide a small piece of code that triggers the problem, along with the console output.

ccourtoyaxis commented 3 years ago

OK, thanks for the tip. I am far from my Elastic IP quota though. Also, the gateways are created in the end, so either Terraform is unable to retrieve the resource status or they just take too long to create. Is there a way to change the timeouts in the module?

Regardless, I will implement a condition to limit the number of Elastic IPs for our feature development deployments, as opposed to staging and production.
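
One possible shape for such a condition, as a rough sketch only: it reuses the module block from the report and toggles single_nat_gateway based on local.environment. The local full_nat_environments and the environment names in it are hypothetical, and the sketch assumes local.environment holds values such as "staging" or "production".

```hcl
locals {
  # Hypothetical list of environments that keep one NAT gateway per private subnet.
  full_nat_environments = ["staging", "production"]
}

module "dev_vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.7.0"

  name            = local.vpc_name
  cidr            = "10.0.0.0/16"
  azs             = local.azs
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true

  # One shared NAT gateway (and one Elastic IP) everywhere except the
  # environments listed above. Tags from the original block are omitted here.
  single_nat_gateway = !contains(local.full_nat_environments, local.environment)
}
```

With this in place, feature deployments would request a single Elastic IP, while staging and production keep the original layout.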

antonbabenko commented 3 years ago

We had this kind of issue with someone who was recreating VPC resources in a CI/CD pipeline multiple times a day, but that was in 2018 and it has not come up since. The solution (timeouts { create = "5m" }) has been in this module since then.
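
For readers unfamiliar with the setting, this is roughly what such a timeouts block looks like on an aws_route resource. It is an illustrative standalone sketch, not the module's exact source: the references aws_route_table.private and aws_nat_gateway.this are hypothetical and assumed to be defined elsewhere.

```hcl
resource "aws_route" "private_nat_gateway" {
  # Hypothetical references; the module's real expressions differ.
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this.id

  # Wait up to five minutes for the route to become available
  # before Terraform reports the creation as failed.
  timeouts {
    create = "5m"
  }
}
```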

antonbabenko commented 3 years ago

I found the related issue and fix in the recent release of Terraform AWS provider: https://github.com/hashicorp/terraform-provider-aws/issues/19985 https://github.com/hashicorp/terraform-provider-aws/pull/21161

Could you try the previous or the latest version of the Terraform AWS provider to see if the problem is fixed? It is likely related to the provider and not to the module.
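
A minimal sketch of how the provider version could be constrained for such a test. The constraint below is a placeholder chosen only because the report uses v3.62.0; this thread does not state which release contains the fix.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"

      # Placeholder: selects a release older than the v3.62.0 from the report.
      # To try a newer release instead, swap in a constraint such as "> 3.62.0".
      version = "< 3.62.0"
    }
  }
}
```

If a dependency lock file is committed, terraform init -upgrade is typically needed for a changed constraint to take effect.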

ccourtoyaxis commented 3 years ago

I found the following open issue with the provider. I think this is the root cause of the problem:

https://github.com/hashicorp/terraform-provider-aws/issues/21032

antonbabenko commented 3 years ago

Yes, looks like it. There is already #701 which can be extended to have longer timeouts for other resources, too. Will you be able to chime in and update that PR?

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
