hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io

Terraform launches more EC2 instances than planned, and it seems to lose track of some of them #27093

Open · HashiBeliver opened this issue 3 years ago

HashiBeliver commented 3 years ago

Terraform Version

Terraform v0.13.5

Terraform Configuration Files

Here is the GitHub project:
https://github.com/HashiBeliver/terraform_bug

Debug Output

Crash Output

I didn't run it with the trace environment variable, but here is a link to a crash.log gist. Is that okay?

https://gist.github.com/HashiBeliver/ced5553b228b6fa000bbafe042cde4a4

Expected Behavior

The plan Terraform printed to the screen should have been carried out as written: deploying every infrastructure component, especially the EC2 instances.

Actual Behavior

It sometimes works as expected and sometimes fails. Failing means it keeps printing EC2 creation messages ("still creating...") until I interrupt it, at which point a crash log is generated.
There are usually two outcomes when it fails:
1. EC2 instance requests hang in the "still creating" state and the instances are simply never created.
2. EC2 instance requests hang in the "still creating" state, but a new duplicate of the hanging instance is created in the background. This results in rogue infrastructure that Terraform loses management of, since the state registers the instance only once, as only one should have been created. So Terraform deploys more EC2 instances than requested and loses track of the extras.

Steps to Reproduce

Since it sometimes works and sometimes doesn't, and the failures differ between runs, I can only give some instructions and hope the problem reproduces on your end as well:

git clone https://github.com/HashiBeliver/terraform_bug

In the cloned directory, create a file named terraform.tfvars containing:

aws_provider_main_region = "" # any AWS region, e.g. eu-west-1
aws_credentials_profile  = "" # name of an AWS profile in ~/.aws that holds credentials for the deployment
vpc_id                   = "" # a VPC ID in the chosen region

aws_keypair = "" # an AWS key pair that exists in the region
subnet_ids  = [""] # a subnet ID in the chosen VPC
ssh_ips     = ["10.10.10.10/32"] # allowed SSH IPs

es_initial_master_nodes_amount   = 1
es_dedicated_master_nodes_amount = 1
es_data_master_nodes_amount      = 1
es_dedicated_data_nodes_amount   = 4

Save the file, then run:

terraform init
terraform apply

Additional Context

References

danieldreier commented 3 years ago

@HashiBeliver yes! Links to GitHub projects are a great way to share reproduction cases. As an example, https://github.com/danieldreier/terraform-issue-reproductions is my personal collection of reproduction cases.

Regarding a crash log, my recommendation is to use only private keys that you can revoke immediately before sharing the crash log, so that you don't need to worry about encrypting it. If you prefer to encrypt it, there's a PGP key at https://www.hashicorp.com/security that you can use, and we can decrypt it later. Even so, you should still not share particularly sensitive values this way.

HashiBeliver commented 3 years ago

@danieldreier Hey, thanks for replying. I'm a bit new, so I accidentally closed and reopened the issue :( hopefully that didn't kick me down the list. Anyway, no sensitive keys are used in the project, and I have uploaded the gist, provided more details, and linked the GitHub project.

So what do you think? :)

imriz commented 3 years ago

@HashiBeliver We also encountered this. In our case, this seems to come from the fact that the AWS Go SDK does retries, but the AWS Terraform provider will give up on the request, and start a new one. In our test case, the issue was "gone" when we changed the timeout here: https://github.com/hashicorp/terraform-provider-aws/blob/c63960b8b9bc6a8890dced588075d480ed09cefd/aws/resource_aws_instance.go#L630

Simply changing:

err = resource.Retry(30*time.Second, func() *resource.RetryError {

to:

err = resource.Retry(30*time.Minute, func() *resource.RetryError {

seems to fix it.

This is mainly triggered when trying to create a lot of resources, with high parallelism, and hitting AWS's rate limiting.
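
For context, here is a minimal Go sketch of the retry wrapper involved, assuming the Terraform Plugin SDK's helper/resource package. The launchInstance function and its simulated throttling are hypothetical stand-ins for the provider's real EC2 call, not actual provider code; only the Retry/RetryableError usage reflects the real SDK API:

package main

import (
    "errors"
    "fmt"
    "time"

    "github.com/hashicorp/terraform-plugin-sdk/helper/resource"
)

// launchInstance is a hypothetical stand-in for the provider's EC2 API call;
// here it simulates being throttled twice before succeeding.
var attempts int

func launchInstance() error {
    attempts++
    if attempts < 3 {
        return errors.New("RequestLimitExceeded: throttled by AWS")
    }
    return nil
}

func main() {
    // resource.Retry keeps invoking the callback until it returns nil, a
    // non-retryable error, or the overall timeout elapses. With the original
    // 30*time.Second timeout, a request still being throttled when the
    // timeout expires is abandoned by the provider, even though the AWS Go
    // SDK's own retries can still complete it server-side; that is how an
    // instance can end up existing without being recorded in state.
    err := resource.Retry(30*time.Minute, func() *resource.RetryError {
        if err := launchInstance(); err != nil {
            // Mark throttling as retryable so the same logical request is
            // retried instead of a brand-new one being started.
            return resource.RetryableError(err)
        }
        return nil
    })
    fmt.Println("result:", err, "after", attempts, "attempts")
}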

We will open a new bug on the provider for this, but this is a very nasty bug.

@danieldreier WDYT?

antonbabenko commented 3 years ago

Changing the retry period to something significantly larger than 30 seconds means that retries won't happen even if the AWS API asks us to retry (a 5xx error, for example), right?

Here is a short code sample that creates 225 instances instead of 200:

provider "aws" {
  region = "eu-west-1"
}

resource "aws_instance" "ec2_instance" {
  count = 200

  ami              = data.aws_ami.amazon_linux.id
  instance_type    = "t3.nano"
  subnet_id        = tolist(data.aws_subnet_ids.all.ids)[0]

  tags = { Name : count.index}
}

data "aws_vpc" "default" { default = true }
data "aws_subnet_ids" "all" {
  vpc_id = data.aws_vpc.default.id
}

data "aws_ami" "amazon_linux" {
  most_recent = true

  owners = ["amazon"]

  filter {
    name = "name"

    values = [
      "amzn-ami-hvm-*-x86_64-gp2",
    ]
  }

  filter {
    name = "owner-alias"

    values = [
      "amazon",
    ]
  }
}
imriz commented 3 years ago

@antonbabenko I am not sure that's correct. First, the SDK itself retries transient errors (which is basically the root cause here), and this change doesn't alter the logic of when to retry, only how long the provider keeps retrying before giving up.
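
To illustrate the distinction, here is a minimal sketch of the AWS Go SDK's built-in retry configuration, which operates underneath the provider-level resource.Retry timeout discussed above. The region and MaxRetries values are assumed example settings, not the provider's defaults:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
    // The SDK transparently retries throttling and 5xx responses with
    // exponential backoff, up to MaxRetries, all inside what looks like a
    // single call to the caller. The provider's resource.Retry timeout
    // sits on top of this and decides when to abandon the whole attempt.
    sess := session.Must(session.NewSession(&aws.Config{
        Region:     aws.String("eu-west-1"),
        MaxRetries: aws.Int(10),
    }))

    out, err := ec2.New(sess).DescribeInstances(&ec2.DescribeInstancesInput{})
    if err != nil {
        fmt.Println("error after SDK-level retries:", err)
        return
    }
    fmt.Println("reservations:", len(out.Reservations))
}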

Anyway, I suggest we move the discussion to hashicorp/terraform-provider-aws#17638, since this seems to be a provider bug.