HashiBeliver opened this issue 3 years ago
@HashiBeliver yes! Links to GitHub projects are a great way to share reproduction cases. As an example, https://github.com/danieldreier/terraform-issue-reproductions is my personal collection of reproduction cases.
Regarding the crash log, my recommendation is to only use private keys that you can immediately revoke before sharing it, so that you don't need to worry about encrypting the log. If you prefer to encrypt it, there's a PGP key at https://www.hashicorp.com/security that you can use, and we can decrypt it on our side. Even so, you should still not share particularly sensitive values this way.
@danieldreier Hey, thanks for replying. I'm a bit new, so I accidentally closed and reopened the issue :( hopefully that didn't kick me down the list. Anyway, there are no sensitive keys used in the project, and I've uploaded the gist with more details along with the GitHub project.
So what do you think? :)
@HashiBeliver We also encountered this. In our case, it seems to come from the fact that the AWS Go SDK does its own retries, but the AWS Terraform provider gives up on the request and starts a new one. In our test case, the issue went away when we changed the timeout here: https://github.com/hashicorp/terraform-provider-aws/blob/c63960b8b9bc6a8890dced588075d480ed09cefd/aws/resource_aws_instance.go#L630
Simply changing:
err = resource.Retry(30*time.Second, func() *resource.RetryError {
To:
err = resource.Retry(30*time.Minute, func() *resource.RetryError {
Seems to fix it.
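For anyone not familiar with that helper, here's a minimal, self-contained sketch of the pattern, assuming the terraform-plugin-sdk v1 helper/resource API (launchInstance is a hypothetical stand-in for the provider's real RunInstances call, not actual provider code):

package main

import (
	"errors"
	"log"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/helper/resource"
)

var attempts int

// launchInstance is a stand-in for the real AWS call; it fails with a
// throttling-style error a couple of times to exercise the retryable path.
func launchInstance() error {
	attempts++
	if attempts < 3 {
		return errors.New("RequestLimitExceeded: request rate exceeded")
	}
	return nil
}

func main() {
	// The first argument is the total window resource.Retry will keep
	// re-invoking the function before giving up with a timeout error.
	err := resource.Retry(30*time.Minute, func() *resource.RetryError {
		if err := launchInstance(); err != nil {
			// Returning RetryableError tells the helper to wait and call
			// the function again within the window above.
			return resource.RetryableError(err)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}

The duration passed as the first argument is the total window the helper keeps re-invoking the function, so going from 30*time.Second to 30*time.Minute only widens that window; it does not change which errors are treated as retryable.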
This is mainly triggered when trying to create a lot of resources, with high parallelism, and hitting AWS's rate limiting.
We will open a new bug on the provider for this, but this is a very nasty bug.
@danieldreier WDYT?
Changing the retry period to something significantly larger than 30 seconds means that retries won't happen even if the AWS API asks us to retry (a 5xx error, for example), right?
Here is a short configuration that reproduces the issue by creating 225 instances instead of 200:
provider "aws" {
region = "eu-west-1"
}
resource "aws_instance" "ec2_instance" {
count = 200
ami = data.aws_ami.amazon_linux.id
instance_type = "t3.nano"
subnet_id = tolist(data.aws_subnet_ids.all.ids)[0]
tags = { Name : count.index}
}
data "aws_vpc" "default" { default = true }
data "aws_subnet_ids" "all" {
vpc_id = data.aws_vpc.default.id
}
data "aws_ami" "amazon_linux" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = [
"amzn-ami-hvm-*-x86_64-gp2",
]
}
filter {
name = "owner-alias"
values = [
"amazon",
]
}
}
@antonbabenko I'm not sure that's correct. First, the SDK itself retries transient errors (which is basically the root cause here), and second, this change doesn't alter the logic of when to retry, only how long the provider keeps retrying before giving up.
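To illustrate the first point: the AWS Go SDK (v1) carries its own retryer for throttling and 5xx responses, configured independently of the provider's resource.Retry window. A minimal sketch, assuming aws-sdk-go v1 (the MaxRetries value here is illustrative, not what the provider actually ships with):

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// The SDK-level retryer transparently retries throttling errors such as
	// RequestLimitExceeded and 5xx responses, up to MaxRetries attempts,
	// regardless of what the provider's resource.Retry wrapper does.
	sess := session.Must(session.NewSession(&aws.Config{
		Region:     aws.String("eu-west-1"),
		MaxRetries: aws.Int(5),
	}))
	svc := ec2.New(sess)
	_ = svc // a real provider would call svc.RunInstances here

	fmt.Println("configured region:", aws.StringValue(sess.Config.Region))
}

In other words, when to retry is governed by the SDK's retryer and by the RetryableError/NonRetryableError classification inside the retry function, while the duration passed to resource.Retry only bounds how long the provider keeps trying.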
Anyway, I suggest we move the discussion to hashicorp/terraform-provider-aws#17638, since this seems to be a provider bug.