Open speller opened 1 year ago
I checked one of the recent failing deployments and found that the issue happens during initial resource creation. Below are log excerpts from a run against an empty state (a completely new deployment).
Everything looks good in the plan:
# module.service-bff.module.instance.aws_autoscaling_policy.cpu_utilization[0] will be created
resource "aws_autoscaling_policy" "cpu_utilization" {
arn = (known after apply)
autoscaling_group_name = (known after apply)
enabled = true
estimated_instance_warmup = 120
id = (known after apply)
metric_aggregation_type = (known after apply)
name = "CPUUtilization"
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
disable_scale_in = false
target_value = 70
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
}
}
# module.service-bff.module.instance.aws_autoscaling_policy.cpu_utilization_int[0] will be created
resource "aws_autoscaling_policy" "cpu_utilization_int" {
arn = (known after apply)
autoscaling_group_name = (known after apply)
enabled = true
estimated_instance_warmup = 120
id = (known after apply)
metric_aggregation_type = (known after apply)
name = "CPUUtilizationInt"
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
disable_scale_in = false
target_value = 70
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
}
}
# module.service-bff.module.instance.aws_autoscaling_policy.requests_count[0] will be created
resource "aws_autoscaling_policy" "requests_count" {
arn = (known after apply)
autoscaling_group_name = (known after apply)
enabled = true
estimated_instance_warmup = 120
id = (known after apply)
metric_aggregation_type = (known after apply)
name = "AVGRequestsCount"
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
disable_scale_in = false
target_value = 21
predefined_metric_specification {
predefined_metric_type = "ALBRequestCountPerTarget"
resource_label = (known after apply)
}
}
}
# module.service-bff.module.instance.aws_autoscaling_policy.requests_count_int[0] will be created
resource "aws_autoscaling_policy" "requests_count_int" {
arn = (known after apply)
autoscaling_group_name = (known after apply)
enabled = true
estimated_instance_warmup = 120
id = (known after apply)
metric_aggregation_type = (known after apply)
name = "AVGRequestsCountInt"
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
disable_scale_in = false
target_value = 21
predefined_metric_specification {
predefined_metric_type = "ALBRequestCountPerTarget"
resource_label = (known after apply)
}
}
}
During the apply, it looks OK at first:
module.service-bff.module.instance.aws_autoscaling_policy.requests_count_int[0]: Creating...
module.service-bff.module.instance.aws_autoscaling_policy.cpu_utilization_int[0]: Creating...
module.service-bff.module.instance.aws_autoscaling_policy.cpu_utilization[0]: Creating...
module.service-bff.module.instance.aws_autoscaling_policy.requests_count[0]: Creating...
But then only three report success:
module.service-bff.module.instance.aws_autoscaling_policy.cpu_utilization[0]: Creation complete after 1s [id=CPUUtilization]
module.service-bff.module.instance.aws_autoscaling_policy.requests_count[0]: Creation complete after 1s [id=AVGRequestsCount]
module.service-bff.module.instance.aws_autoscaling_policy.requests_count_int[0]: Creation complete after 1s [id=AVGRequestsCountInt]
And eventually the following error is thrown:
│ Error: creating Auto Scaling Policy (CPUUtilizationInt): ValidationError: Only one TargetTrackingScaling policy for a given metric specification is allowed.
│ status code: 400, request id: 26e35a65-648d-4ac8-a45e-d93f85ab2579
│
│ with module.service-bff.module.instance.aws_autoscaling_policy.cpu_utilization_int[0],
│ on .terraform/modules/service-bff.instance/main.tf line 77, in resource "aws_autoscaling_policy" "cpu_utilization_int":
│ 77: resource "aws_autoscaling_policy" "cpu_utilization_int" {
│
Here is the issue: somewhere the target tracking policy is created twice, which makes TF fail on an already existing resource. Nothing except TF could have created this resource, because the deployment started from an empty state.
When I check the problematic autoscaling group, I can see three policies created. I delete them all and on the next run TF succeeds.
This is probably a problem in our configuration, but there also seems to be a bug on the AWS side that allows creating two ASGAverageCPUUtilization policies under some circumstances. Our autoscaling groups are targeted by two load balancers, and for ALBRequestCountPerTarget it is okay to have two policies. ASGAverageCPUUtilization does not depend on load balancers, so there should be only one such policy, but I copy-pasted the config and accidentally duplicated it. Somehow this has worked for a long time in the majority of cases.
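A minimal sketch of how the policies could be laid out so that each metric specification appears only once per autoscaling group. The real module source is not shown in this issue, so the variable names (asg_name, public_alb_resource_label, internal_alb_resource_label) are hypothetical, and the values are copied from the plan output above. The idea is to keep a single ASGAverageCPUUtilization policy and to keep the two ALBRequestCountPerTarget policies, which stay distinct because each has its own resource_label.

```hcl
# Hypothetical inputs: the actual module's variable names are not shown in the issue.
variable "asg_name" { type = string }
variable "public_alb_resource_label" { type = string }
variable "internal_alb_resource_label" { type = string }

# Single CPU policy. ASGAverageCPUUtilization takes no resource_label, so a second
# policy with this metric would have an identical metric specification.
resource "aws_autoscaling_policy" "cpu_utilization" {
  name                      = "CPUUtilization"
  autoscaling_group_name    = var.asg_name
  policy_type               = "TargetTrackingScaling"
  estimated_instance_warmup = 120

  target_tracking_configuration {
    target_value     = 70
    disable_scale_in = false

    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
  }
}

# Request-count policy for the first load balancer.
resource "aws_autoscaling_policy" "requests_count" {
  name                      = "AVGRequestsCount"
  autoscaling_group_name    = var.asg_name
  policy_type               = "TargetTrackingScaling"
  estimated_instance_warmup = 120

  target_tracking_configuration {
    target_value     = 21
    disable_scale_in = false

    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = var.public_alb_resource_label
    }
  }
}

# Request-count policy for the second load balancer. The different resource_label
# makes this a different metric specification from the policy above.
resource "aws_autoscaling_policy" "requests_count_int" {
  name                      = "AVGRequestsCountInt"
  autoscaling_group_name    = var.asg_name
  policy_type               = "TargetTrackingScaling"
  estimated_instance_warmup = 120

  target_tracking_configuration {
    target_value     = 21
    disable_scale_in = false

    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = var.internal_alb_resource_label
    }
  }
}
```

With this layout the cpu_utilization_int resource is dropped entirely, so Terraform never sends a second CPU policy for the same group and the ValidationError shown above should no longer have anything to trigger on.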
Here is an example of a working case when all four are created:
Terraform Core Version
1.5.0
AWS Provider Version
4.67.0
Affected Resource(s)
aws_autoscaling_policy
Expected Behavior
aws_autoscaling_policy resources are handled properly.
Actual Behavior
I'm getting the following errors randomly:
Relevant Error/Panic Output Snippet
No response
Terraform Configuration Files
I have the following configurations for autoscaling policies:
Steps to Reproduce
Unknown
Debug Output
No response
Panic Output
No response
Important Factoids
The issue happens randomly when I'm updating an existing configuration. During the plan, it looks like one of the policies is skipped and not refreshed:
Then, it plans to add the "missing" policy:
On apply, it fails with the error above because the policy already exists.
I don't make manual changes to the infrastructure, and the issue happens randomly: with the same configuration, sometimes everything is fine and sometimes it is not.
How to fix it?
References
No response
Would you like to implement a fix?
None