hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws

aws_elasticache_replication_group timeout #8561

Closed Rajendra-Jonnalagadda closed 3 years ago

Rajendra-Jonnalagadda commented 5 years ago

Terraform Version

Terraform v0.11.13
+ provider.aws v2.8.0
+ provider.null v2.1.2
+ provider.random v2.1.2

Affected Resource(s)

aws_elasticache_replication_group

Terraform Configuration Files

resource "aws_elasticache_replication_group" "elascticache_replica" {
  availability_zones = ["us-west-2a", "us-west-2b"]
  replication_group_id          = "dev-cache"
  replication_group_description = "this is a redis for xyz team "
  number_cache_clusters         = "2"
  node_type                     = "cache.t2.small"
  automatic_failover_enabled    = "true"
  at_rest_encryption_enabled = "false"
  transit_encryption_enabled = "true"
  auth_token = "xxxxxxxxxxxxxxxxx"
  auto_minor_version_upgrade = "false"
  engine                 = "redis"
  engine_version         = "5.0.3"
  parameter_group_name   = "default.redis5.0"
  port                   = "6378"
  subnet_group_name      = "elasticache"
  #security_group_names     = ["dev-elasticache"]
  snapshot_arns = ["arn:aws:s3:::<bucket>/elasticache/"]
  maintenance_window     = "sun:05:00-sun:09:00"
  snapshot_window = "01:00-05:00"
  snapshot_retention_limit = "5"
  apply_immediately = "false"
  timeouts {
    create = "60m"
    delete = "60m"
    update = "60m"
  }
}

Expected Behavior

Wait for 60 minutes before timing out.

Actual Behavior

Times out in less than 10 minutes.

module.application-infrastructure.applications.cip_redis.aws_elasticache_replication_group.elascticache_replica: Still creating... (8m20s elapsed)
module.application-infrastructure.applications.cip_redis.aws_elasticache_replication_group.elascticache_replica: Still creating... (8m30s elapsed)

Error: Error applying plan:

1 error(s) occurred:

* module.application-infrastructure.module.applications.module.cip_redis.aws_elasticache_replication_group.elascticache_replica: 1 error(s) occurred:

* aws_elasticache_replication_group.elascticache_replica: Error waiting for elasticache replication group (dev-cache) to be created: unexpected state 'create-failed', wanted target 'available'. last error: %!s(<nil>)

Steps to Reproduce

  1. terraform apply
  2. terraform plan
  3. terraform apply << triggers the problem
AndiDog commented 5 years ago

With Terraform, I bisected and found that transit_encryption_enabled = true alone seems to trigger this issue (even if auth_token isn't specified). For me, creation failure already happens after 30-40 seconds (no timeout).

Minimal example (tested in eu-central-1 region):

resource "aws_elasticache_replication_group" "redis" {
  engine                        = "redis"
  engine_version                = "5.0.4"
  replication_group_id          = "${var.redis_cluster_name}"
  replication_group_description = "nonempty"
  number_cache_clusters         = "3"

  node_type = "cache.t2.micro"

  transit_encryption_enabled = false # changing this to `true` will fail
  at_rest_encryption_enabled = true
}

My version:

Terraform v0.12.8
+ provider.aws v2.28.1
+ provider.helm v0.10.2
+ provider.kubernetes v1.9.0
+ provider.local v1.3.0
+ provider.null v2.1.2
+ provider.template v2.1.2

I managed to reproduce the create-failed state even in the AWS Console, but I wasn't exactly sure which value caused it, and I didn't spend more time trying to narrow it down via the Console or the AWS API... Just keep in mind that this might not be a Terraform-specific issue if my observation holds.

AndiDog commented 5 years ago

For me, it turned out the name must be shorter in order to work around the create-failed state. I played around with toggling all the other parameters I had, and came to this conclusion:

# Names have been changed 😏, but I used the same length as these strings:
replication_group_id = "aabbccdd-staging-abcdefg-cluster-00" # FAILS
replication_group_id = "abstag-abcdefg-cluster-00" # WORKS

That was enough of a workaround to get going.
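
If the ID is derived from a longer name, here is a minimal sketch of the truncation (the variable name and the 25-character cutoff are my assumptions, mirroring the length of the working string above):

variable "redis_cluster_name" {
  type = string
}

locals {
  # Hypothetical: cut a longer cluster name down to 25 characters,
  # the length that worked for me above.
  short_replication_group_id = substr(var.redis_cluster_name, 0, 25)
}

resource "aws_elasticache_replication_group" "redis" {
  replication_group_id          = local.short_replication_group_id
  replication_group_description = "nonempty"
  engine                        = "redis"
  node_type                     = "cache.t2.micro"
  number_cache_clusters         = 3
}

Since substr() just cuts the string, double-check that the shortened ID is still unique in your account and region.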

alepeltier commented 5 years ago

Thanks, @AndiDog, for the fix. Do you have any idea why this is happening, though? I thought ElastiCache allows up to 50 characters for cluster names. https://docs.aws.amazon.com/cli/latest/reference/elasticache/create-cache-cluster.html

AndiDog commented 5 years ago

No clue here. AWS engineers should be able to tell. One possibility is that other objects' names become overlong even when the replication group name is shorter than 50 characters.

finferflu commented 4 years ago

This seems to still be an issue even with replication_group_id set to a 19-character string and with transit_encryption_enabled unset. Are there no updates on this?

EDIT: To give more background on this, I had been using version 1.x of the AWS provider until today and had never seen this issue before (the last successful deployment was on the 5th of March, 2020, and I have not deployed it since). I updated to version 2.70 today, and I'm still seeing this problem. Considering this issue was raised in May 2019, it looks like this might be a new issue caused by a recent change on the AWS side.

gdavison commented 3 years ago

Hi everyone, a few comments on this issue:

  1. This is not encountering a timeout, though it does occur while Terraform is waiting for the ElastiCache Replication Group to finish creation. As the error message notes, the replication group is entering the state create-failed, which indicates an error.
  2. It is likely unrelated to the length of the replication_group_id, since the field can support up to 40 characters (https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.CreatingReplGroup.NoExistingCluster.Classic.html#Replication.CreatingReplGroup.NoExistingCluster.Classic.API); a plan-time check for that limit is sketched just after this list.

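If you want to catch an over-long ID before the API call at all, here is a hedged sketch using a Terraform 0.13+ variable validation (the variable name is hypothetical, and this only enforces the documented 40-character limit, not whatever is actually failing in this issue):

variable "replication_group_id" {
  type        = string
  description = "ID for the ElastiCache replication group"

  validation {
    # Reject IDs longer than the documented 40-character limit at plan time.
    condition     = length(var.replication_group_id) <= 40
    error_message = "The replication_group_id value must be at most 40 characters long."
  }
}
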
This is likely something happening internally on the AWS end. I'm going to close this issue. If you're still encountering this error, please check the ElastiCache "Events" tab or contact AWS Support, and open a new Issue if it is related to Terraform.

ghost commented 3 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!