aws_elasticache_replication_group logic does not permit automatic_failover_enabled when num_of_clusters = 1

DoctorPolski commented 3 years ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

Terraform v0.14.8

provider registry.terraform.io/hashicorp/aws v3.34.0

Affected Resource(s)

aws_elasticache_replication_group

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

resource "aws_elasticache_replication_group" "elasticache_redis" {
  replication_group_id          = "${var.environment}-redis"
  replication_group_description = var.redis_replication_group_description
  node_type                     = var.redis_instance_type
  automatic_failover_enabled    = var.redis_automatic_failover_enabled
  multi_az_enabled              = var.redis_multi_az_enabled
  auto_minor_version_upgrade    = var.redis_auto_minor_version_upgrade
  availability_zones            = data.aws_subnet.redis[*].availability_zone
  at_rest_encryption_enabled    = var.redis_at_rest_encryption_enabled
  transit_encryption_enabled    = var.redis_transit_encryption_enabled
  engine_version                = var.redis_engine_version
  parameter_group_name          = var.redis_parameter_group_name
  port                          = var.redis_port
  subnet_group_name             = aws_elasticache_subnet_group.mds_elasticache_redis_subnet_group.name
  maintenance_window            = var.redis_maintenance_window
  snapshot_window               = var.redis_snapshot_window
  snapshot_retention_limit      = var.redis_snapshot_retention_limit

  security_group_ids = [
    aws_security_group.elasticache_redis_sg.id,
  ]

  cluster_mode {
    replicas_per_node_group = var.redis_replicas_per_node_group
    num_node_groups         = var.redis_num_node_groups
  }
}

Expected Behavior

automatic_failover_enabled = true
multi_az_enabled           = false

cluster_mode {
  replicas_per_node_group = 0
  num_node_groups         = 1
}

With the above config, Terraform should plan.

Actual Behavior

Error: if automatic_failover_enabled is true, number_cache_clusters must be greater than 1

Steps to Reproduce

terraform apply

Important Factoids

This working redis config is currently in place on AWS and has been for well over a year. This was configured via Terraform v0.12 with an obviously much earlier version of the provider.
Ever since the release of provider v3.26.0 Terraform will no longer plan despite this config not having changed.
The config suggestion made by @gdavison in https://github.com/hashicorp/terraform-provider-aws/issues/17376#issuecomment-808596790 is already in place as illustrated above so is of no help here.
As mentioned here https://github.com/hashicorp/terraform-provider-aws/issues/17376#issuecomment-771174117 and here https://github.com/hashicorp/terraform-provider-aws/issues/17376#issuecomment-778522264 the issue remains that Terraform is insisting this:

If automatic_failover_enabled is enabled, number_cache_clusters must be greater than 1

Source: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/elasticache_replication_group#automatic_failover_enabled

Which does not agree with the working config I currently have on AWS: small_env

So I have to question whether or not the assertion made by Terraform is correct. It is my belief that number of clusters DOES NOT need to be > 1 for automatic_failover_enabled to be TRUE.

I agree with @gdavison here https://github.com/hashicorp/terraform-provider-aws/issues/17605#issuecomment-808600211 that it makes no logical sense setting automatic_failover_enabled to true for a single node since indeed where would it failover to. But AWS permits this and therefore it is incorrect for Terraform enforce otherwise.

I have Serverless code dependencies on cluster mode being set so I cannot specify my single node environments without clustering mode enabled.

References

17605
17376
17320

choikijin-techcross commented 3 years ago

I have same problem.

https://github.com/hashicorp/terraform-provider-aws/issues/17605#issuecomment-808600211 https://github.com/hashicorp/terraform-provider-aws/issues/17376#issuecomment-808596790

I saw these 2 comments from related issues, so I feel like the developer of this code predicts that this terraform code works.

automatic_failover_enabled = true
multi_az_enabled           = false

cluster_mode {
  replicas_per_node_group = 0
  num_node_groups         = 1
}

unfortunately, I can't create smallest redis-cluster set @gdavison mentioned. I hope it'll be fixed soon.

gdavison commented 3 years ago

For clarification, is part of the issue that you haven't directly configured number_cache_clusters, yet the error message references it? If so, I can try to improve the error message. We currently allow configuring non-cluster mode replication groups using either number_cache_clusters or cluster_mode to make development and production configurations easier. number_cache_clusters == cluster_mode.num_node_groups * (1 + cluster_mode.replicas_per_node_group).

For your existing replication group, can you add a screenshot of the current parameters? In the list of replication groups, click on the triangle to show this. I'm wondering if there's an unexpected configuration, especially since AWS changed how some settings are interpreted.

choikijin-techcross commented 3 years ago

@gdavison I didn't configure number_cache_clusters and the error message is Error: if automatic_failover_enabled is true, number_cache_clusters must be greater than 1.

My setting is same as @DoctorPolski, so

cluster_mode.num_node_groups = 1, cluster_mode.replicas_per_node_group = 0, number_cache_clusters == 1 * (1 + 0) = 1

and it's not greater than 1. So I think the calculation is working, but terraform doesn't allow me to use a smallest set of a cluster.

I didn't add number_cache_clusters in my code, but terraform plan is showing number_cache_clusters on the result. Maybe terraform is putting number_cache_clusters internally.

In below terraform plan result doesn't show the change of replicas_per_node_group because I already removed my replica on my aws console.

------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.redis_cluster.aws_elasticache_replication_group.redis_cluster will be updated in-place
  ~ resource "aws_elasticache_replication_group" "redis_cluster" {
        arn                            = "arn:aws:elasticache:ap-northeast-1:xxxxxx:replicationgroup:xxxxxx-redis-cluster"
        at_rest_encryption_enabled     = false
        auto_minor_version_upgrade     = true
      ~ automatic_failover_enabled     = true -> false
        cluster_enabled                = true
        configuration_endpoint_address = "xxxxxx-redis-cluster.gfqrwu.clustercfg.apne1.cache.amazonaws.com"
        engine                         = "redis"
        engine_version                 = "4.0.10"
        id                             = "xxxxxx-redis-cluster"
        maintenance_window             = "tue:05:00-tue:06:00"
        member_clusters                = [
            "xxxxxx-redis-cluster-0001-001",
        ]
        multi_az_enabled               = false
        node_type                      = "cache.t2.small"
        number_cache_clusters          = 1
        parameter_group_name           = "xxxxxx-redis-cluster-params"
        port                           = 6379
        replication_group_description  = "xxxxxx-redis-cluster"
        replication_group_id           = "xxxxxx-redis-cluster"
        security_group_ids             = [
            "sg-981aa5e0",
        ]
        security_group_names           = []
        snapshot_retention_limit       = 5
        snapshot_window                = "14:00-15:00"
        subnet_group_name              = "xxxxxx-redis-subnet-group"
        tags                           = {
            "AppRole" = "redis-cluster"
        }
        transit_encryption_enabled     = false

        cluster_mode {
            num_node_groups         = 1
            replicas_per_node_group = 0
        }
    }

Plan: 0 to add, 1 to change, 0 to destroy.

------------------------------------------------------------------------

$ grep -rn number_cache_clusters modules/elasticache/redis_cluster/

PS. I changed my terraform plan result to my last try to resolve this problem, and like as you think, this terraform code failed to apply because I changed my automatic_failover_enabled to false.

DoctorPolski commented 3 years ago

Thanks for engaging on this Graham.

The error message is a little confusing because effectively it only directly references the number_cache_clusters method for provisioning clusters. Honestly good luck thinking up a concise error message that references both...

These are the params for the cluster referenced in my OP: small_env_params

I believe we may share the same suspicion that since AWS made their changes that brought about the need for your v3.26.0 update that my config may effectively be no longer possible to create? Could it therefore be that I and others have been left with an orphaned legacy cluster config as a result of those changes?

The only way to test this theory directly would be to attempt to recreate a similar cluster config from scratch using AWS console or CLI. Even if that turned out to be possible the mystery of being able to set automatic_failover_enabled to TRUE for a single node with nothing to failover to would still remain.

In either case I think a support request to AWS is required here in parallel to our investigations.

gdavison commented 3 years ago

I'll take a look at improving the error messaging. Part of the complication is that we allow setting either, number_cache_clusters or cluster_mode, and calculate the other. When the parameter validations are done, it's picking up the calculated number_cache_clusters.

I'll do some more investigation around the configuration and try to reproduce it. It theoretically should not be possible, at least now, unless there's also some interaction with "Cluster-Mode enabled"/sharding.

Right now, if the validation is turned off, the AWS API returns an error, but I haven't tried it with a cluster.on parameter group.

DoctorPolski commented 3 years ago

I have tested the creation of a "single node cluster" directly on the console and your suspicions are correct @gdavison, cluster.on does indeed make all the difference.

This is my creation template:

small_create

Selecting the Cluster Mode enabled Redis sub-option dynamically updates the Parameter group setting to include the cluster.on suffix.

Setting Number of Shards to 1 and Replicas per Shard to 0 forcibly unsets the Multi-AZ option and produces the blue info ribbon. We know the provider currently accounts for this as per previous discussion in this issue and related threads.

Resulting cluster:

small_cluster

In respect of clustering, this matches the config of my "small environments".

gdavison commented 3 years ago

Thanks for testing that, @DoctorPolski. That's not the result I expected… 😀

ghost commented 3 years ago

This has been released in version 3.36.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

ghost commented 3 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

hashicorp / terraform-provider-aws