hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws

aws_elasticache_replication_group always recreates when cluster_mode replicas_per_node_group is zero #4817

Closed: b-dean closed this issue 3 years ago

b-dean commented 6 years ago

When an aws_elasticache_replication_group uses cluster mode and replicas_per_node_group is 0, the resource wants to recreate on every plan.

Terraform Version

$ terraform version
Terraform v0.11.7
+ provider.aws v1.22.0

Affected Resource(s)

aws_elasticache_replication_group

Terraform Configuration Files

variable "size" {
  default = "1"
}

resource "aws_elasticache_replication_group" "foo" {
  replication_group_id          = "foo"
  replication_group_description = "foo"
  node_type                     = "cache.r3.large"
  automatic_failover_enabled    = "${var.size > 1}"

  cluster_mode {
    num_node_groups         = 1
    replicas_per_node_group = "${var.size - 1}"
  }
}

Debug Output

https://gist.github.com/b-dean/2746b1e47ff03544e3c6e92fcf877af6

Expected Behavior

A second terraform plan should show no changes needed.

Actual Behavior

The plan shows:

-/+ aws_elasticache_replication_group.foo (new resource required)
      id:                                     "foo" => <computed> (forces new resource)
      apply_immediately:                      "" => <computed>
      at_rest_encryption_enabled:             "false" => "false"
      auto_minor_version_upgrade:             "true" => "true"
      automatic_failover_enabled:             "false" => "false"
      cluster_mode.#:                         "0" => "1"
      cluster_mode.0.num_node_groups:         "" => "1"
      cluster_mode.0.replicas_per_node_group: "" => "0" (forces new resource)
      configuration_endpoint_address:         "" => <computed>
      engine:                                 "redis" => "redis"
      engine_version:                         "3.2.10" => <computed>
      maintenance_window:                     "fri:08:30-fri:09:30" => <computed>
      node_type:                              "cache.r3.large" => "cache.r3.large"
      number_cache_clusters:                  "1" => <computed>
      parameter_group_name:                   "default.redis3.2" => <computed>
      primary_endpoint_address:               "foo.27yrhh.ng.0001.use1.cache.amazonaws.com" => <computed>
      replication_group_description:          "foo" => "foo"
      replication_group_id:                   "foo" => "foo"
      security_group_ids.#:                   "0" => <computed>
      security_group_names.#:                 "0" => <computed>
      snapshot_window:                        "06:00-07:00" => <computed>
      subnet_group_name:                      "default" => <computed>
      transit_encryption_enabled:             "false" => "false"

Plan: 1 to add, 0 to change, 1 to destroy.

Steps to Reproduce

  1. terraform apply
  2. terraform plan

Important Factoids

We have the size variable because in our development environments we don't want a bunch of costly replicas, whereas in production we might add -var size=5. I thought we only saw this behavior of cluster_mode showing the wrong values from the refresh when size was 1, but I just ran it with -var size=2 and it still wanted to recreate on the second apply, with the same cluster_mode information missing.
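
For context, the intended usage is roughly this (illustrative commands, assuming the configuration above):

$ terraform apply                # development: size defaults to 1, so 0 replicas
$ terraform apply -var size=5    # production: 1 primary + 4 replicas per node group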

jeremygaither commented 6 years ago

Also seeing this when replicas_per_node_group is not set to zero; cluster_mode.# always seems to change to 0.

-/+ aws_elasticache_replication_group.iavs (new resource required)
      id:                                     "REDACTED" => <computed> (forces new resource)
      apply_immediately:                      "" => <computed>
      at_rest_encryption_enabled:             "false" => "false"
      auto_minor_version_upgrade:             "true" => "true"
      automatic_failover_enabled:             "true" => "true"
      cluster_mode.#:                         "0" => "1"
      cluster_mode.0.num_node_groups:         "" => "1"
      cluster_mode.0.replicas_per_node_group: "" => "2" (forces new resource)
      configuration_endpoint_address:         "" => <computed>
      engine:                                 "redis" => "redis"
      engine_version:                         "2.8.24" => "2.8.24"
      maintenance_window:                     "wed:05:00-wed:06:00" => <computed>
      node_type:                              "cache.m4.large" => "cache.m4.large"
      number_cache_clusters:                  "3" => <computed>
      parameter_group_name:                   "default.redis2.8" => "default.redis2.8"
      primary_endpoint_address:               "REDACTED use1.cache.amazonaws.com" => <computed>
      replication_group_description:          "REDACTED" => "REDACTED"
      replication_group_id:                   "REDACTED" => "REDACTED"
      security_group_ids.#:                   "0" => <computed>
      security_group_names.#:                 "0" => <computed>
      snapshot_retention_limit:               "3" => "3"
      snapshot_window:                        "00:00-05:00" => "00:00-05:00"
      subnet_group_name:                      "REDACTED" => "REDACTED"
      transit_encryption_enabled:             "false" => "false"

kiwivogel commented 6 years ago

Same. This is currently a blocking issue for something we want to set up in production. Terraform: 0.10.8, provider: 1.26.0.

-/+ module.env.module.elasticache-ha.aws_elasticache_replication_group.redis (new resource required)
      id:                                     "REDACTED" => <computed> (forces new resource)
      apply_immediately:                      "" => <computed>
      at_rest_encryption_enabled:             "true" => "true"
      auth_token:                             <sensitive> => <sensitive> (attribute changed)
      auto_minor_version_upgrade:             "true" => "true"
      automatic_failover_enabled:             "true" => "true"
      cluster_mode.#:                         "0" => "1"
      cluster_mode.0.num_node_groups:         "" => "1"
      cluster_mode.0.replicas_per_node_group: "" => "1" (forces new resource)
      configuration_endpoint_address:         "" => <computed>
      engine:                                 "redis" => "redis"
      engine_version:                         "4.0.10" => "4.0.10"
      maintenance_window:                     "fri:22:30-fri:23:30" => "sun:00:00-sun:03:00"
      member_clusters.#:                      "2" => <computed>
      node_type:                              "cache.m3.large" => "cache.m3.large"
      number_cache_clusters:                  "2" => <computed>
      parameter_group_name:                   "default.redis4.0" => <computed>
      port:                                   "6379" => "6379"
      primary_endpoint_address:               "REDACTED.euw1.cache.amazonaws.com" => <computed>
      replication_group_description:          "Replication group for redis elasticache" => "Replication group for redis elasticache"
      replication_group_id:                   "REDACTED" => "REDACTED"
      security_group_ids.#:                   "1" => "1"
      security_group_ids.1286157895:          "REDACTED" => "REDACTED"
      security_group_names.#:                 "0" => <computed>
      snapshot_window:                        "04:30-05:30" => <computed>
      subnet_group_name:                      "REDACTED" => "REDACTED"
      transit_encryption_enabled:             "true" => "true"

This was working earlier with 0 replicas_per_node_group and 2 num_node_groups.

Inspection of the tfstate shows that "cluster_mode.#": "0" is what ends up in the state file. This is likely the cause.

Edit: I was wrong, statefile says the following:

                            "cluster_mode.#": "1",
                            "cluster_mode.4199676665.num_node_groups": "1",
                            "cluster_mode.4199676665.replicas_per_node_group": "1"

It's really strange to me that TF incorrectly goes looking for this information at index 0.

kiwivogel commented 6 years ago

Did some more digging. The problem went away after I reverted the AWS provider to 1.12.0.

casalewag commented 6 years ago

@kiwivogel this worked for me as well, thank you very much.

provider "aws" {
  version = "1.12.0"
  }

Can we confirm that this is being worked on? I would prefer not to pin an older version of the AWS provider when this will be running in production.

kiwivogel commented 5 years ago

@casalewag @jeremygaither @b-dean I figured it out! It's not a bug; it works as it should. The issue here is that Amazon returns "ClusterEnabled": "false" when Terraform compares the state with whatever is running on AWS. This is caused by using the default parameter group, which has ClusterEnabled set to false. It is solvable by passing a parameter group that has cluster mode enabled (either a default one or a custom one). Sadly, this will force you to recreate the resource, as it's not something you can change after creation. Hope this helps. I'll make a PR to update the documentation to warn users about this.
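
To illustrate, here is the configuration from the original report with a cluster-enabled parameter group added (a sketch; default.redis3.2.cluster.on is assumed to be the AWS-managed cluster-enabled counterpart of the default.redis3.2 group seen in the plan output above):

resource "aws_elasticache_replication_group" "foo" {
  replication_group_id          = "foo"
  replication_group_description = "foo"
  node_type                     = "cache.r3.large"
  automatic_failover_enabled    = "${var.size > 1}"

  # A cluster-enabled parameter group, so the refresh reads back
  # ClusterEnabled = true and keeps the cluster_mode block in state.
  parameter_group_name = "default.redis3.2.cluster.on"

  cluster_mode {
    num_node_groups         = 1
    replicas_per_node_group = "${var.size - 1}"
  }
}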

kiwivogel commented 5 years ago

@casalewag @jeremygaither @b-dean Additionally, if you feel safe doctoring tfstate files (disclaimer: this is generally speaking a very bad idea), you can actually "fix" this by removing the cluster_mode block from both the state file and your resource definition and replacing it with number_cache_clusters = "${var.size}" (the total number of nodes) in the resource definition; the state file already has the correct number. Additionally, @radeksimko, this issue should probably be closed because it's not a bug but a configuration issue.
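
A sketch of that workaround applied to the configuration from this issue (assuming a single shard, so the group can be sized without a cluster_mode block):

resource "aws_elasticache_replication_group" "foo" {
  replication_group_id          = "foo"
  replication_group_description = "foo"
  node_type                     = "cache.r3.large"
  automatic_failover_enabled    = "${var.size > 1}"

  # Size the group directly instead of using cluster_mode;
  # var.size counts the primary plus its replicas.
  number_cache_clusters = "${var.size}"
}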

robbiet480 commented 4 years ago

Thanks @kiwivogel, this fixed it for me!

For those looking for a quick fix, if you're using Redis Cluster Mode you want to use default.redis5.0.cluster.on as the parameter group name.

gdavison commented 3 years ago

When setting cluster_mode.num_node_groups to greater than 1, a parameter_group that supports "Cluster Mode" is needed. The default parameter groups that support "Cluster Mode" end in .cluster.on.
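
For reference, a custom cluster-enabled parameter group could be sketched like this (the group name here is hypothetical; cluster-enabled is the ElastiCache parameter that the default *.cluster.on groups set, and it cannot be changed on an existing replication group):

resource "aws_elasticache_parameter_group" "cluster_on" {
  # Hypothetical name; any unique parameter group name works.
  name   = "my-redis5-cluster-on"
  family = "redis5.0"

  # Equivalent to what default.redis5.0.cluster.on sets.
  parameter {
    name  = "cluster-enabled"
    value = "yes"
  }
}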

If you are still encountering this error, please create a new issue.

ghost commented 3 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!