terraform-aws-modules / terraform-aws-msk-kafka-cluster

Terraform module to create AWS MSK (Managed Streaming for Kafka) resources πŸ‡ΊπŸ‡¦
https://registry.terraform.io/modules/terraform-aws-modules/msk-kafka-cluster/aws
Apache License 2.0

Random number generator is not updating configuration resource name on create before destroy #27

Closed dabmajor closed 2 months ago

dabmajor commented 4 months ago

Description

When upgrading an MSK cluster to a new version, version 2.5.0 of this Terraform module wants to replace the MSK configuration resource. The create-before-destroy is required because you cannot destroy the configuration for an active cluster. However, your random number generator does not always generate a unique name, so the create-before-destroy fails.

See this line of code in your module: https://github.com/terraform-aws-modules/terraform-aws-msk-kafka-cluster/blob/master/main.tf#L259
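For context, the failing pattern looks roughly like the following. This is a simplified sketch of my understanding, not the module's exact code: a `random_id` whose `keepers` are not tied to the Kafka version, so the suffix never rotates when the upgrade forces a replacement.

```hcl
# Hypothetical sketch of the problematic pattern (not the module's actual code).
# The random suffix is generated once; nothing ties it to var.kafka_version,
# so when kafka_versions forces replacement, create_before_destroy tries to
# create a new configuration with the SAME name and AWS returns a 409.
resource "random_id" "this" {
  byte_length = 8
  # no keeper referencing var.kafka_version -> suffix is stable across upgrades
}

resource "aws_msk_configuration" "this" {
  name           = format("%s-%s", var.name, random_id.this.dec)
  kafka_versions = [var.kafka_version] # change here forces replacement

  lifecycle {
    create_before_destroy = true # required: config is attached to a live cluster
  }
}
```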

Versions

Expected behavior

Create-before-destroy should create a new aws_msk_configuration with a new/unique name from the random number generator every single time it is run.

Actual behavior

After the first successful recreation, the create-before-destroy on this resource creates a new aws_msk_configuration with the same random number generated on the first resource creation run, causing Terraform to fail.

Terminal Output

.
.
  # module.msk["REDACTED"].module.msk.aws_msk_configuration.this[0] must be replaced
+/- resource "aws_msk_configuration" "this" {
      ~ arn             = "arn:aws:kafka:REDACTED:REDACTED:configuration/REDACTED" -> (known after apply)
      ~ id              = "arn:aws:kafka:REDACTED:REDACTED:configuration/REDACTED" -> (known after apply)
      ~ kafka_versions  = [ # forces replacement
          - "3.4.0",
          + "3.5.1",
        ]
      ~ latest_revision = 1 -> (known after apply)
        name            = "REDACTED-123456789"        <-- notice the name is not changing
    }
.
.
.
Error: creating MSK Configuration: operation error Kafka: CreateConfiguration, https response error StatusCode: 409, RequestID: ********* ConflictException: A resource with this name already exists.

Possible Solutions

The random number generator was a great idea, but it only works for the first recreation. You could ensure the value is always unique by doing something like this:

locals {
  config_as_string     = join(",", [for key, value in var.configuration_server_properties : "${key}=${value}"])
  unique_config_suffix = substr(sha256("${local.config_as_string}${var.kafka_version}"), -8, -1)
}

name = format("%s-%s", coalesce(var.configuration_name, var.name), local.unique_config_suffix)

Doing this, I believe any time either of these values changes, it would not only trigger the resource recreation but also generate a completely unique 8-character suffix. I didn't test this, but I think something like this would work. You may also need to add one additional value to the hash so you don't end up with hash collisions when two configurations share the same name and the same server properties.
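As a sketch of that last point, the extra value could be an explicit salt variable. `var.config_name_salt` below is hypothetical and untested, shown only to illustrate feeding one more distinguishing input into the hash:

```hcl
# Hypothetical, untested variant: var.config_name_salt is not a real module
# input. It exists only to show how an extra value folded into the hash keeps
# two configurations with identical server properties and Kafka version from
# producing the same 8-character suffix.
variable "config_name_salt" {
  description = "Extra value mixed into the configuration-name hash"
  type        = string
  default     = ""
}

locals {
  config_as_string = join(",", [
    for key, value in var.configuration_server_properties : "${key}=${value}"
  ])

  unique_config_suffix = substr(
    sha256("${var.config_name_salt}${local.config_as_string}${var.kafka_version}"),
    -8, -1 # last 8 hex characters of the digest
  )
}
```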

dabmajor commented 4 months ago

For anyone currently trying to upgrade their clusters who is blocked by this, you can still finish your MSK upgrade by following these steps:

  1. create a temporary msk configuration with the new version (copied from the target cluster)
  2. manually upgrade your cluster using the temporary configuration
  3. delete the previously conflicting configuration
  4. re-run the failed terraform
  5. verify that the replacement configuration was created and applied to the target cluster
  6. delete your temporary configuration

Note: this process is only correct if you are not using MSK replicator to another region.

hameno commented 3 months ago

Same problem here. I wonder if the AWS provider resource is wrong here. I upgraded via the UI and it reused the same configuration. So why is kafka_versions a change that requires a recreate?

bryantbiggs commented 3 months ago

@GreggSchofield ☝🏽

dabmajor commented 3 months ago

I have written a working solution in my own environment; I had to fork this repo to get it working. I agree that the provider itself could likely handle this differently, so we wouldn't have to code around it. I've also found MSK itself lacks some maturity, where API/CLI limitations manifest as what we perceive to be provider limitations.


github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days

github-actions[bot] commented 2 months ago

This issue was automatically closed because it remained stale for 10 days.

github-actions[bot] commented 1 month ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.