hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

aws_msk_cluster provisioned_throughput block has multiple issues #26031

Open ryan-dyer-sp opened 2 years ago

ryan-dyer-sp commented 2 years ago

Terraform CLI and Terraform AWS Provider Version

Terraform v1.1.4 on linux_amd64

Affected Resource(s)

aws_msk_cluster

Terraform Configuration Files

resource "aws_msk_cluster" "cluster" {
...
    storage_info {
      ebs_storage_info {
        volume_size = var.ebs_volume_size
        dynamic "provisioned_throughput" {
          for_each = var.ebs_volume_throughput == "0" ? [] : ["enabled"]
          content {
            enabled           = true
            volume_throughput = var.ebs_volume_throughput
          }
        }
        dynamic "provisioned_throughput" {
          for_each = var.ebs_volume_throughput != "0" ? [] : ["disabled"]
          content {
            enabled = false
          }
        }
      }
    }
  }
}

Actual Behavior

The existing behavior seems to support only two configurations.

Attempts to disable provisioned throughput require that the block no longer contain the volume_throughput field; hence the weird double dynamic blocks in the example above.

Once disabled, terraform plan constantly detects a change, because the provisioned_throughput block doesn't appear to exist in state. Moreover, simple removal of the block is not sufficient to disable provisioned throughput when it is already enabled on the cluster.

Steps to Reproduce

  1. Create an MSK cluster with provisioned throughput enabled.
  2. Attempt to disable provisioning by simply setting enabled to false and leaving the volume_throughput field as is (see the sketch after this list). This fails because volume_throughput cannot be set when enabled is false, and the failure is detected at apply time, not at plan time.
  3. Attempt to disable provisioning by removing the block entirely. Plan and apply "succeed", but no changes are actually made to the cluster; the block is simply removed from state, and subsequent plans continue to show a change is needed.
  4. Disable provisioning by setting enabled = false with no volume_throughput. This succeeds, though subsequent terraform plans will show that the provisioned_throughput block still needs to be added with enabled = false.
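
For illustration, a minimal sketch of the step 2 configuration that fails at apply time (the resource name, sizes, and throughput value here are assumptions; other required arguments are elided):

resource "aws_msk_cluster" "example" {
  # ... required arguments elided ...
  broker_node_group_info {
    # ...
    storage_info {
      ebs_storage_info {
        volume_size = 1000
        provisioned_throughput {
          enabled           = false
          volume_throughput = 250 # leftover value; rejected at apply time, not at plan time
        }
      }
    }
  }
}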

AdamTylerLynch commented 2 years ago

Hello, and thank you for reporting an issue!

I'm having a hard time understanding the intended outcome and how it is not being met. Would it be possible for you to provide some clarity on the use case? This will help us build a failing acceptance test, which we can then use for remediation.

An example:

1) Starting with a single aws_msk_cluster, with a single EBS drive with multiple volumes, and no provisioned_throughput: <sample HCL>

2) Attempt to add provisioned_throughput to all volumes: <sample HCL>

3) Thing X happens. Expected Y to happen.

<errors>

ryan-dyer-sp commented 2 years ago

Ideally, the following three configs would all behave the same way:

<no config>
provisioned_throughput {
  enabled = false
}

and

provisioned_throughput {
  enabled = false
  volume_throughput = X
}

All three would force disablement of provisioned throughput, and subsequent terraform plans using any of the three configurations, when throughput is disabled on the MSK cluster, would result in no changes necessary.

Right now, the first option (no block at all) seems to simply not make any changes; and from the historical perspective of people who enabled the feature on clusters outside of Terraform, having Terraform disable their config could be rude. So if that one doesn't change, I can live with it.

However, the second should not constantly result in terraform plan showing needed changes.

And ideally the third would know to filter the throughput out of the API request, so that people could simply enable/disable throughput via a basic provisioned_throughput block driven by variables, and not need the double dynamic block I put in the ticket to capture the different syntaxes that block requires depending on whether it is enabled or disabled.
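
Spelled out, that single-block pattern would look something like this (var.ebs_volume_throughput appears in the ticket config; the enabled variable name is an assumption):

provisioned_throughput {
  enabled           = var.ebs_volume_throughput_enabled
  # Ideally the provider would drop this from the API request when enabled = false.
  volume_throughput = var.ebs_volume_throughput
}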

sean-rossignol commented 2 years ago

My team is also facing this issue. We support enabling provisioned_throughput but set provisioned_throughput.enabled to false by default, with no additional parameters specified. So our default configuration looks like example 2 outlined in the comment above by @ryan-dyer-sp:

provisioned_throughput {
  enabled = false
}

This default configuration seems to work fine with new clusters stood up after we added support for setting the provisioned_throughput block in our internal msk module. But using this newer version of the module against an existing cluster results in never-ending drift, since Terraform expects the configuration block above to be present on the target cluster, and it is missing after every apply.

We can address this in our msk module by reverting to a version from before we supported this parameter, but I would like to understand how the provider expects a disabled provisioned_throughput configuration block to be constructed so that it works with both new and existing clusters. It could be that the answer is that this isn't supported at this time; if so, I would like to understand when support for this use case will be added.

MaxymVlasov commented 1 year ago

Starting with AWS provider 5.0, this bug is no longer silently ignored as it was in provider v4; instead it produces the error below:

 ~ update in-place

Terraform will perform the following actions:

  # aws_msk_cluster.msk will be updated in-place
  ~ resource "aws_msk_cluster" "msk" {
        id                       = "arn:aws:kafka:us-east-1:433464930109:cluster/spoton-msk/dafaafc8-297a-410c-8acc-7903dfd668a7-12"
        tags                     = {}
        # (10 unchanged attributes hidden)

      ~ broker_node_group_info {
            # (4 unchanged attributes hidden)

          ~ storage_info {
              ~ ebs_storage_info {
                    # (1 unchanged attribute hidden)

                  - provisioned_throughput {
                      - enabled           = false -> null
                      - volume_throughput = 0 -> null
                    }
                }
            }

            # (1 unchanged block hidden)
        }

        # (4 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.
aws_msk_cluster.msk: Modifying... [id=arn:aws:kafka:us-east-1:433464930109:cluster/spoton-msk/dafaafc8-297a-410c-8acc-7903dfd668a7-12]
╷
│ Error: updating MSK Cluster (arn:aws:kafka:us-east-1:433464930109:cluster/spoton-msk/dafaafc8-297a-410c-8acc-7903dfd668a7-12) broker storage: BadRequestException: The request does not include any updates to the EBS volumes of the cluster. Verify the request, then try again.
│ {
│   RespMetadata: {
│     StatusCode: 400,
│     RequestID: "469ec42e-adb6-4733-871e-0a0c10b37939"
│   },
│   Message_: "The request does not include any updates to the EBS volumes of the cluster. Verify the request, then try again."
│ }
│ 
│   with aws_msk_cluster.msk,
│   on msk.tf line 100, in resource "aws_msk_cluster" "msk":
│  100: resource "aws_msk_cluster" "msk" {
│ 
╵

I tried manually fixing the state file, and removing and re-importing the resource, but nothing helped.

jasonstitt commented 1 year ago

This issue with provisioned_throughput was entirely blocking for me until I ignored the section:

  lifecycle {
    ignore_changes = [
      broker_node_group_info[0].storage_info[0].ebs_storage_info[0].provisioned_throughput
    ]
  }

msarfaty commented 1 year ago

YMMV, but I got this working without lifecycle/ignore_changes by removing the dynamic block that conditionally creates it and instead just doing:

provisioned_throughput {
    enabled           = local.provisioned_throughput_enabled
    volume_throughput = local.provisioned_throughput_enabled ? local.volume_throughput : null
}

Note that for me it didn't work with 0 in place of null, because that fails the provider's schema validation. Explicitly setting enabled = false seems to do the trick. Omitting the block caused issues, as others have noted.
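
For completeness, a sketch of the locals this pattern assumes (the names come from the snippet above; the values are illustrative):

locals {
  provisioned_throughput_enabled = true
  volume_throughput              = 250 # MiB/s; passed through only when enabled
}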

mgusiew-guide commented 11 months ago

I also hit this issue when introducing this feature. I ended up with the following variable declaration:

variable "provisioned_storage_volume_throughput" {
  description = <<EOF
                Specifies whether provisioned throughput is turned on and the volume throughput target.
                For backwards compatibility, it is set to null by default.
                The feature can be enabled by setting enabled flag to true and specifying volume throughput.
                The volume throughput rate between EC2 instance and EBS volume is specified in MiB per second.
                To use this configuration brokers must be of type kafka.m5.4xlarge or larger and
                storage volume must be 10 GiB or greater.
                The minimum throughput value is 250, the maximum depends on the broker type and can be found in AWS MSK documentation.
                Once enabled, it is possible to disable the feature by setting the enabled flag to false and volume throughput to null.
                Note that changing provisioned_storage_volume_throughput from null to an object with enabled set to false,
                or the other way around, results in a runtime error.
EOF

  type = object({
    enabled    = bool
    throughput = number
  })
  validation {
    # Valid states: null (feature untouched), disabled with throughput null or 0,
    # or enabled with throughput >= 250.
    condition = (var.provisioned_storage_volume_throughput == null ||
      (try(var.provisioned_storage_volume_throughput.enabled, false) == false && coalesce(try(var.provisioned_storage_volume_throughput.throughput, 0), 0) == 0) ||
      (coalesce(try(var.provisioned_storage_volume_throughput.enabled, false), false) && coalesce(try(var.provisioned_storage_volume_throughput.throughput, 0), 0) >= 250)
    )

    error_message = "provisioned_storage_volume_throughput.throughput requires a minimum value of 250 when provisioned_storage_volume_throughput.enabled is true; it should be null or 0 when enabled is false."
  }

  default = null
}

and the code:

...
ebs_storage_info {
  volume_size = var.broker_node_ebs_volume_size_gb
  dynamic "provisioned_throughput" {
    iterator = provision_throughput_iterator
    for_each = var.provisioned_storage_volume_throughput == null ? [] : [var.provisioned_storage_volume_throughput]
    content {
      enabled           = provision_throughput_iterator.value.enabled
      volume_throughput = provision_throughput_iterator.value.throughput
    }
  }
}
...

In order to smoothly migrate existing clusters, the default variable value is null; I tried an object with enabled set to false, but that fails at runtime.

The nasty part is explaining the valid transitions for "provisioned_storage_volume_throughput". The following transitions work:

  1. null -> {enabled = true, throughput = ...}
  2. {enabled = true, throughput = ...} -> {enabled = false, throughput = 0}
  3. {enabled = false, throughput = null} -> {enabled = true, throughput = ...}

Other transitions will fail, in particular null -> {enabled = false, ...} and the reverse, as noted in the variable description above.

I could probably experiment with pre- and postconditions to improve this, but I am on Terraform 1.0.x at the moment, so that is not yet an option.

All in all, I agree that the provider should be more forgiving and treat null the same way as {enabled = false, throughput = null}; this would simplify the migration and minimise the valid states/transitions.

P.S. The reason I set throughput to 0 when disabling is to keep the Terratest code simple; from the TF perspective it could be null or omitted when enabled is set to false.

vrchen commented 10 months ago

"The request does not include any updates to the EBS volumes of the cluster. Verify the request, then try again."

My workaround for this was to increase volume_size marginally (the minimum change is 10 GB), which meant there was at least one real change under the MSK cluster resource (resource "aws_msk_cluster" "msk") alongside the false -> null no-op. This let me avoid using lifecycle/ignore_changes, since we needed those attributes to remain updatable in the future.
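
A sketch of that workaround, assuming an original volume_size of 1000 GiB (per the update below, the diff eventually reappeared anyway):

ebs_storage_info {
  volume_size = 1010 # was 1000; a 10 GiB bump so the update request contains a real EBS change
  # provisioned_throughput block removed
}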

Update: Unfortunately, despite being able to apply the {enabled = false, throughput = 0} => null change with a 10 GB bump to volume_size, the same diff still comes up. So I'm going back to using lifecycle/ignore_changes.

ayoh1 commented 3 months ago

We are seeing this in our plans, and after doing a search it looks like there is no fix yet other than ignoring changes?