hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

MSK Rolling Upgrade Continuously Retries if Partition Count > MSK Limit #17332

Open james-bjss opened 3 years ago

james-bjss commented 3 years ago


Terraform CLI and Terraform AWS Provider Version

terraform -v
Terraform v0.14.5
+ provider registry.terraform.io/hashicorp/aws v3.20.0
+ provider registry.terraform.io/hashicorp/template v2.2.0

Affected Resource(s)

aws_msk_cluster

Terraform Configuration Files

resource "aws_msk_cluster" "example" {
  cluster_name           = "example"
  kafka_version          = "2.4.1" # After creating more partitions than the upgrade limit allows, change to 2.5.1 and reapply
  number_of_broker_nodes = 3
  ...
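
For reference, a fuller configuration along these lines reproduces the issue. The broker settings below (instance type, volume size, subnets, security group) are illustrative placeholders, not the original values:

resource "aws_msk_cluster" "example" {
  cluster_name           = "example"
  kafka_version          = "2.4.1" # change to "2.5.1" and reapply to trigger the upgrade
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"            # illustrative
    ebs_volume_size = 100                         # illustrative
    client_subnets  = var.private_subnet_ids      # hypothetical variable
    security_groups = [aws_security_group.msk.id] # hypothetical resource
  }
}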

Debug Output

Gist with relevant logs

Expected Behavior

The apply should fail early, indicating that the upgrade can't be performed due to the high partition count.

Actual Behavior

The PUT call to /v1/clusters/clusterArn/version fails with an HTTP 429 (X-Amzn-Errortype: HighPartitionCountException), and the Terraform output reports that it is retrying (x25).

Steps to Reproduce

  1. Deploy MSK Cluster with kafka_version="2.4.1.1" via TF
  2. Create topics and partitions exceeding the upgrade limits on the brokers (see the limits on the Upgrade endpoint)
  3. Update kafka_version to 2.5.1 and apply to trigger upgrade

Important Factoids

There may be an argument that it should keep retrying in case the partition count drops, but I would rather the apply fail early with an indication of the actual error. In theory Terraform is honoring the 429 response by retrying, but should it?
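
As a stop-gap, lowering the provider's max_retries at least shortens how long the apply sits retrying the 429 before the error surfaces; the value below is illustrative, and this doesn't fix the underlying response code:

provider "aws" {
  region      = "eu-west-1" # illustrative
  max_retries = 3           # default is 25, hence the apply currently retrying 25 times
}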

References

https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html#bestpractices-right-size-cluster
https://docs.aws.amazon.com/msk/1.0/apireference/clusters-clusterarn-version.html#clusters-clusterarn-versionput
https://docs.aws.amazon.com/msk/latest/developerguide/limits.html

Farzad-Jalali commented 3 years ago

I got the exact same problem!

james-bjss commented 3 years ago

I have also raised this with AWS support, who have escalated it to the MSK team to confirm whether the 429 response is expected behavior.

james-bjss commented 3 years ago

Update on the above: the AWS MSK team are reviewing the 429 response code and may remediate this, but no dates have been given.

marcincuber commented 2 years ago

@james-bjss any updates on this?

james-bjss commented 2 years ago

> @james-bjss any updates on this?

Hi @marcincuber - unfortunately I never got a response back from AWS support on this. It was passed on to the MSK team and the ticket was closed. In theory it could be handled in the provider by checking for the specific error header it returns, but I'm not sure the team would want to put that workaround in code.

Have you had this issue recently? I haven't retested, so it's entirely possible it has been resolved upstream.

marcincuber commented 2 years ago

@james-bjss I haven't tested it. However, I will be starting work on Kafka this week. This is an interesting issue you've raised, so I will definitely check whether I can reproduce it.

Pekinek commented 10 months ago

I also contacted AWS support about this issue, and changing the 429 to something else was added to their backlog - no ETA though.

"Thank you for providing the change request. I have added this to the backlog and it will be prioritized accordingly."