hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

MSK Rolling Upgrade Continuously Retries if Partition Count > MSK Limit #17332

Open james-bjss opened 3 years ago

james-bjss commented 3 years ago


Terraform CLI and Terraform AWS Provider Version

terraform -v
Terraform v0.14.5
+ provider registry.terraform.io/hashicorp/aws v3.20.0
+ provider registry.terraform.io/hashicorp/template v2.2.0

Affected Resource(s)

aws_msk_cluster

Terraform Configuration Files

resource "aws_msk_cluster" "example" {
  cluster_name           = "example"
  kafka_version          = "2.4.1" # After creating more partitions than the upgrade limit allows, change to 2.5.1 and reapply
  number_of_broker_nodes = 3
  ...
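
For reference, a fuller configuration along these lines reproduces the issue. The broker settings below (instance type, volume size, subnets, security group) are illustrative placeholders, not the original values:

resource "aws_msk_cluster" "example" {
  cluster_name           = "example"
  kafka_version          = "2.4.1" # change to "2.5.1" and reapply to trigger the upgrade
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"            # illustrative
    ebs_volume_size = 100                         # illustrative
    client_subnets  = var.private_subnet_ids      # hypothetical variable
    security_groups = [aws_security_group.msk.id] # hypothetical resource
  }
}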

Debug Output

Gist with relevant logs

Expected Behavior

The apply should fail early, indicating that the upgrade can't be performed due to the high partition count.

Actual Behavior

The PUT call to /v1/clusters/clusterArn/version fails with an HTTP 429 (X-Amzn-Errortype: HighPartitionCountException), and the Terraform output reports that it is retrying (x25).

Steps to Reproduce

  1. Deploy MSK Cluster with kafka_version="2.4.1.1" via TF
  2. Create topics and partitions exceeding the upgrade limits on the brokers (see the limits on the Upgrade endpoint)
  3. Update kafka_version to 2.5.1 and apply to trigger upgrade

Important Factoids

There may be an argument that it should keep retrying in case the partition count drops, but I would rather the apply fail early with an indication of the actual error. In theory Terraform is honoring the 429 response by retrying, but should it?
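
As a stop-gap, lowering the provider's max_retries at least shortens how long the apply sits retrying the 429 before the error surfaces; the value below is illustrative, and this doesn't fix the underlying response code:

provider "aws" {
  region      = "eu-west-1" # illustrative
  max_retries = 3           # default is 25, hence the apply currently retrying 25 times
}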

References

https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html#bestpractices-right-size-cluster
https://docs.aws.amazon.com/msk/1.0/apireference/clusters-clusterarn-version.html#clusters-clusterarn-versionput
https://docs.aws.amazon.com/msk/latest/developerguide/limits.html

Farzad-Jalali commented 3 years ago

I got the exact same problem!

james-bjss commented 3 years ago

I have also raised this with AWS support, who have escalated it to the MSK team to confirm whether the 429 response is expected behavior.

james-bjss commented 3 years ago

Update on the above: the AWS MSK team are reviewing the 429 response code and may remediate this, but no dates have been given.

marcincuber commented 2 years ago

@james-bjss any updates on this?

james-bjss commented 2 years ago

> @james-bjss any updates on this?

Hi @marcincuber - unfortunately I never got a response back from AWS support on this. It was passed on to the MSK team and the ticket was closed. In theory it could be handled in the provider by checking for the specific error header it returns, but I'm not sure the team would want to put that workaround in code.

Have you had this issue recently? I haven't retested, so it's entirely possible it has been resolved upstream.

marcincuber commented 2 years ago

@james-bjss I haven't tested it. However, I will be starting work on Kafka this week. This is an interesting issue you've raised, so I will definitely check whether I can reproduce it.

Pekinek commented 10 months ago

I also contacted AWS support about this issue, and changing the 429 to something else was added to their backlog - no ETA though.

"Thank you for providing the change request. I have added this to the backlog and it will be prioritized accordingly."