hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Staying within remote system API concurrency/rate/etc limits #28152

Open alexjurkiewicz opened 3 years ago

alexjurkiewicz commented 3 years ago

Current Terraform Version

0.14

Use-cases

Sometimes a provider, or the backend API it talks to, limits parallel modification of a certain resource.

Adding a lifecycle { parallelism = n } directive would help work around certain problematic cases.

For example in AWS, the Lambda API control plane has a rate limit of 15 req/sec. If you have the following code:

resource "aws_s3_bucket_object" "objects" {
  for_each = local.my_100_item_set
  # ...
}

resource "aws_lambda_function" "my_functions" {
  for_each = local.my_100_item_set
  # ...
}

If you run this with -parallelism=20, you're going to have a bad time: most of your Lambda API calls will fail due to rate limiting, and the total stack creation time will be extended.

There are other use-cases. AWS AppSync and API Gateway both allow you to modify APIs using several child resource types. But modifying the same API from multiple copies of a for_each resource will fail with "concurrent execution" errors.

Attempted Solutions

The above issues are all technically issues with the provider. The AWS provider could be enhanced so only 15 Lambda API modification requests are made at once, and the AppSync & API Gateway resources could also serialise modifications to underlying APIs.

But even in major areas of major providers, this is not implemented. You can reliably create an AWS ConcurrentExecution error with the following code:

resource "aws_s3_bucket" "default" {
  bucket = "mybucket"
}

resource "aws_s3_bucket_policy" "default" {
  bucket = aws_s3_bucket.default.id
  policy = "..." # policy document elided
}

resource "aws_s3_bucket_public_access_block" "default" {
  bucket = aws_s3_bucket.default.id
  # ...
}

So relying on providers to fix this issue isn't going to work.

You can also work around this issue by using terraform apply -parallelism=1. This is also not ideal, because it is overly broad and will slow down management of your entire stack, when only one part is a problem.
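A finer-grained workaround today is to serialise only the problematic resources with explicit depends_on chains, at the cost of verbosity and giving up for_each. A minimal sketch, with illustrative names:

resource "aws_appsync_datasource" "a" {
  # ...
}

resource "aws_appsync_datasource" "b" {
  # Terraform won't start modifying "b" until "a" is done.
  depends_on = [aws_appsync_datasource.a]
  # ...
}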

Proposal

Add a parallelism option to lifecycle blocks. For example:

resource "aws_s3_bucket_object" "objects" {
  for_each = local.my_100_item_set
  # ...
}

resource "aws_lambda_function" "my_functions" {
  for_each = local.my_100_item_set
  # ...
  lifecycle {
    parallelism = 15
  }
}

resource "aws_appsync_datasource" "default" {
  for_each = local.my_100_item_set
  # ...
  lifecycle {
    parallelism = 1
  }
}

The default value is null -- no limit.

apparentlymart commented 3 years ago

Thanks for sharing this use-case, @alexjurkiewicz!

This use-case (or rather, unavoidable technical constraint) makes sense to me, along with the assertion that providers should ideally be handling it as part of their remit to map from Terraform's "idealized" model to the specific constraints and requirements of a particular remote system.

However, I do have some reservations about the specific solution you proposed.

It seems like the user-configured design requires the user to carefully manage these parallelism values across the entire configuration to ensure that the total for a particular resource type doesn't exceed some value. That might be very challenging if some of the resources are declared in shared modules and thus not directly managed by the person who is writing the top-level configuration. By treating it as part of the definition of a resource type though (that is, somehow handling it inside the provider), it would work regardless of whether it's a single resource with 100 instances or 100 separate resources.
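To illustrate the coordination problem with a hypothetical sketch (the module layout and variable names are made up): two independently-maintained modules can each choose a per-resource limit that looks safe in isolation, while their sum still exceeds the account-wide limit.

# Hypothetical: the same block appears in both modules/app-a and modules/app-b.
resource "aws_lambda_function" "fn" {
  for_each = var.functions
  # ...
  lifecycle {
    # Proposed syntax. Each module author picks 10, but 10 + 10 > 15 req/sec
    # once both modules are instantiated in one configuration.
    parallelism = 10
  }
}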

You said "relying on providers to fix this issue is never going to happen", but I'm not sure that it not being addressed yet is evidence that it cannot be addressed: just as with the Terraform Core team, the provider maintainer teams have considerably more potential work than they have time to do and so they have to prioritize, but that doesn't mean they couldn't prioritize something related to this if it would lead to a material improvement in Terraform usability. If you haven't already, it could be worth opening a issue in the AWS provider's repository to discuss this focused problem, which might lead either to a discussion about a suitable design to address it or might yield valuable information about why the provider team hasn't addressed it which I expect would be important to consider in a hypothetical generalized design too, particularly if it's being blocked by a technical constraint.

We typically only implement generic (Terraform Core-implemented) solutions -- particularly ones that require language changes -- if we can see that their benefits will generalize across many use-cases. With that said, I'd like to use this issue as a place to gather more information about other systems with request rate or concurrency limits, which we can firstly use to see what the impact would be of working on this (for prioritization purposes) and also hopefully learn more about what an appropriate generalized design would be, so we can address as many problems as possible with as little new language surface area as possible.

In the meantime, if you need a solution to this specific problem with Lambda then a change to the AWS provider's implementation of these resource types is likely to be the most expedient solution, because the design effort for that can focus on one specific set of requirements.

Thanks again!

apparentlymart commented 3 years ago

From some limited initial research to see if this was already discussed by the AWS provider team, I found hashicorp/terraform-provider-aws#17828. It isn't actually about limiting concurrency, but it does suggest that the particular limit mentioned in the use-case here can vary per-account and per-region, so handling it automatically would probably require the provider to call GetAccountSettings to learn the appropriate concurrency limit to enforce, or else to pessimistically choose the lowest known value, which I assume would apply to folks using the free tier.

As we collect other examples of rate-limited or concurrency-limited APIs here, it would be helpful to also know if the limits involved are static (the same for everyone at all times) or variable (either per-account/per-region like AWS Lambda or other variations like time-of-day differences).

alexjurkiewicz commented 3 years ago

Thanks for the feedback. It's really comprehensive; you obviously put in a lot more thought than my rushed 4:55pm issue writing :)

You are right: at the resource level, we can't set limits perfectly, or even vaguely well. Customising parallelism per-resource is too unwieldy.

I still think some "escape hatch" is warranted here. Even if the provider "knows" about these restrictions and implements a lock to prevent issues, this knowledge can't be exposed to Terraform's planning engine, which will lead to sub-optimal apply order as resources sit in "creating" while the provider really idles waiting for a semaphore to come free. An escape hatch would be useful in a similar way to time_sleep, data external, depends_on, etc. These are never the preferred tool, but most practitioners will find a use-case for their power every now and then.
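For reference, a similar effect is already possible (verbosely) with today's escape hatches, e.g. pacing two conflicting operations with the time_sleep resource from the hashicorp/time provider. A minimal sketch, with illustrative resource names:

resource "aws_lambda_function" "first" {
  # ...
}

# Hold the second modification back until some time after the first completes.
resource "time_sleep" "pace" {
  depends_on      = [aws_lambda_function.first]
  create_duration = "5s"
}

resource "aws_lambda_function" "second" {
  depends_on = [time_sleep.pace]
  # ...
}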

My updated proposal is a lifecycle directive that acts as syntactic sugar: it lets you add a linear dependency chain between the instances of a for_each/count resource. For example:

resource "aws_lambda_function" "default" {
  for_each = toset(["a", "b"])
  ...
  lifecycle { serialise_execution = true }
}

Is equivalent to:

resource "aws_lambda_function" "a" {
  ...
}
resource "aws_lambda_function" "b" {
  ...
  depends_on = [aws_lambda_function.a]
}

As to your questions about rate limits: most AWS API limits are not published as numbers, and they have changed over time. Lambda's control plane rate limit is relatively unusual in being fixed and publicly known.

Most AWS API rate limits are flexible if you pay enough. Some services will let any account adjust limits (S3, CloudWatch, SSM), but others require spend over x/month.

The AppSync "modifications must be serialised" restriction (ditto S3 and a few other services) is not a "limit" per se, and is therefore static.

jsm commented 2 years ago

Hi, I'd like to add another data point here.

AWS currently cannot handle updating multiple capacity providers in a cluster at the same time. It would be nice to be able to force serialization of updating the aws_ecs_capacity_provider resource.

Currently, this causes the apply to fail, and I have to re-run terraform apply multiple times to get all capacity providers successfully updated.

From AWS Support:

This error occurs when another attachment, such as another ECS Capacity Provider, is being updated; the new update will fail. The solution is to update one attachment / Capacity Provider at a time to ensure the cluster or attachments do not get into a corrupted state.
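Until something like the serialise_execution directive proposed above exists, the same depends_on workaround applies here too. A sketch with illustrative names (the autoscaling groups are assumed to be defined elsewhere):

resource "aws_ecs_capacity_provider" "a" {
  name = "cp-a"
  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.a.arn
  }
}

resource "aws_ecs_capacity_provider" "b" {
  # Updated only after "a" finishes, avoiding the concurrent-update error.
  depends_on = [aws_ecs_capacity_provider.a]
  name       = "cp-b"
  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.b.arn
  }
}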