Feature Request: "Sequential" Resource / Data Source Flag

Module version

v1.5.0

Use-cases

Certain resources/APIs do not like "concurrent" operations, where the API is used at the same time as other operations (or in "too close" succession).

An example of this is the GCP Networking Peering API: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_network_peering, which can be found to fail when multiple peering operations against a project are being performed.

These API errors can either be put through provider specific retry logic, or just results in operation failure if the API errors have that result in the resource.

Attempted Solutions

Generally the "workarounds" for this I've found involve:

Using depends_on and references to enforce a order of operations, this only works well for "related" resources, for example the peering resource, which requires configuration of an A side then a configuration of a B side.
Reducing concurrency in the entire operation, which is unfortunate when only a small percentage of resources being used have this issue
Manually ordering/managing additions to configuration Most of these are not "ideal" in that they require attentive effort on the part of the operator to get it correct OR they require abusing the depends_on and reference system in a way which is not semantically correct (meaning we create dependencies which are not "real" in practical terms).

Plugin / Provider Specific:

Catching the concurrent API call errors (if possible), and doing waiting/retries on them. (Only works if the upstream API reports them in a consistent way)

Proposal

For APIs where we know this limitation in usage exists, it would be nice if the Resource schema supported marking these resources (or data sources) as "Sequential" (or pick your favorite term for this).

This is something I would expect to be added to the Resource Schema and configured by a provider developer (either in plugin-sdk-v2 or framework-sdk).

The results of this flag would be that the Terraform provider would not concurrently perform operations on resources of that type, globally within the full configuration. However, that would only apply to those specific resources, and standard rules would apply to everything else. This would have to be smart enough to sequentially order those resources globally and also not try to perform operations on resources that depend on those resources until they've gone through the process.

Ultimately, the "why" is to make a more semantic and standard mechanism for enforcing "sequential" operations on APIs which require it, rather than using workarounds or "provider specific" workaround which may be inconsistent and vary in quality.

Example of implementation expectation:

From https://github.com/hashicorp/terraform-provider-scaffolding-framework/blob/main/internal/provider/example_resource.go

func (r *ExampleResource) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
    resp.Schema = schema.Schema{
        // This description is used by the documentation generator and the language server.
        MarkdownDescription: "Example resource",

                // EXAMPLE IMPLEMENTATION: This would mark this as "Sequential"
                Sequential: True,

        Attributes: map[string]schema.Attribute{
            "configurable_attribute": schema.StringAttribute{
                MarkdownDescription: "Example configurable attribute",
                Optional:            true,
            },
            "defaulted": schema.StringAttribute{
                MarkdownDescription: "Example configurable attribute with default value",
                Optional:            true,
                Computed:            true,
                Default:             stringdefault.StaticString("example value when not configured"),
            },
            "id": schema.StringAttribute{
                Computed:            true,
                MarkdownDescription: "Example identifier",
                PlanModifiers: []planmodifier.String{
                    stringplanmodifier.UseStateForUnknown(),
                },
            },
        },
    }
}

References

There's discussion here about solving a problem like this using a provider-managed sync.Mutex.

Would that solve the problem without requiring any changes to the schema?

If something like Sequential were to be implemented in the framework schema, my vote would be for a string (or []string) rather than a bool which names a shared mutex so that resources of different types could be de-conflicted with each other:

func (r *ExampleResourceA) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
        resp.Schema = schema.Schema{
                Sequential: []string{"ExampleResource A and B cannot run concurrently"},
        }
}

func (r *ExampleResourceB) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
        resp.Schema = schema.Schema{
                Sequential: []string{
                        "ExampleResource A and B cannot run concurrently",
                        "ExampleResource B and C cannot run concurrently",
                },
        }
}

func (r *ExampleResourceC) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
        resp.Schema = schema.Schema{
                Sequential: []string{"ExampleResource B and C cannot run concurrently"},
        }
}

*danger! deadlock potential with the []string approach :)

There's discussion here about solving a problem like this using a provider-managed sync.Mutex.

Would that solve the problem without requiring any changes to the schema?

If something like Sequential were to be implemented in the framework schema, my vote would be for a string (or []string) rather than a bool which names a shared mutex so that resources of different types could be de-conflicted with each other:
func (r *ExampleResourceA) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
        resp.Schema = schema.Schema{
                Sequential: []string{"ExampleResource A and B cannot run concurrently"},
        }
}

func (r *ExampleResourceB) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
        resp.Schema = schema.Schema{
                Sequential: []string{
                        "ExampleResource A and B cannot run concurrently",
                        "ExampleResource B and C cannot run concurrently",
                },
        }
}

func (r *ExampleResourceC) Schema(ctx context.Context, req resource.SchemaRequest, resp *resource.SchemaResponse) {
        resp.Schema = schema.Schema{
                Sequential: []string{"ExampleResource B and C cannot run concurrently"},
        }
}
*danger! deadlock potential with the []string approach :)

You're example isn't quite what I'm trying to solve for:

To prevent concurrent operations of the SAME resource (aka multiple instances of resource A in the same root module).

Also while yes each provider can implement something on their own, the reason I'm asking for a addition to the framework SDK is to avoid each provider needing to reinvent the wheel for a similar class of problems, and the varying quality of those inventions I've seen in the providers that do use them (such as dramatically altering performance, or inconsistent functionality of these workarounds).

Hi @tgoodsell-tempus 👋 Thank you for this feature request and @chrismarget-j for contributing to the discussion. Our main attention is elsewhere at the moment (finishing up Terraform 1.8 features such as provider-defined functions and the ability to state move across resource types), but to keep the discussion going here, there are a few upfront considerations here which make choosing how to expose this sort of functionality harder than it might seem on the surface. This is not to say that these details/decisions are insurmountable, but any design or potential implementation proposal should account for at least these:

Whether this should be implemented in the "core" framework code, or offered as an "extension" Go module, similar to terraform-plugin-framework-timeouts, terraform-plugin-framework-(json|net|time)types, etc. One of the general design principles of the framework is that its abstractions should be loosely based off existing Terraform Plugin Protocol functionality and extensible enough so developers can still implement use case or nice-to-have abstractions that fall outside of that. If this functionality is only on the provider side of the protocol, which this proposal as written would be, then it likely falls more under that extensibility heuristic. More concretely, it should be possible for any developer today to implement a custom Go type built on top of the existing resource.Resource type that does what any framework implementation might potentially do. In fact, I would actually recommend doing something like this as part of an implementation proposal. It is a great way to test how it works in practice and could potentially be exposed for other developers to try/use.
Whether practitioners should be able to override this functionality via Terraform configuration. APIs change over time while provider implementations may not immediately. This introduces a larger amount of implementation complexity. Conversely, should a practitioner be able to do this themselves without much/any provider code implementation?
Whether this should be or support a multiple concurrent operations. APIs might allow up to a certain amount of concurrency rather than strictly a single operation at once.
Whether support for dynamically determined concurrency rules may be necessary, e.g. based on provider configuration or API lookup. Certain APIs have differing concurrency rules based on customer/location.
Which Terraform Plugin Protocol RPCs these concurrency rules should be applied. There are multiple Terraform -> Provider operations that occur against a single resource implementation, e.g. with "API access" PlanResourceChange vs ApplyResourceChange vs ImportResourceState. Should these rules be further granular to the framework abstractions, e.g. Create vs Read vs Update vs Delete? Should these rules be further granular to particular API calls themselves, which is ultimately the source of these concurrency rules? It is not guaranteed that a restricted API call only occurs within one resource. Will it be more or less frustrating to have a concurrency solution that is partially detached from the problematic calls? What if a resource implements a concurrency restricted API call followed by concurrency safe API calls (e.g. a restricted, but quick "creation" call, followed by polling calls to wait until its ready)?
If resource based, whether data sources should also be considered.

Ideally, it would be the API SDKs themselves that would implement this sort of concurrency handling as they are the closest to the API implementation details causing the issues. In practice though, this does tend to be a problem left for API consumers to figure out, which is quite unfortunate. The prior terraform-plugin-sdk (v1 only 😉 ) implemented a mutexkv package as a "baked in" solution so provider developers could create key-specific mutex, e.g. have a mutex per API call. While this was ultimately removed because there are a other Go concurrency libraries that exist with similar functionality, the design was flexible enough to enable concurrency handling in spite of the nuance of many of considerations above.

I'm going to stop here because that was quite a lot of words on this topic, but we are happy to discuss this more in detail in the future. Thanks again for raising this.

You're example isn't quite what I'm trying to solve for:

To prevent concurrent operations of the SAME resource (aka multiple instances of resource A in the same root module).

I think "a provider-managed sync.Mutex" can deconflict multiple instances of the same resource (though I'm not clear on what you thought I was suggesting ... deconflict a single instance from ... itself?) Having the ability to deconflict multiple resource types is just a bonus.

You'd create the mutex in the provider's Configure() method, and make it available (via pointer) to all instances of the resource via the resource Configure() methods. This way, each instance of the resource waits for its turn talking to the API.

Hi. I'm in this same boat. I'm building a Terraform Provider with the Terraform Provider Framework for a Bot Defender Platform. The API of the Bot Defender has a Race Condition where two Bot Defender Rules created at the same time can prevent the other from being created. To work around this I have been using a Rate Limiter to limit the create operation to one create every two seconds.

hashicorp / terraform-plugin-framework