hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io

Systems that need a "commit" step for actions spanning across multiple resources #30045

Open davedotdev opened 2 years ago

davedotdev commented 2 years ago

I've been building providers for networking use cases in which the remote network operating system has commit capabilities. The network operating systems create a transaction of sorts, held in ephemeral memory until a commit RPC occurs. Providers operating in this mode send CRUD RPCs to the network operating system, and once the local state reflects the ephemeral remote state, a 'commit' is required: a separate RPC call that pushes the transaction into non-volatile memory and into operation on the network element.

To solve the commit on apply and on destroy, each provider has two resources, a commit and a destroycommit, which are used as dependencies in HCL. That covers the commits required for those two operations, but for updates, like removing or modifying one or more resources, my current option is to taint the commit resource and run apply. That is not very user friendly and makes it easy for operators to make mistakes.
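To make that concrete, here is a rough sketch of the pattern as it stands today (the provider and resource names are made up for illustration, not a real schema):

resource "junos_destroycommit" "destroycommit" {
  # Hypothetical resource: its delete step issues the commit RPC. The staged
  # resource below depends on it, so on 'terraform destroy' it is destroyed
  # last and commits the removals.
}

resource "junos_config" "vlan100" {
  # ... arguments staged into the remote ephemeral candidate store ...
  depends_on = [junos_destroycommit.destroycommit]
}

resource "junos_commit" "commit" {
  # Hypothetical resource: its create step issues the commit RPC. It depends
  # on the staged resource, so on 'terraform apply' it is created last.
  depends_on = [junos_config.vlan100]
}

# For updates to junos_config.vlan100, the commit currently has to be forced
# by hand:
#   terraform taint junos_commit.commit
#   terraform apply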

My proposal is a feature called auto_taint, exposed through HCL and used in the same manner as depends_on.

resource "provider" "resource_name_1" {
  resource = "name1"
  key1 = "thing"
  auto_taint = [provider.resource_name_2]
}

auto_taint would taint the resource whenever any of the referenced resources change.

In the networking ecosystem, any network operating system that operates with the notion of a commit would benefit from this. It would work at both the resource and module level, similar to depends_on.
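The module-level form could follow the same shape as depends_on; again, the argument and everything referenced here is hypothetical:

module "vlans" {
  source = "./modules/vlans"

  # Hypothetical argument: taint every resource inside this module whenever
  # any of the referenced resources change.
  auto_taint = [provider.resource_name_2, module.interfaces]
}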

Happy to provide more details and I hope this is clear.

For transparency, I work for Juniper Networks and I'm talking about the Junos and Junos Evolved operating systems. However, Palo Alto and some Cisco operating systems have the same requirement.

apparentlymart commented 2 years ago

Thanks for sharing this use-case, @davedotdev!

I think you might be describing the same underlying need that we were discussing earlier over in #8099. Would you agree?

If so: since we were explicitly using that issue to gather links from mentions elsewhere, this comment will already achieve that, but we can also think about whether this issue represents a distinct use-case or a restatement of the same one, and thus whether to keep this issue open or incorporate your new example into the existing issue.

Thanks again!

apparentlymart commented 2 years ago

There's also some relevant discussion on this in #3716, where I was navigating a similar design challenge for a provider I was trying to develop as an outside contributor, before I joined HashiCorp.

Although that issue is long closed, I mention it here because it contains some context on the challenges of representing APIs that have an explicit "commit" action separate from staging the actual changes to objects. It might be helpful to refer back to it if someone wanted to think about other ways Terraform could address this situation more explicitly in its workflow, rather than requiring weird extra arguments to work around the incompatible lifecycle.

davedotdev commented 2 years ago

Thanks for coming back so quickly @apparentlymart! Much appreciated.

Wow, some of those threads are huge! Thanks for the information though; it was great to read. #8099 seems to cover it quite well. I read #3716, and one issue I have with encapsulating the commit logic in updates is that there could be more than one resource update with dependencies on the remote data store, which will error if multiple commits are sent out. Eventually the remote state would converge (assuming the resource content is accurate), but it would look messy from a logging perspective, and I've tried hard to avoid 'normal errors'. Also, network operating systems with large configurations can take an eternity to converge (post commit), so multiple convergence attempts should be avoided.

I believe the networking industry would benefit hugely from this use case and it might be worth polling other vendors. I can officially speak on behalf of Juniper Networks and state it would absolutely benefit all of our customers using Terraform (I'm driving Juniper's Terraform work currently).

Just to explain the issue a little more from my perspective, here is a blog post I published earlier. Other vendors I know of are struggling with the incompatible lifecycle between Terraform and remote commit-based data-store convergence. It's all relatively straightforward, but it might help shed light on how we're approaching the coupling of Terraform to a commit-based transaction system.

apparentlymart commented 2 years ago

Thanks for sharing that blog post, @davedotdev! It seems like you've encountered many of the same challenges I ran into with the Fastly provider, though your solution seems to have ended up in a more favorable place than a single giant resource type representing the entire surface of the API. :disappointed_relieved:

We've attempted to design around this requirement a few times, but haven't yet landed on a solution we liked enough to move forward with. I've not thought about it for a while though, so I have some distance from those previous efforts and don't have all of the context loaded in my brain right now.

I believe the two main semi-requirements we were imposing on ourselves roughly correspond to the two concerns below.

It sounds like the solution you've adopted works around Terraform's unawareness of this workflow by having the user explain the required ordering to Terraform using depends_on. That does seem like a fine approach for today's Terraform, but I'd love to find an approach that leaves less room for user error.

The other concern, which this issue was discussing, is that updates also require committing, and there isn't really any great way to fake that using resource types. The replace_on_change idea discussed in that other issue, or the auto_taint idea you described here (which I think are different names for essentially the same thing), is in a similar vein to the explicit depends_on workaround: you can achieve a correct order of operations, but the user needs to write the configuration just so for it to actually work, and likely won't get good feedback if they miss something.
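To sketch what I mean by "just so", using the same made-up resource names as above: the workaround only does its job if every managed resource is listed, and nothing warns the user about a missing reference.

resource "junos_commit" "commit" {
  depends_on = [junos_config.vlan100, junos_config.vlan200]

  # Hypothetical forced-replace argument: any change to the listed resources
  # would taint this resource so the commit runs again on the next apply.
  auto_taint = [junos_config.vlan100, junos_config.vlan200]

  # If junos_config.vlan300 is added later but not listed here, its change is
  # staged remotely but never committed, and Terraform still reports success.
}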

I want to be up front with you that the stuff I've discussed above is something we'd previously put on the back-burner because there wasn't a clear path forward, and we decided to focus on other problems where it was clearer what was needed. For that reason, we likely won't be able to dive right into addressing this, but if you'd be willing, I'd love to adjust this issue to represent having a first-class modelling of "commit" actions in Terraform, and leave the existing #8099 to represent the "forced replace by configuration" use-case.

bflad commented 2 years ago

A potential alternative in this case may be to offer providers a "CloseProvider" RPC, which would be a node executed after all other nodes but prior to the StopProvider RPC node used for exiting the provider process. This would not be designed to solve for partial/progressive apply functionality, batching, or generic transaction/grouping handling in configurations, but it could offer an interim solution where the provider could inject some final logic (e.g. committing or actively closing a session) without requiring additional configuration and likely without interfering with those other potential future capabilities. There is a lot to consider in the cases of no-operation plans and error handling though.

See also: https://github.com/hashicorp/terraform-plugin-sdk/issues/63

My team was planning to discuss a better-defined proposal on this soon, spanning the CLI, protocol, and provider frameworks.

Aside: one caveat (of many) this type of alternative wouldn't account for is giving providers the direct ability to properly "start" a commit/transaction during an apply, if necessary. The ConfigureProvider RPC node is there and available for providers to implement logic, but it is executed often and outside of applies with actual changes (e.g. plan, refresh), so it is likely not a good candidate. Having yet more RPCs that are executed at the proper times during certain graph walks likely falls under the broader design work mentioned previously.

davedotdev commented 2 years ago

Thanks @bflad for the extra info.

Knowing the CloseProvider RPC is available is great for gracefully closing transport sessions (as is StopProvider), but as you say, it doesn't really help with any other use case around committing a transaction.

What would be interesting, however, is if you could pass the Terraform operation as metadata into the CloseProvider RPC. That way the provider logic could understand whether the operation was an apply/update/destroy etc. and act accordingly. Having access to the Terraform operation would allow us to handle commits with CloseProvider entirely.

In the apply workflow, the module(s) would get executed, with the resources having been constructed in the ephemeral remote transaction data store. Once the remote resources were read back and verified as representative of local state, the CloseProvider RPC would be triggered, and the provider would see it's the end of an apply (thanks to the TF operation data) and issue a commit.

In the destroy workflow, the resources inside the module would get destroyed remotely and the CloseProvider RPC would tell the provider it's the end of a destroy and to commit.

In the update workflow, any resources getting modified or deleted would be handled appropriately and the CloseProvider RPC would trigger a commit at the end of mutation.

That would mean we could remove the explicit commit handling from HCL entirely and concentrate purely on the resource logic, which seems like a great solution for a Terraform cycle. The only thought is that if there are multiple invocations of different modules and different providers against the same remote data store, the provider designer would need to implement the ability to turn off commits with a simple key in the provider config, so that issuing the commit is down to the HCL creator/operator/designer. That's not a big deal (I think?). For illustration, assuming a simple boolean key (the provider name, argument names, and key are all made up), see the sketch below.
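provider "junos" {
  host = "192.0.2.1"

  # Hypothetical key: when several modules or provider configurations target
  # the same device, turn off the automatic commit so that exactly one place
  # in the configuration is responsible for issuing it.
  commit_on_close = false
}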