hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Why dynamic providers would be such an important feature at scale. #31069

Closed gtmtech closed 2 years ago

gtmtech commented 2 years ago

Current Terraform Version

All, incl. latest

Background

Apologies for the rather large writeup - but thanks for taking the time to read it!

I work on an enormous hybrid cloud platform: it consists of 3 clouds, hundreds of accounts/projects, multiple environments, regions and so on. Some simple terraform installations would not need a solution in this space, but if you are a B2B provider, or an internal service provider to a very large company (e.g. 50,000+ employees), you are likely to hit the difficulty of how to structure all of your terraform in a way that scales well.

Terraform best practice dictates that a single terraform run (and thus state file) should manage a moderate, but not excessive number of resources - perhaps 100 is a reasonable amount of terraform resources per state file. At some point as you scale out, you have to make decisions on how to split resources across multiple terraform state files.

In a large multi-account AWS setup for example, you might make a reasonable decision to start splitting state across account boundaries, especially if a lot of AWS accounts are very similar and have similar resources set up (cloudtrail, networks, configs, iam etc.)

This also seems reasonable because, in order to terraform across 2 accounts in AWS, you need 2 providers. A provider is intimately tied to a set of credentials and a context with which to access the AWS API, and that set of credentials or that context stipulates the target AWS account you are terraforming in.

Providers support partial parameterisation, so you can inject things like the role_arn or a credential, giving you parameterised code that can be run against each of your AWS accounts, resulting in a statefile per account. This is the typical way I have seen companies use terraform through my client engagements.
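As a sketch of what that looks like (the variable names are illustrative, not from the original post), a partially parameterised AWS provider might read:

```hcl
# Partially parameterised AWS provider: the same configuration runs
# against any account by injecting a different role_arn per run.
provider "aws" {
  region = var.region

  assume_role {
    role_arn = var.role_arn # e.g. a terraform role in the target account
  }
}
```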

As that scales up, though, you get further problems. Instead of 2 accounts, suppose you are terraforming 1000. That means 1000 terraform runs, and that means you need an automated orchestrator to keep them all in sync, as it's too big a task for a human - step in things like terragrunt, or some enterprise offerings, to help out.

However, whilst those tools sort out running lots of terraform runs, another problem starts to creep in. Developers tend to work on features, not on accounts, and terraform itself may not run completely as desired across all 1000 accounts due to engineer error, network timeouts, AWS API rejections, or race conditions. Some AWS resources take a very long time to instantiate and change - such as AWS Microsoft Active Directory at around 50 minutes. During such a change, all 1000 accounts are potentially "locked out" from other feature development. In a large team of engineers with a lot of features to change, waiting hours for terraform runs to finish is not great.

So you might think to split certain products out from others and create separate runs for them, so they don't lock out the entire state for everyone else whilst updating. Now instead of 1000 statefiles, you have 2000, and then 4000, and so on.

For each split, you may well need to introduce dependencies in your orchestrator to make sure that X happens before Y - e.g. the IAM permissions are set up before the resources that need them - or the networks are set up before the loadbalancers.

Pretty soon, the orchestrator's dependency graph is also creating a huge workflow that takes hours to resolve. So if my CI system has to run 8000 terraform runs, and there is an orchestrator dependency chain, even with some very well provisioned CI servers I may be waiting hours for a run. In a large team and estate, this starts to cripple your productivity.

I was thinking about how best to solve some of these issues when I realised that the main reason the design choices had led down this path was that there is no real way within terraform to intelligently and dynamically operate across a large number of AWS accounts. And the reason for that is that there is no support for dynamic providers.

Whilst providers can reference attributes, the number of providers is always fixed in the terraform configuration, and this means that if you want to operate on 1000 accounts in a single terraform run, you will need 1000 provider blocks. Or if you want to operate on each account using 2 IAM roles, you will need 2000 provider blocks.

Managing these provider blocks could be done by simply hardcoding them to each AWS account id or each role id, and maintaining a lot of them in your repository. For example, you could have a provider.account1.tf for each account, and in each specify 2 provider blocks with hardcoded values for the roles you want to use in the respective account.

But these providers need passing through to the modules in the correct way, so you also need a module block that is going to call a module with a different set of providers for each account as well. And as with every module, you need to pass through all the variables needed too.
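A sketch of this pattern (the account ids, role names and module path are invented for illustration): each per-account file carries hardcoded, aliased provider blocks, plus a module block wiring them through:

```hcl
# provider.account1.tf -- one file like this per account
provider "aws" {
  alias = "account1_admin"
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/admin"
  }
}

provider "aws" {
  alias = "account1_readonly"
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/readonly"
  }
}

module "account1_cloudtrail" {
  source = "./modules/cloudtrail"

  # The module must declare aws.readonly via configuration_aliases
  # in its required_providers block for this mapping to work.
  providers = {
    aws          = aws.account1_admin
    aws.readonly = aws.account1_readonly
  }

  # ...plus every variable the module needs, repeated for each account
}
```

Because each account's resources hang off their own module block, a single account can then be targeted with e.g. `terraform apply -target=module.account1_cloudtrail`.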

This works; it's just a lot of files and blocks to maintain. It does, however, allow a different model of operating at scale: terraform now does work across all accounts in a single terraform run, and that run can be aligned to a specific feature. For example, you could have a "cloudtrail" module which sets up cloudtrail in every account (in the days before the CloudTrail AWS Organizations feature, as an example). You might want these 1000 cloudtrails stored in a bucket in a different account too, so your single terraform run/statefile contains the code to create the cloudtrail bucket in some audit account with associated policies, and then create all the cloudtrails across all accounts, feeding them into the bucket - all as one feature.

This would seem pretty neat! Now with statefiles split along feature lines, because developers tend to naturally work on features and not on accounts, this aligns the developer iteration with the code iteration. Different features can be worked on in isolation (just as different developers work on them in isolation).

Onboarding new accounts and offboarding old ones is a little trickier, as every feature needs running to onboard/offboard them in the respective account, but since each account's resources live in their own module, targeted runs with -target make this easy, so that works too.

It's a bit ugly, though, to maintain 1000s of provider blocks and associated module blocks, and the interactions between them (the dependencies that may be required between modules - like, in the case above, between the module managing the account that contains the cloudtrail bucket and the modules managing the accounts that contain the cloudtrails).

It would be a very nice feature to be able to iterate through a list of accounts and generate provider blocks and associated module blocks, which means provider{} and module{} should both support for_each and all the supporting ecosystem. Currently only module{} supports it.
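For illustration only - this is hypothetical syntax that terraform does not support today - the wished-for feature might read something like:

```hcl
# HYPOTHETICAL: for_each on provider blocks is not supported by Terraform.
provider "aws" {
  for_each = var.account_ids # e.g. { acc1 = "111111111111", ... }
  alias    = each.key

  assume_role {
    role_arn = "arn:aws:iam::${each.value}:role/admin"
  }
}

module "cloudtrail" {
  for_each = var.account_ids
  source   = "./modules/cloudtrail"

  providers = {
    aws = aws[each.key] # hypothetical dynamic provider reference
  }
}
```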

I managed to get such a system up and running using terragrunt doing the dynamic generation of provider blocks and module blocks - I put a POC here if anyone is interested: https://github.com/gtmtechltd/terragrunt-poc - it works and I was surprised at how much I loved being able to terraform a multi-account feature with different aspects in different accounts in one terraform run.

However, I turn up my nose a little at such metaprogramming - programs that write programs - it's all just a bit... hard to read. I'd much rather this was supported directly in terraform. Also, terragrunt doesn't really have support for datasources itself, and the generation phase has to come before any terraform runs happen, so you can't do something even cooler: query the aws_organizations object as a datasource, get all the accounts, and then just create all the dynamic providers from that and apply your stuff everywhere.

I appreciate 99% of your userbase are probably just terraforming a small infra in a dev/stage/prod setup - and we do have all sorts of terraform enterprisey things going on - but I thought it would be good to write up, as a general feature request, the reasons why I think dynamic providers would really help out at scale and offer some genuinely great alternatives for organising workloads in ways which are small, isolated and aligned with developer workflow.

If the terraform team have any other ideas and best practices about how to divide resources along statefile boundaries which lend themselves well to operating a large cloud estate with a large team at scale, I'd be really interested to hear from experience in the field.

Attempted Solutions

See https://github.com/gtmtechltd/terragrunt-poc for a workaround using terragrunt

Proposal

Allow provider blocks to support for_each, and allow them to be created dynamically from other data structures (which could be just vanilla variables rather than derived from datasources, but the icing on the cake would be if they could be made from the output of datasources too).
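As a hypothetical sketch of the icing-on-the-cake case (aws_organizations_organization is a real AWS provider data source, but this provider-from-data-source syntax does not exist today):

```hcl
# HYPOTHETICAL: deriving providers from a data source is not supported.
data "aws_organizations_organization" "org" {}

provider "aws" {
  for_each = toset(data.aws_organizations_organization.org.accounts[*].id)
  alias    = each.value

  assume_role {
    role_arn = "arn:aws:iam::${each.value}:role/terraform" # role name assumed
  }
}
```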

References

TBC

crw commented 2 years ago

Hi @gtmtech, first of all thank you very much for this write-up! It is very appreciated and has indeed been read by most of the team.

This seems to be a duplicate report to https://github.com/hashicorp/terraform/issues/19932 -- ideally, we would like for this to be posted as a comment to that issue and close this issue as a duplicate. The earlier issue is being tracked and has the historical discussion, and thus would be the ideal place to have this comment. Do you see any problems with that? Please let me know. Thanks again for this really thorough report, we do appreciate it.

gtmtech commented 2 years ago

Thanks, I've crossposted on #19932

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.