terraform-google-modules / terraform-google-project-factory

Creates an opinionated Google Cloud project by using Shared VPC, IAM, and Google Cloud APIs
https://registry.terraform.io/modules/terraform-google-modules/project-factory/google
Apache License 2.0
835 stars 538 forks source link

Null resources are an anti-pattern #373

Closed umairidris closed 4 years ago

umairidris commented 4 years ago

The preconditions script tries to preventitively check for pre-requisites to avoid running into errors around permissions, unenabled apis, etc.

However, this is an anti-pattern in Terraform. First, if a user does not have access to a resource or an API is not enabled, then the apply command will fail and the GCP api will return a clear error indicating so. This is how every resource and vast majority of modules work (including CFT's other modules).

For APIs, the required APIs can just be enabled automatically if they are free and disable_on_destroy set to false to avoid them being disabled in the future if the module is removed/changed.

The preconditions script also uses provisioners. Per Terraform docs the first heading on the provisioners page is "Provisioners are a Last Resort":

However, they also add a considerable amount of complexity and uncertainty to Terraform usage.

We do not recommend using provisioners for any of the use-cases described in the following sections. Even if your specific use-case is not described in the following sections, we still recommend attempting to solve it using other techniques first, and use provisioners only if there is no other option.

And none of the recommended use cases fit provisioners needs.

Preconditions script also uses Python, which adds extra complexity and dependencies in the underlying system. When something goes wrong it is not clear why without significant overhead to investigate (see https://github.com/terraform-google-modules/terraform-google-project-factory/issues/372).


Based on the arguments above, I recommend one of the following:


There is one genuinely very useful check done by the script currently, which is a check for access to the billing account being used. I have raised this issue with the provider team to see how this can be fixed in the provider directly.

morgante commented 4 years ago

I'm happy to have a discussion on this, but please do some background research on the many failures which have happened before.

See these issues to start: https://github.com/terraform-google-modules/terraform-google-project-factory/blob/master/docs/TROUBLESHOOTING.md#common-issues

Terraform will not fail cleanly in cases where preconditions are not met. You can easily be left in a catastrophic and unrecoverable state where the project is half-created but cannot be. (In fact, based on your email thread, I'm pretty sure you've already encountered this.)

Preconditions exist for a reason to prevent you from getting to that state. This is the only module we have it on because creating projects is the most likely to encounter these.

umairidris commented 4 years ago

Following up after offline discussion: It would be ideal if the google_project resource was able to use partial state on creation to get around a project only partially being created. If that is possible then preconditions script would no longer be needed. Need to follow up with provider team on this.

If partial state on creation is not possible then the next best alternative is to move as many checks to the provider as possible. Billing account is one thing that could be easily checked there before the project create call is made. However in this case the precondition script should still be kept as checks like API checks are not possible in the provider.

Whether or not a skip preconditions bool is added I will leave up to the maintainers, but it could be a good option to add with a default to false (i.e. run preconditions by default but allow user to explicitly bypass if they wish).

kuwas commented 4 years ago

I'm also of the belief that these gcloud cli and python calls cause more problems than they solve. Almost every time I run into an issue with the project-factory or kubernetes-engine modules, during create or destroy, its because of these null_resource local-execs.

It would make a lot more sense to add additional code into terraform-google-provider to solve these gaps in functionality. Terraform just does not work well when interacting with things outside of the provider.

umairidris commented 4 years ago

Agreed @kuwas. After discussing offline the current script is important in preventing irrecoverable states but hopefully we can resolve this in the provider. I have begun engaging the provider team on internal forums around this.

cc @danawillow for posterity

By gcloud I am guessing you mean the calls that get made to disable/delete the default compute engine service account? It is useful but I agree a non-gcloud solution would be ideal (perhaps along side the auto_create_network field in google_project we could add a new default_service_account field?). Though I am sure we would want to solve the existing issues with google_project before adding even more stuff in there :)

I do wish instead of being default to 'disable' that it defaulted to 'keep'. That would give users an option to use the functionality if needed while reducing the provisioners being run on the default path (even if 'disable' is a better default in principle). Not a severe issue though as it's easy enough to just set the field to whatever value you want.

morgante commented 4 years ago

We'd definitely prefer a non-gcloud approach, but unfortunately the Terraform resource module is ill-suited to many of the things we want to accomplish. The main cases we have where gcloud is needed:

  1. Deleting a resource which was automatically created outside Terraform (ex. deleting the default SA)
  2. Updating/editing a resource which was created outside Terraform (ex. updating DNS masq on Kubernetes)

That being said, I don't think it's impossible. There is some precedent for adding resources which partially update/manage a cloud resource (ex. iam_member resources), though this is a bit different. @danawillow I'd be curious on your thoughts here.

Unfortunately given the current state this is the best compromise we have.

danawillow commented 4 years ago

Both of those bullet points also have precedent in the provider. For deleting a resource created outside of Terraform, we have:

For updating/editing a resource which was created outside Terraform:

For GKE there are a few open issues that I know we're starting to take seriously and plan out how to make them happen safely: https://github.com/terraform-providers/terraform-provider-google/issues/1480 and https://github.com/terraform-providers/terraform-provider-google/issues/2373 are the ones that come to mind- I don't think there are any others open at the moment.

Every single one of these examples is something we would prefer to just exist naturally in the API, but it's not always a priority for the product teams. Granted, we'd also have to prioritize them against our other open FRs, but I wouldn't discount them completely from getting in the provider.

morgante commented 4 years ago

@danawillow Thanks for chiming in, glad to hear you're open to adding these to the provider directly.

One question: do you think it's better to add these as parameters on an existing resource (ex. google_project.auto_create_network) or to create a new resource? Thus far the pattern seems to be to include this behavior in the parent resource, but I fear that ends up with very complicated resources which have a higher potential for error. I wonder if isolating the behavior into dedicated resources could be better, even if it's not idiomatic Terraform.

I'll try to take a stab at converting these over this quarter, if you're open to it.

rileykarson commented 4 years ago

I wonder if isolating the behavior into dedicated resources could be better, even if it's not idiomatic Terraform.

My concern with an approach like that is that it's hard to build the right dependency graph. It's natural to reference a google_project directly, but with fine-grained resources for these kinds of changes users would really want to reference google_project_remove_default_service_account to ensure that unmanaged children were deleted.

morgante commented 4 years ago

My concern with an approach like that is that it's hard to build the right dependency graph. It's natural to reference a google_project directly, but with fine-grained resources for these kinds of changes users would really want to reference google_project_remove_default_service_account to ensure that unmanaged children were deleted.

That's true that it is slightly less intuitive but presumably any such resources will have a project_id attribute as well. In fact, I might argue it's actually clearer to depend on google_project_remove_default_service_account if you're doing something which can only work after that's happened.

umairidris commented 4 years ago

For longer term solutions is this something we can engage the core Terraform team in? I can't imagine multiple steps on creation being something unique to the google provider...

The problematic behaviour seems to be the 'tainting' of the resource when an error is returned in creation after the id is set. So as long as we can signal to Terraform after the id is set to stop tainting the resource on any following errors we should be OK.

danawillow commented 4 years ago

If/when the core/sdk team allows providers to emit warnings to users instead of the only states being success and failure, then that would be another way of solving the problem. From my understanding, that's a long ways out though.

We've engaged them in the past for other issues wrt taint (https://github.com/hashicorp/terraform/issues/21652), but that specific issue ended up getting worked around in a different way. We can certainly bring this to their attention though.

mhamrah commented 4 years ago

Just want to echo the removal of python and gcloud calls. We are ramping up on TF on GCP and would like to use Terraform Cloud, which has a limited runner where python and gcloud is not installed by default. Although this project handles gcloud to a certain extent, other projects in the terraform-google-modules org do not.

Possibly Hashicorp including these as a default in Terraform cloud would help, or a better set of lower-level apis (or in the Google provider) to accomplish goals in this repo without the need for gcloud would help.

And thanks for these patterns. It's very helpful for end users to see some form of best and recommended practices.

morgante commented 4 years ago

Thanks for the data point. We definitely want to support Terraform Cloud, either by building native resources or (as you've seen) referencing the gcloud module.

One ask: as you encounter other modules which don't work in Terraform Cloud please do file issues. At the very least, we can update them to use the gcloud module.

pascalwhoop commented 4 years ago

@morgante thanks for being open to this I assume rather large refactor.

As a data point, we're a company that would like to apply the cloud foundation toolkit to enable clients with a production grade GCP cloud quickly. This relies on a number of the modules in this github org, many of which again rely on this module. So it seems the project-factory is deeply nested in all of this.

I ran into several issues already:

So while I really love the CFT initiative and would like to fully support it, this special behavior seems like a painful crack in the otherwise elegantly provided best-practice IaC toolkit.

So solution oriented: Can you provide a plan on how to get the local exec code out of the module? I'm happy to help out on this but I guess it will require a decent amout of coordination from the google provider team, you guys etc. If you can provide an overview / plan and a set of tasks (maybe via a shared github issues tag), I'm happy to pick some of these up

umairidris commented 4 years ago

Since this issue was opened, I added a check for verifying billing account permission before a project is created in the Google provider itself. This was the main issue that I would run into that this module prevented.

GCP also now has an org policy to deprivelege the default service account.

Thus, I think the only remaining issue is a required API not being enabled.

Given the above, I think we can safely remove a lot of the current checks and recommend the native solutions. For the API checks, further investigation is needed to implement a native solution. Perhaps it makes sense to move the preconditions check out of the core project factory module into its own module that is optionally called from the main module and can be bypassed by a flag?

morgante commented 4 years ago

We could potentially move it into a separate module, but I think the priority should be on removing the need for it entirely.

pascalwhoop commented 4 years ago

@morgante can you point me to some tasks that I could pick up? I don’t have the broad vision of all dependencies yet but I’m happy to pick up a piece or two if you guide me.

morgante commented 4 years ago

I've opened #407 to track work needed to deprecate the script.