terraform-google-modules / terraform-example-foundation

Shows how the CFT modules can be composed to build a secure cloud foundation
https://cloud.google.com/architecture/security-foundations
Apache License 2.0
1.18k stars 702 forks

Provide mechanism for cleanup after failed deployment to enable re-deployment #1240

Open mromascanu123 opened 2 months ago

mromascanu123 commented 2 months ago

TL;DR

Need a way to selectively remove already-created resources, both from the environment and from tfstate, after a failed deployment. This has two sides:

For the first item, this would be a script replicating the manual steps below:

  1. In the asset manager, navigate to the folder to clean up and list the cloudresourcemanager.Project resources
  2. Extract the project_id for each of the projects to clean up
  3. For each project_id, run gcloud billing projects unlink
  4. For each project_id, identify and extract the "liens", if any: gcloud alpha resource-manager liens list --project
  5. Delete the liens: gcloud alpha resource-manager liens delete --project
  6. Delete the projects: gcloud projects delete --quiet
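Steps 3 to 6 above could be scripted roughly as follows. This is only a minimal dry-run sketch in Python, not part of the repository: the function names (`lien_list_command`, `cleanup_commands`, `render`) and the example project and lien IDs are hypothetical, and it assumes lien names are reported in the form `liens/<id>`. It only builds and prints the gcloud command lines so they can be reviewed before anything is run.

```python
import shlex


def lien_list_command(project_id):
    """Step 4: command that lists the lien names attached to one project."""
    return ["gcloud", "alpha", "resource-manager", "liens", "list",
            "--project", project_id, "--format", "value(name)"]


def cleanup_commands(project_id, lien_names=()):
    """Steps 3, 5 and 6 for one project, returned in execution order."""
    # Step 3: unlink the project from its billing account.
    commands = [["gcloud", "billing", "projects", "unlink", project_id]]
    # Step 5: delete each lien; a leading "liens/" prefix is assumed and stripped.
    for name in lien_names:
        lien_id = name.split("/", 1)[-1]
        commands.append(["gcloud", "alpha", "resource-manager", "liens",
                         "delete", lien_id, "--project", project_id])
    # Step 6: delete the project without an interactive prompt.
    commands.append(["gcloud", "projects", "delete", project_id, "--quiet"])
    return commands


def render(commands):
    """Format the commands as shell lines for review (dry run)."""
    return "\n".join(" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands)


if __name__ == "__main__":
    # Hypothetical project ID and lien name, for illustration only.
    print(render(cleanup_commands("prj-example-1234", ["liens/p123-labc"])))
```

Keeping the sketch to command generation means the destructive calls are only executed after a human has checked the list of project IDs.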

For the second item, add two options to tf-wrapper.sh:

Terraform Resources

N/A

Detailed design

See TL;DR above

Additional information

Related #1238

eeaton commented 1 month ago

Hi @mromascanu123 , can you help me understand more about your desired outcome, and in what scenario you want to use this script? Is it something that isn't already addressed by using the helper script to automate the manual steps of deploying with Cloud Build, then destroying?

While there are some flaky errors that require unpicking state, like #1187, they are specific enough that I don't recommend creating a script to directly modify terraform state. (Usually, modifying terraform state files by any method other than apply should be done only as a last resort.) Many other errors that might occur when a deployment fails require some other fix outside of the terraform state (modifying the IAM policy of the principal doing the deployment, removing a pre-existing org policy that blocks the deployment, or modifying the tf files), and then triggering the terraform apply again.

mromascanu123 commented 1 month ago

Hi @eeaton. One of the issues I've seen is the persistence of the resource "random_string" in the tfstate when the "plan" decides to delete and recreate a project following a failed deployment. The project-factory module will attempt to recreate the project, but the resulting ID will be the same as that of the just-deleted project, so the creation will obviously fail.

I was able to reproduce the issue at least once by aborting a tf-wrapper apply with a Ctrl-C then re-planning and re-applying. But a failed deployment may occur for many other reasons.

The resource "random_string" is used in many places to generate a suffix, and it persists in tfstate after the actual resource using it has been deleted. But afaik projects and KMS keystores persist after being deleted, and their names / IDs can't be reused.

In tf-wrapper.sh, "list" and "remove" operations could be added to list the resource IDs in tfstate and, e.g., selectively delete as necessary the random_string resources that served to generate IDs for resources that are deleted but still zombified.
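As a rough illustration of what such "list" and "remove" operations might do, here is a minimal Python sketch built on the existing `terraform state list` and `terraform state rm` subcommands. The helper names and the sample resource addresses are hypothetical; the sketch only filters the text output of `terraform state list` and builds the matching `terraform state rm` commands without running them.

```python
import re

# Matches addresses whose resource type is random_string, optionally indexed,
# e.g. "module.env.random_string.suffix" or "random_string.kms[0]".
_RANDOM_STRING = re.compile(r"(^|\.)random_string\.[^.\[]+(\[.*\])?$")


def random_string_addresses(state_list_output):
    """'list': pick the random_string addresses out of `terraform state list` output."""
    addresses = []
    for line in state_list_output.splitlines():
        address = line.strip()
        if address and _RANDOM_STRING.search(address):
            addresses.append(address)
    return addresses


def state_rm_commands(addresses):
    """'remove': build one `terraform state rm` command per selected address."""
    return [["terraform", "state", "rm", address] for address in addresses]


if __name__ == "__main__":
    # Hypothetical state listing, for illustration only.
    sample = """\
module.env.google_project.main
module.env.random_string.suffix
random_string.kms[0]
"""
    for command in state_rm_commands(random_string_addresses(sample)):
        print(" ".join(command))
```

Because removing entries from state is hard to undo, a real implementation would presumably back up the state (for example with `terraform state pull`) and ask for confirmation before running any of the generated commands.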

mromascanu123 commented 1 month ago

Another issue seen repeatedly is described in #1228: even when retrying with tf-wrapper plan and then apply appears to succeed, in reality it ends up corrupting tfstate for that particular job (e.g. development under 3-nhas). In this case, a safer recovery requires deleting all resources created by that job, along with the corresponding resource IDs in tfstate, and then redoing the plan and apply, obviously crossing fingers not to hit the snag again.