hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

prevent_destroy does not prevent destruction when removing a resource from the configuration #17599

Open samjgalbraith opened 6 years ago

samjgalbraith commented 6 years ago

The documentation for prevent_destroy says

This flag provides extra protection against the destruction of a given resource. When this is set to true, any plan that includes a destroy of this resource will return an error message

However, removing or commenting out a resource still destroys it with no warning, even though technically that is a plan that includes the destruction of the resource. What I expected was that the state file would remember that the resource was created with prevent_destroy and block its destruction even after it was no longer part of the parsed configuration. It would be reasonable to force the user to remove the prevent_destroy flag, run terraform apply, and only then remove the resource from the configuration, if they were absolutely sure they wanted to go ahead.

Terraform Version

Terraform v0.11.3
+ provider.aws v1.11.0

Terraform Configuration Files

locals {
  dynamo_db_mutex_attribute = "LockID"  # The name must be exactly this
}

provider "aws" {
  profile = "test"
  region = "ap-southeast-2"
}

resource "aws_s3_bucket" "test_environment_terraform_backend_storage" {
  region = "ap-southeast-2"
  bucket_prefix = "some-bucket-name-prefix-"  # S3 bucket names must be lowercase
  versioning {
    enabled = true
  }
  tags {
    Environment = "test"
  }
  # We explicitly prevent destruction using terraform. Remove this only if you really know what you're doing.
  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_dynamodb_table" "test_environment_terraform_backend_mutex" {
  "attribute" {
    name = "${local.dynamo_db_mutext_attribute}"
    type = "S"
  }
  hash_key = "${local.dynamo_db_mutext_attribute}"
  name = "test_environment_terraform_backend_mutex"
  read_capacity = 5
  write_capacity = 5
  # We explicitly prevent destruction using terraform. Remove this only if you really know what you're doing.
  lifecycle {
    prevent_destroy = true
  }
}

Expected Behavior

The resource is prevented from being destroyed because it was created with lifecycle { prevent_destroy = true }.

Actual Behavior

A resource with prevent_destroy set to true is destroyed when it is commented out or removed from the configuration.

Steps to Reproduce

  1. Enter the above Terraform configuration into a new module.
  2. Run terraform apply and answer yes.
  3. Comment out the S3 bucket resource.
  4. Run terraform apply again and answer yes.


apparentlymart commented 6 years ago

Hi @theflyingnerd! Thanks for pointing this out.

Currently removing a resource from configuration is treated as intent to remove the resource, since that's an explicit action on your part rather than it happening as a part of some other action, such as resource replacement or a whole-configuration destroy.

However, your idea of retaining this in the state and requiring multiple steps to actually destroy is an interesting one:

  1. Leave resource in config but remove prevent_destroy
  2. Run terraform apply to update the state to no longer have that flag set.
  3. Remove the resource from configuration entirely.
  4. Run terraform apply again to actually destroy it.

I'd like to think about this some more, because while this does seem safer, the flow could also be quite frustrating: if, for example, you've already committed a reorganization to your version control that includes removing a resource, Terraform would refuse to apply it until you revert and do steps 1 and 2 above.

In the meantime, we should at least update the documentation to be more explicit about how prevent_destroy behaves in this scenario.

samjgalbraith commented 6 years ago

Great, thanks for your quick and thorough reply @apparentlymart. I agree that a documentation update would be sufficient for now and that the proposed functionality could be left open for debate. My comments about this process were a mixture of how I might expect it to work without reading further documentation, and a proposal. It's influenced by the idea of AWS instance termination protection: that protection must be explicitly removed before the instance may be terminated, even if the instance is explicitly terminated by ID.

a-nldisr commented 6 years ago

Please see: https://github.com/hashicorp/terraform/issues/16392

dguisinger commented 6 years ago

Any further thoughts on this? I used Terraform to set up build pipelines, including AWS CodeCommit for storage of code. A mistake in Terraform could wipe out the entire contents of a Git repo. I too want a way to protect resources in case they get removed from a script.

tdmalone commented 6 years ago

Potential duplicate of #3468

roidelapluie commented 6 years ago

Strange that #17599 closes #3468 and not the other way around, but this is still a much-wanted feature.

JustinGrote commented 5 years ago

This was (and is) a major risk factor for accidental deletion, and was unacceptable for us when using Terraform in production.

This is our workaround, which runs in our Azure DevOps pipeline on every commit before applying. I'll see if I can get the code (a PowerShell Pester test) approved to be shared.

  1. Run terraform plan and get the list of resources to be destroyed.
  2. Review the previous state (retrieved from the previous commit) for any to-be-destroyed resources that had prevent_destroy set; if any are found, fail the test and stop the pipeline.
  3. To fix, two commits are required: one to remove prevent_destroy (which doesn't trigger a resource destroy and thus doesn't fail the test), and then one to remove the resource from the config.

In addition to this, all production changes have an approval step that presents the terraform apply changes very clearly and has to be approved by all the team members.

Zeal0us commented 5 years ago

I currently don't intend to manage certain things with Terraform because of this. As part of deploying new code (e.g. Lambda, ECS containers, API Gateway), my Terraform files are built dynamically, so although removing something from the .tf files looks like explicit approval, it could just as easily be a coding error.

As such, managing any kind of DynamoDB table or RDS instance is out of the question, and those will just have to be managed manually. While I'm mostly OK with this, it certainly would be nice to manage the entire application and still be able to sleep.

simlu commented 5 years ago

Any update on this? It seems like quite an important feature...

JustinGrote commented 5 years ago

I would imagine all their energies are on getting 0.12 out the door right now.


evmin02 commented 5 years ago

+2

CodeSwimBikeRun commented 5 years ago

omg. I'm looking at how to just prevent the helm provider from destroying a chart every time I deploy. I came here to see this amazing prevent_destroy feature that I want to try out, but judging from this, it doesn't even work?

You are willing to wipe out a whole database created with prevent_destroy flag on? What the heck?

tdmalone commented 5 years ago

@CodeSwimBikeRun Your comment before the edit was more correct - it works, but not if you remove the resource from your .tf file, because by doing so you're also no longer setting prevent_destroy. This issue is about potentially storing that flag in the state so that the instruction can still remain in effect (I'm not sure if this is exactly what you were after or not).

You can, however, set flags on some resources (such as AWS EC2 and RDS instances) to prevent deletion at the AWS API level, which will help you no matter what.
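
For illustration, the EC2-level guard looks roughly like this (a minimal sketch; the AMI ID is a placeholder):

resource "aws_instance" "critical" {
  ami           = "ami-12345678"  # placeholder AMI ID
  instance_type = "t3.micro"

  # AWS API-level termination protection: terminate calls fail
  # until this flag is cleared, no matter who issues them.
  disable_api_termination = true
}

RDS instances expose an analogous deletion_protection argument.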

Not sure the language is warranted.

nbering commented 5 years ago

@CodeSwimBikeRun This issue is about a specific case, in which the configuration block for the resource no longer exists. In that case, Terraform is following instructions in the strict sense: the resource is no longer configured with prevent_destroy.

Terraform interprets the configuration files as your desired state, and the state file (and/or refreshed in-memory state) as the current state. Destruction becomes a complex edge-case in this way because the absence of a resource implicitly means that you desired it to be destroyed, but in practice this may not always be true. There are other cases to watch out for which are similar, but are consequences of the basic design. For example, https://github.com/hashicorp/terraform/issues/13549.

The feature works - in the general sense - and is worth using. But it's still pretty important to examine the plan before applying changes. Declarative configuration systems in general are very powerful, but also come with an inherent risk of accidentally automating failure (see: Automation: Enabling Failure at Scale for an example outside Terraform). I personally find that Terraform's plan feature is actually superior to some other declarative configuration tools I've used in that regard. It will tell you, before changing anything at all, that it wants to destroy a resource.

rusik69 commented 5 years ago

we had part of our production deleted because of this :(

beanaroo commented 4 years ago

Storing the last-set prevent_destroy value in state would be beneficial, especially when generating tf plans programmatically.

bionicles commented 4 years ago

seems like a no-brainer safety-critical feature ... prevent destroy should work

aaronmell commented 4 years ago

It would be nice to have a feature like CloudFormation's Retain. If the resource is removed from the file, CF doesn't try to delete the underlying resource; it just stops managing it.
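
For what it's worth, later Terraform releases (1.7+) added removed blocks along exactly these lines; a minimal sketch, assuming a resource address of aws_s3_bucket.example:

removed {
  from = aws_s3_bucket.example

  lifecycle {
    # Remove the resource from state without destroying the real object.
    destroy = false
  }
}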

eric-spitler commented 4 years ago

I believe the correct way to accomplish this would be to use

terraform state rm <identifier>

or in your case

terraform state rm aws_s3_bucket.test_environment_terraform_backend_storage

This tells Terraform to stop managing the S3 bucket, but will not destroy it. After removing it from the state, you can safely comment out the resource from the configuration file.

That said, from an automation perspective you may not have the ability to manipulate the terraform.tfstate state file yourself. You would probably need your automation to run a script that performs the terraform state rm before terraform plan and terraform apply.

tonyjiang commented 4 years ago

@apparentlymart - in my opinion, @samjgalbraith's comment is not only interesting, it describes the only correct way to do it. Currently I'm not using Terraform to manage my data stores because of the lack of this feature (like some other folks on this thread).

erSitzt commented 4 years ago

@eric-spitler I agree on the correct way to do things, but it's the incorrect stuff that should not lead to catastrophic failure of infrastructure :) An easy way to declare my intent to never delete anything unless explicitly requested would be great.

brabster commented 4 years ago

A thought on the inconvenience aspects mentioned by @apparentlymart way up top - perhaps there could be an override? Like a command-line flag that globally disabled the prevent_destroy behaviour, or took a list of resources to disable the behaviour for? Something like that could prevent accidental deletion but still allow explicitly intentional deletion without significant inconvenience.

eschulma commented 4 years ago

I just discovered this and am fairly horrified -- I thought this was the entire point of prevent_destroy. Of course I protect at the API level too but this is just not acceptable as a risk. I will have to keep KMS and RDS out of Terraform, which is a real shame.

HavyRhl commented 4 years ago

I do agree; I would also like some way to REALLY prevent accidental destruction of critical resources. @apparentlymart's comment at the top actually mimics the behavior I expected when I first read about the prevent_destroy flag. I would really like to have something like that.

ruifrvaz commented 4 years ago

I'm having the same problem when using azurerm. My azurerm_app_service_custom_hostname_binding gets destroyed every time, which breaks the DNS provider proxy and forces me to disable and reconfigure it every time. I'll have to rely on a PowerShell script for now. It would be great if this feature were clarified or enforced. As it is, Terraform complains that it cannot follow "the plan", but doesn't explain why.

burizz commented 3 years ago

Although it seems to be the intended behaviour that removing a resource also removes the prevent_destroy flag and its delete protection, this is sometimes risky in the context of CI/CD pipelines. In some cases we keep a central RDS module that individual applications use, so each application has its own main.tf file where the module is called, and it sometimes happens that different teams make changes to the code in their repository and run pipelines. We would like to protect the production databases from being removed by accident, even if someone removes the whole RDS resource from their application.

eestolano commented 3 years ago

@burizz You might want to consider keeping the Terraform config for the RDS resources in a separate repo with its own Terraform state, and having the applications' Terraform config use remote state as necessary to access its outputs.

burizz commented 3 years ago

@eestolano Thanks, that's not a bad idea; I will keep it in mind for the future. At this stage, though, it would require significant refactoring of pipelines and testing, since the DBs are already used in production with Terraform as part of the main application state file.

va3093 commented 3 years ago

I ended up resolving this by just protecting my resources with IAM on AWS. For me, I wanted to prevent our user pool from being deleted.

resource "aws_iam_policy" "prevent_user_pool_deletion" {
  name = "prevent_user_pool_deletion"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
        "Effect": "Deny",
        "Action": [
          "cognito-idp:DeleteUserPool",
          "cognito-idp:DeleteUserPoolClient",
          "cognito-idp:DeleteUserPoolDomain"
        ],
        "Resource": [
            "arn:aws:cognito-idp:${var.region}:${local.account_number}:userpool/eu-west-3_xxxx"
        ]
    }
  ]
}
EOF
}

resource "aws_iam_group_policy_attachment" "attach" {
  group      = "admin"
  policy_arn = aws_iam_policy.prevent_user_pool_deletion.arn
}

With this in place I have less fear of accidentally deleting our user pool.

JustinGrote commented 3 years ago

Same in Azure with Azure Locks, but ideally a "global" terraform solution would be preferred.

burizz commented 3 years ago

@va3093 This is actually a very good idea. I will try it.

shermanericts commented 3 years ago

Hey guys. Sorry to bump this old thread. I read this last night. Great comments! Caught up.

I'm definitely one of those in the camp of "we want to run this for our GitHub repos, but fear accidental catastrophic deletions thereof".

Things I'm mulling over:

  1. Our GitHub personal access token is set up so it cannot delete repos. That's good. Hopefully GitHub can come up with some more granular scopes, akin to the above solutions that work at the IAM/roles/permissions level.
  2. Another trend I'm noticing is that for database resources (such as GCP sql_database_instance), new versions of the provider ship with deletion_protection=true by default, which is great! (See the sketch after this list.)
  3. I'm wondering if future versions of the Terraform GitHub provider could follow suit. For example, github_branch_protection could have a similar deletion_protection=true flag. This would help me sleep at night, as it makes the accidental catastrophe much harder, IMHO. Last I checked I didn't see this feature, but I could be missing it.
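
For reference, the GCP guard mentioned in item 2 looks roughly like this (a minimal sketch; the instance name is a placeholder):

resource "google_sql_database_instance" "main" {
  name             = "prod-instance"  # placeholder
  database_version = "POSTGRES_14"
  region           = "us-central1"

  settings {
    tier = "db-f1-micro"
  }

  # Provider-level guard: any plan that would delete this instance
  # fails while this is true (the default in newer provider versions).
  deletion_protection = true
}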

That all being said, message received above about vigorously testing and reviewing terraform plans in our pipelines. Also happy to file this feature request in the appropriate repo, as it may differ from the original intent of this ticket.

Apologies if this has already been solved/addressed somewhere else. Feel free to link me whenever appropriately.

Thanks!

pdesgarets commented 3 years ago

Hi @apparentlymart, this issue has been open for 3 years. What is missing to fix this? Do you need a PR? Or do you need some internal validation? If so, could you please advocate for prioritizing this? I see on forums/Reddit/etc. that the major barrier to adoption for IaC is the fear of seeing your prod environment wiped out. This is precisely what we are talking about here.

I personally believe that the 4-step process you described is the way to go. If you want to remove protected resources and you apply in CI, you should do it in two successive runs: the first to remove the protection, the second to perform the deletion.

I don't find @va3093's solution satisfactory, because if I accidentally delete the lines describing the aws_iam_group_policy_attachment in my configuration, the protection is lost. It could work if I set up the attachment manually, without describing it in the configuration. But then Infrastructure as Code is dead...

Thanks.

cmawhorter commented 3 years ago

This has bitten me too.

Something like a tombstone would fix this. Maybe something like:

tombstone resource "aws" "somethingimeantodelete" {} or resource "aws" "somethingimeantodelete" { tombstone = true }

If you remove a block and the resource still exists, Terraform should error. That is pretty much the inverse of trying to create a resource that was already created outside Terraform.

Then after you run that and it deletes the resources, you could remove the tombstone record without error.

No matter what, deleting resources without being explicitly told to do so is bad.

penfold commented 2 years ago

I'm looking at configuring Azure Front Door, which means I'll need to reference an Azure DNS zone. As I currently use the same domain (different sub-domains) for live and test environments, I run the risk of destroying the live DNS zone when I clean down a test environment.

Is there any thought on implementing the change that @apparentlymart suggested at the start of the thread?

nbering commented 2 years ago

@penfold Your case sounds like terraform_remote_state should be a suitable workaround. Or possibly a better implementation all-around depending on your constraints.

https://www.terraform.io/language/state/remote-state-data

You could put your shared resources in a separate state project to protect them from being clobbered by the dependent Terraform runs, while still being able to reference values from its resources across environments.
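
For illustration, a consuming configuration could read the shared project's outputs like this (a sketch; the backend type, storage settings, and the dns_zone_name output are placeholders):

data "terraform_remote_state" "shared_dns" {
  backend = "azurerm"  # assumes the shared state lives in an Azure storage backend
  config = {
    resource_group_name  = "tfstate-rg"      # placeholder
    storage_account_name = "tfstatestorage"  # placeholder
    container_name       = "tfstate"
    key                  = "shared-dns.tfstate"
  }
}

locals {
  # Reads an output assumed to be exported by the shared project.
  dns_zone_name = data.terraform_remote_state.shared_dns.outputs.dns_zone_name
}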

HariSekhon commented 2 years ago

This is still such a huge problem for everything containing state/data - from Databases to GitHub repos.

Terraform is simply not production-grade for these use cases until it records the prevent_destroy in the state file.

aton-elvisb commented 2 years ago

Hi @apparentlymart, as an alternative to the following workflow to remove a protected resource:

  1. Leave resource in config but remove prevent_destroy
  2. Run terraform apply to update the state to no longer have that flag set.
  3. Remove the resource from configuration entirely.
  4. Run terraform apply again to actually destroy it.

we could also think about an alternative workflow:

  1. Remove the resource from configuration entirely.
  2. Run terraform state allow-destroy <resource name> (a new command)
  3. Run terraform apply to actually destroy it.

The order of steps 1 and 2 does not matter. Step 2 removes the destroy prevention from the state.

If step 1 is not executed, so that the resource remains in the configuration, but step 2 is executed, then the terraform apply at step 3 simply reintroduces the protection.

m98 commented 2 years ago

I ended up resolving this by just protecting my resources with IAM on AWS. For me, I wanted to prevent our user pool from being deleted. [...] With this in place I have less fear of accidentally deleting our user pool.

I think the problem with this solution is that if the policy attachment is accidentally removed from the .tf config, then there is a chance of destroying the Cognito user pool resource. It's very similar to the prevent_destroy lifecycle, but with more code!

I think another way is to limit the IAM user that Terraform itself is using. You can add this policy to the user:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "NotAction": [
        "cognito-idp:AdminDeleteUser",
        "cognito-idp:DeleteGroup",
        "cognito-idp:AdminDeleteUserAttributes",
        "cognito-idp:DeleteUserPool",
        "cognito-idp:DeleteIdentityProvider",
        "cognito-idp:DeleteResourceServer",
        "cognito-idp:DeleteUserPoolClient",
        "cognito-identity:UnlinkDeveloperIdentity",
        "cognito-idp:DeleteUserPoolDomain",
        "cognito-identity:DeleteIdentityPool",
        "cognito-idp:DeleteUserAttributes",
        "cognito-identity:DeleteIdentities",
        "cognito-identity:UnlinkIdentity",
        "cognito-idp:DeleteUser"
      ],
      "Resource": ["*"]
    }
  ]
}

But anyway, I think Terraform itself should not delete any user pool that has at least one user in it. It does the same for S3 buckets, which can't be destroyed if they contain objects.
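
For reference, that S3 behavior is governed by the AWS provider's force_destroy argument, which defaults to false (bucket name below is a placeholder):

resource "aws_s3_bucket" "data" {
  bucket = "example-data-bucket"  # placeholder name

  # The default (false) makes terraform destroy fail if the bucket
  # still contains objects, acting as a last-resort data-loss guard.
  force_destroy = false
}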

lpil commented 2 years ago

Hi gang. We got bit by this yesterday. Is there a workaround yet?

koreyGambill commented 2 years ago

I'm trying to automate a CI/CD solution for Terraform, but am deeply concerned about auto-applying changes on merges to main because of this issue. If someone doesn't review a plan properly, they could cause a pretty drastic outage, and we could lose a lot of data. I suppose my real concern is managing anything with Terraform that contains persistent data and doesn't have a prevent-deletion flag available to set.

The best solution (workaround) I see in these comments is checking the previous Terraform state before deploying, as https://github.com/hashicorp/terraform/issues/17599#issuecomment-451616797 suggested, or perhaps not being very permissive with Terraform's IAM permissions. But I'd much rather have a first-class solution.

There are multiple good (and seemingly easy) solutions that could be implemented here. I liked the tombstone idea, or doing something like CloudFormation's Retain, or requiring that the lifecycle flag be removed in a previous apply, or another flag like terraform apply --allow-destroy that can destroy resources with that lifecycle set. I think we'd all love to see a native, intuitive way to prevent an accident from wiping out data.

Is HashiCorp not worried about this issue?

HariSekhon commented 2 years ago

@koreyGambill surprisingly dangerous, isn't it?

I've automated Terraform CI/CD for 2 companies in 2 different ways for you to consider:

  1. Auto-apply on merge to main using my GitHub Actions reusable workflow: the plan is pasted into the pull request, but once you merge that PR, that's it, boom. This is probably the one that scares you more.
  2. Jenkins runs the plan from the main branch and then uses an interactive approval before doing the apply. This uses a saved plan, so you know exactly what Terraform is going to apply if you approve the final pipeline stage, and it applies only that.

Even with the second, stronger option, there is still the possibility of human error: somebody approves without checking properly and it destroys something critical.

For such critical data resources, you might have to keep them outside of Terraform until this issue is solved.

Harrisonbro commented 1 year ago

Just adding a +1 to this. I'm really struggling to recommend Terraform to our clients for this reason.

simonweil commented 1 year ago

@Harrisonbro terraform is not perfect, but what is the better alternative that has no downsides of its own?

Harrisonbro commented 1 year ago

@Harrisonbro terraform is not perfect, but what is the better alternative that has no downsides of its own?

Nothing has no downsides, of course. We do recommend Terraform for stateless resources that can be recreated, but for databases and such we have to recommend clients use other approaches (often just manual configuration), because they can't accept the risk of a Git commit wiping out data.

whiskeysierra commented 1 year ago

The way we deal with this in our company is twofold:

  1. Locks - depends on your cloud provider. Azure has management locks, which we put on databases and database servers, preventing any delete, not just via Terraform (see the sketch after this list).
  2. Lack of permissions - we restrict the identity that Terraform assumes so that it does not have the permissions needed to delete.
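
For illustration, such a lock can itself be declared in Terraform (a sketch; the lock name is a placeholder, and the scope assumes a database server resource named "main" elsewhere in the configuration):

resource "azurerm_management_lock" "db_no_delete" {
  name       = "prevent-db-deletion"  # placeholder
  scope      = azurerm_postgresql_server.main.id  # assumed resource
  lock_level = "CanNotDelete"
  notes      = "Protects the production database server from deletion."
}

Note that if the lock is managed in the same configuration as the database, removing both from the config in one commit still destroys them (the lock is deleted first), so a lock created out-of-band gives stronger protection.
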
maxb commented 1 year ago

There are undoubtedly many possible workarounds. Indeed, where I work, we've resorted to patching terraform-provider-github to embed the feature at the provider level specifically for github_repo resources.

However, given the outpouring of supportive sentiment, @apparentlymart, could you give us an update on your comment from 2018?

simonweil commented 1 year ago

I use delete protection and final snapshots; this prevents the DB from being deleted by mistake...
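
On AWS RDS, for example, that combination looks roughly like this (an abbreviated sketch; the identifiers and the password variable are placeholders):

resource "aws_db_instance" "prod" {
  identifier        = "prod-db"  # placeholder
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app_admin"      # placeholder
  password          = var.db_password  # assumes a variable defined elsewhere

  # API-level guard: delete calls fail while this is true.
  deletion_protection = true

  # If the instance is ever actually deleted, take a final snapshot first.
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-db-final"
}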

ben-z commented 1 year ago

There are undoubtedly many possible workarounds. Indeed, where I work, we've resorted to patching terraform-provider-github to embed the feature at the provider level specifically for github_repo resources.

@maxb any chance the patch to terraform-provider-github is open source? My org has the same use case and we don't want to reinvent the wheel.

maxb commented 1 year ago

It is not (and I no longer work at that employer). However, the modern github_repository resource has a setting, archive_on_destroy, which can be turned on to provide a different safety net against accidental repository deletion.
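
For illustration (the repository name is a placeholder):

resource "github_repository" "app" {
  name        = "example-app"  # placeholder
  description = "Application source"

  # On destroy (including removal from the configuration), archive the
  # repository instead of deleting it, so its contents stay recoverable.
  archive_on_destroy = true
}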