hashicorp / terraform-provider-helm

Terraform Helm provider
https://www.terraform.io/docs/providers/helm/
Mozilla Public License 2.0

When a release fails to deploy and a plan is run, the resource is tainted #620

Open rmb938 opened 3 years ago

rmb938 commented 3 years ago

Terraform Version and Provider Version

Terraform v0.12.20

Provider Version

1.3.2

Affected Resource(s)

  • helm_release

Terraform Configuration Files

resource "helm_release" "pep-kafka-operator" {
  name       = "pep-kafka-operator"
  repository = var.helm_repo
  chart      = "pep-kafka-operator"
  version    = "0.1.1"
  namespace  = var.namespace
}

Expected Behavior

The resource should not be tainted. Some Helm charts contain CRDs, and if the chart fails to deploy during an upgrade, the destroy will delete those CRDs. This is very destructive, as deleting a CRD destroys all of its resources.

This behavior seems to be different from the old Helm 2 version of this provider. On failure it didn't taint the resource; instead it tried to fix it by re-applying the Helm chart.

Actual Behavior

Terraform Plan shows

  # module.kubernetes.module.ctc.module.kafka-operator.helm_release.pep-kafka-operator is tainted, so must be replaced
-/+ resource "helm_release" "pep-kafka-operator" {
        atomic                     = false
        chart                      = "pep-kafka-operator"
        cleanup_on_fail            = false
        create_namespace           = false
        dependency_update          = false
        disable_crd_hooks          = false
        disable_openapi_validation = false
        disable_webhooks           = false
        force_update               = false
      ~ id                         = "pep-kafka-operator" -> (known after apply)
        lint                       = false
        max_history                = 0
...
      ~ status                     = "failed" -> "deployed"
...
...
Plan: 1 to add, 0 to change, 1 to destroy.

Steps to Reproduce

  1. Define Terraform configuration that deploys a Helm chart that will fail to deploy, e.g. one with an invalid image.
  2. terraform apply
    • Wait for the apply to timeout
  3. terraform plan
  4. The resource will be tainted and have to be deleted and re-created

jseiser commented 2 years ago

Are there any workarounds for this?

For instance, the GitLab Helm chart creates a StatefulSet. If the chart was upgraded or a value was changed and the upgrade failed, I would in no way want to destroy and re-create my production deployment. I would expect to be able to roll back and re-deploy.

We use TF for everything here, so rolling out our chart deployments with TF made sense, but if we have no way to recover a failed deployment I am really hesitant to go this route. I'm much more familiar with Flux or helmfile for managing deployments.

kdmenghani13 commented 2 years ago

Is there any fix for this?

We are using version 2.2.0 of the provider. Every time the task fails for any reason, Terraform tries to destroy and recreate the deployment, which is not feasible in a production environment. Thanks

kdmenghani13 commented 2 years ago

@aareet this is critical for us please help out.

tomwidmer commented 2 years ago

Manually untainting the helm release before trying to redeploy is the current workaround: https://www.terraform.io/cli/commands/untaint
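As a concrete illustration of that workaround (the resource address below is taken from the plan output earlier in this thread; substitute your own):

```shell
# Clear the taint so Terraform upgrades the release in place
# instead of destroying and re-creating it:
terraform untaint 'module.kubernetes.module.ctc.module.kafka-operator.helm_release.pep-kafka-operator'

# Then re-apply; the provider attempts a normal upgrade of the release:
terraform apply
```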

github-actions[bot] commented 1 year ago

Marking this issue as stale due to inactivity. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. This helps our maintainers find and focus on the active issues. Maintainers may also remove the stale label at their discretion. Thank you!

markmsmith commented 1 year ago

Still needed

dpr-dev commented 10 months ago

any updates here ?

wondersd commented 5 months ago

Running into this myself and found this issue. In my case, chart installations whose initial requirements trigger a scale-up of the underlying cluster resources can take a variable and unpredictable amount of time to complete. When they take longer than the configured timeout, we end up in this position.

It's been a while since I've done any Terraform provider development, but if memory serves this behavior comes from Terraform core. When a resource fails its initial create, Terraform takes what it perceives as the safest action and marks it tainted to attempt a clean recreate. In many cases this makes sense, even for Helm specifically, since Helm has hooks that only run on first install, so recreating runs them again from what is assumed to be a "clean" starting point.

This behavior should be avoidable if the provider implementation does a partial save of the resource before failing out of the create function. I think this might be as simple as calling d.SetId(...) once the release is created but before it has been confirmed successful. I think there is also a specific "partial" save concept that was meant to be used for resources that have to be implemented as combinations of multiple API interactions with the remote service.

If this is a viable approach, would probably expect to have it optional per helm_release resource as the current behavior may be required for charts that use *-install hooks.

Hopefully this helps move this issue along as the current behavior can definitely be counterproductive in many cases.