hashicorp / terraform-provider-helm

Terraform Helm provider
https://www.terraform.io/docs/providers/helm/
Mozilla Public License 2.0

Release 2.10.0 appears to be destructive and not backwards compatible #1285

Closed lorelei-rupp-imprivata closed 11 months ago

lorelei-rupp-imprivata commented 1 year ago

We were looking to upgrade to the latest provider, but noticed that with the 2.10.0 release and above, all plans appear to be very destructive to helm resources already in production.

When updating our provider to 2.10.0, we see plans showing that every helm release needs to be replaced. The release notes (https://github.com/hashicorp/terraform-provider-helm/releases/tag/v2.10.0) do not mention anything of the sort.

Is this intentional? I cannot imagine we are the only ones impacted, and this prevents us from ever being able to upgrade. Is there a migration path so it's not destructive? Are there more details on what is going to happen?

This is what the plan now shows for example

  # helm_release.dex-pdb must be replaced
-/+ resource "helm_release" "dex-pdb" {
      ~ id                         = "dex-pdb" -> (known after apply)
      + manifest                   = (known after apply)
      ~ metadata                   = [
          - {
              - app_version = ""
              - chart       = "pod-disruption-budget"
              - name        = "dex-pdb"
              - namespace   = "argocd"
              - revision    = 1
              - values      = jsonencode(
                    {
                      - matchLabels         = {
                          - "app.kubernetes.io/component" = "dex-server"
                          - "app.kubernetes.io/name"      = "argocd-dex-server"
                        }
                      - name                = "dex-pdb"
                      - podDisruptionBudget = {
                          - minAvailable = 1
                        }
                    }
                )
              - version     = "1.0.0"
            },
        ] -> (known after apply)
        name                       = "dex-pdb"
      ~ namespace                  = "argocd" -> (known after apply) # forces replacement
        # (25 unchanged attributes hidden)
    }
......
Plan: 6 to add, 1 to change, 6 to destroy.

arybolovlev commented 1 year ago

Hi @lorelei-rupp-imprivata,

I don't see any disruptive changes introduced in v2.10.0. Could you please share more information to better understand the issue? Please fill in all fields in the issue template such as the current provider version, the resource config, etc.

Thanks.

lorelei-rupp-imprivata commented 1 year ago

Sure! @arybolovlev

Terraform v1.2.7 Helm Provider 2.10.0 Kubernetes 1.24

When I run my argocd module with helm provider 2.9.0 and no other changes, I see this in my plan. (We unfortunately have a perpetual/infinite diff because of the credentials/secrets that are set.)

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # helm_release.argocd will be updated in-place
  ~ resource "helm_release" "argocd" {
        id                         = "argocd"
        name                       = "argocd"
      ~ values                     = [
          - (sensitive),
          + (sensitive),
        ]
        # (27 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

When I just switch the provider to 2.10.0, I see tons of changes on all the helm releases we have in this module. Here is just a snippet, as the full plan is hundreds of lines:

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # helm_release.argocd will be updated in-place
  ~ resource "helm_release" "argocd" {
        id                         = "argocd"
      ~ metadata                   = [
          - {
              - app_version = "v2.8.4"
              - chart       = "argo-cd"
              - name        = "argocd"
              - namespace   = "argocd"
              - revision    = 99
              - values      = jsonencode(
ACTUALLY PRINTS SECRETS OUT IN PLAINTEXT HERE
.........
                )
              - version     = "5.46.8"
            },
        ] -> (known after apply)
        name                       = "argocd"
      ~ values                     = [
          - (sensitive),
          + (sensitive),
        ]
        # (26 unchanged attributes hidden)
    }
..........
  # helm_release.dex-pdb must be replaced
-/+ resource "helm_release" "dex-pdb" {
      ~ id                         = "dex-pdb" -> (known after apply)
      + manifest                   = (known after apply)
      ~ metadata                   = [
          - {
              - app_version = ""
              - chart       = "pod-disruption-budget"
              - name        = "dex-pdb"
              - namespace   = "argocd"
              - revision    = 1
              - values      = jsonencode(
                    {
                      - matchLabels         = {
                          - "app.kubernetes.io/component" = "dex-server"
                          - "app.kubernetes.io/name"      = "argocd-dex-server"
                        }
                      - name                = "dex-pdb"
                      - podDisruptionBudget = {
                          - minAvailable = 1
                        }
                    }
                )
              - version     = "1.0.0"
            },
        ] -> (known after apply)
        name                       = "dex-pdb"
      ~ namespace                  = "argocd" -> (known after apply) # forces replacement
        # (25 unchanged attributes hidden)
    }

Plan: 6 to add, 1 to change, 6 to destroy.

The actual module code for the dex-pdb helm release looks like this:

resource "helm_release" "dex-pdb" {
  depends_on = [
    helm_release.argocd
  ]
  create_namespace = false
  chart            = "../common-helm-charts/pod-disruption-budget"
  name             = "dex-pdb"
  version          = local.pdb_chart_version
  namespace        = helm_release.argocd.metadata.0.namespace
  values = [
    yamlencode(local.chart_variables_dex-pdb)
  ]
}

It just uses a simple helm chart to install a PDB, nothing fancy, but I think it may be changing because it uses the argocd output as a dependency, and since that is changing it might be forcing this to change too.

The argocd helm release looks like this

resource "helm_release" "argocd" {

  name             = "argocd"
  create_namespace = true
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  version          = var.argocd_version
  namespace        = var.namespace
  values = [
    yamlencode(local.chart_variables)
  ]
}

It's using Argo's helm chart, passing in all the values overrides we do; some of them contain secret values.

I also just tried this on another module we have for another helm release, and it does not show anything destructive, but that one doesn't use secrets. Maybe it has to do with the sensitive data handling?

I am really scratching my head. Happy to provide more details if you need them, just let me know.

lorelei-rupp-imprivata commented 1 year ago

It appears it's recomputing the metadata, but nothing has changed. The 2.10.0 release notes mention recomputing the metadata, and that is what this is doing, but it is super destructive for us: it's going to destroy argo and recreate it, which is not something we can do. I don't understand why it's recomputing it if nothing has changed on our side.

lorelei-rupp-imprivata commented 1 year ago

Actually, this looks related to https://github.com/hashicorp/terraform-provider-helm/issues/1221 and https://github.com/hashicorp/terraform-provider-helm/issues/1150. I will also say that on 2.10.0 the plan now shows ALL the sensitive data, which it shouldn't (it used to not): when it shows it's going to be destructive and destroy/recreate argocd, it prints all the secrets out in plaintext.

arybolovlev commented 1 year ago

Hi @lorelei-rupp-imprivata,

Thank you for the detailed explanation and for sharing the code snippets.

The issue you encountered is indeed related to the changes introduced in v2.10.0. The metadata attribute gets recomputed when certain attributes change; in your case, it is values. Since metadata is a computed attribute, Terraform marks it as unknown. You can see this from the (known after apply) message. This is what happens with your argocd resource.

In your dex-pdb resource you refer to the argocd one via helm_release.argocd.metadata.0.namespace, and that builds a dependency. As we figured out above, the entire metadata will be recomputed, so the namespace attribute will be marked as unknown. It will become known only once the changes are applied to the argocd resource. That is why Terraform wants to replace the namespace attribute in the dex-pdb resource, even though it will end up with the same value. Changing the namespace attribute always causes resource replacement, since it has the ForceNew attribute set to true.

In your case, the way to address the issue is to change the referenced attribute from metadata.0.namespace to namespace. For example:

resource "helm_release" "dex-pdb" {
  ...
  namespace        = helm_release.argocd.namespace
  ...
}

In this case, you still keep the dependency, and these two resources will be applied sequentially.

I hope it helps.

Thanks!

lorelei-rupp-imprivata commented 1 year ago

@arybolovlev Thanks for the response! I will definitely update to use .namespace. I had originally thought it might not order things properly, because the namespace is computed before the resource is actually created and Terraform sometimes optimizes, but that is good to know. I will switch to this.

So for the argocd helm release issue, the problem is that values is NOT changing. We have a perpetual diff for some reason because of the sensitive data, so nothing is actually changing. This makes the perpetual diff even worse if we go to 2.10.0, because it's going to destroy/recreate the release. Why does helm think something is changing when it isn't? Is that an entirely separate bug, then?

Also, with 2.10.0 it prints out all the secret/sensitive information in plain text, when it didn't before. I think that warrants a release note too, because if we ran this in prod we would likely have to go rotate all our secrets, since they would have been dumped out in plain text. That seems like a bug as well?

arybolovlev commented 1 year ago

If I understand you right, you refer to the following change:

      ~ values                     = [
          - (sensitive),
          + (sensitive),
        ]

I need additional context here. From your code, I understand that values gets rendered from the local value chart_variables. It looks like one of the values there is marked as sensitive, which is why Terraform marks values as sensitive. I wonder if any value in chart_variables has a dynamic nature? For instance, if it gets a value from a data source.

I have tested it with the following code and don't observe a perpetual diff. I only get a diff when I update the variables:

variable "namespace" {
  default = "this"
}

variable "replicas" {
  default   = 3
  sensitive = true
}

variable "autoscaling" {
  default = true
}

locals {
  chart_variables = {
    "server": {
        "replicas": var.replicas,
        "autoscaling": {
            "enabled": var.autoscaling
        }
    }
  }
}

resource "helm_release" "argocd" {
  name             = "argocd"
  create_namespace = true
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  version          = "v5.51.0"
  namespace        = var.namespace
  values = [
    yamlencode(local.chart_variables)
  ]
}

lorelei-rupp-imprivata commented 1 year ago

Let me check on the local variable in use and get you some details here!

lorelei-rupp-imprivata commented 1 year ago

@arybolovlev Our local.chart_variables is really large, but after scrolling through it, here are the points of interest, i.e. things that were not just strings or variables passed through:

.......
      "clusterRoleRules" = {
        "enabled" = "true"
        "rules"   = yamldecode(file("${path.module}/templates/controller-clusterrole.yaml"))
      }
       .......
      "resource.inclusions" = file("${path.module}/templates/resource-inclusions.yaml")
        "url"                 = "https://${var.argocd_domain}"
        "dex.config"          = templatefile("${path.module}/templates/dex.tmpl", { argo_domain = var.argocd_domain })
        "ui.bannercontent"    = var.banner_content
        "ui.bannerurl"        = var.banner_url
      }
      "rbacConfig" = {
        "policy.csv"     = templatefile("${path.module}/templates/rbac.tmpl", { rbac_entries = var.argocd_rbac_entries })
        "policy.default" = "role:readonly"
        "scopes"         = "[groups]"
      }
....
      "service" = {
        "type" = "LoadBalancer"
        "annotations" = {
          "external-dns.alpha.kubernetes.io/hostname"                          = var.argocd_domain
          "service.beta.kubernetes.io/aws-load-balancer-extra-security-groups" = aws_security_group.allow_vpn.id
        }
        "loadBalancerSourceRanges" = [var.vpc_cidr_block]
      }

I can gist the entire thing if you are interested, but there are no data sources, and the only resource we use is a security group. We do a lot of templatefile()/file() calls and a yamldecode() on a file. It has been doing this perpetual diff for a while now; I honestly think it started when we made the variables sensitive.

arybolovlev commented 1 year ago

Thanks for sharing this. If there are no dynamic values, then I guess we have only one option left: there is a difference between the values in the state file and what is produced by yamlencode(local.chart_variables). Any change in the template files will cause this diff, even adding a single space character. Since it is marked as sensitive, there is no way to see this diff during the planning stage. I can propose some troubleshooting steps that should help identify why this diff happens.

I understand this is a production environment, so I would advise you to make a copy of your code and play with it locally in a different folder.

You can copy just the variables used in local.chart_variables and remove the sensitive attribute from them, then chart_variables itself and all the template files. You don't need the helm_release resource or any other resources. For external resources, like aws_security_group.allow_vpn.id, you can replace the reference with a variable or even hard-code it. The next step is to run the Terraform console and call the yamlencode() function. You need to do this in the same folder where you keep the copy of your code. Something like this:

$ terraform console
> yamlencode(local.chart_variables)
<<EOT
"rbac": |
  create:  true
"server":
  "autoscaling":
    "enabled": true
  "replicas": 3

EOT

Or, you can add an output and run terraform plan:

output "debug" {
  value = yamlencode(local.chart_variables)
}
$ terraform plan
...
Changes to Outputs:
  + debug = <<-EOT
        "rbac": |
          create:  true
        "server":
          "autoscaling":
            "enabled": true
          "replicas": 3
    EOT

The goal here is to get the YAML document that is rendered by yamlencode(). Then, you need to compare this output with the values in your original code. Since it is marked as sensitive, you won't be able to get it via terraform state show helm_release.argocd, but you can get it from the state file. For example:

            "values": [
              "\"rbac\": |\n  create:  true\n\"server\":\n  \"autoscaling\":\n    \"enabled\": true\n  \"replicas\": 3\n"
            ],
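As an alternative to copying the escaped string out of the raw file by hand, here is a small sketch (assuming python3 is available and the resource is the helm_release.argocd from the examples above) that pulls the same string out of a locally saved state and decodes the \n escapes via JSON parsing:

```shell
# Sketch: save the state locally, then let python3's json module decode
# the escaped string (stdlib only, no extra tools).
terraform state pull > state.json
python3 - <<'EOF'
import json

state = json.load(open("state.json"))
for res in state["resources"]:
    # helm_release.argocd from the examples above
    if res["type"] == "helm_release" and res["name"] == "argocd":
        print(res["instances"][0]["attributes"]["values"][0])
EOF
```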

Then use echo in your shell to print it out nicely:

$ echo "\"rbac\": |\n  create:  true\n\"server\":\n  \"autoscaling\":\n    \"enabled\": true\n  \"replicas\": 3\n"
"rbac": |
  create:  true
"server":
  "autoscaling":
    "enabled": true
  "replicas": 3

If you compare the outputs, there should be a difference between the output rendered from the copy of your code and the actual state.

The difference might be invisible or hard-to-recognize characters, so use a tool to get a diff.

Maybe there is an easier way to get the diff, but this is the first one that came to my mind for this case.
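One way to make whitespace-level differences visible is sketched below; the file names rendered.yaml and state.yaml are hypothetical stand-ins for the console output and the decoded state value, and a one-character difference is fabricated for illustration:

```shell
# Hypothetical files: rendered.yaml from `terraform console`, state.yaml
# decoded from the state file. Here we fabricate a one-space difference.
printf '"rbac": |\n  create:  true\n' > rendered.yaml
printf '"rbac": |\n  create: true\n'  > state.yaml

# GNU `cat -A` marks line ends with `$` and tabs with `^I`, so trailing
# whitespace and tab-vs-space issues become visible in the diff output.
diff -u rendered.yaml state.yaml | cat -A
```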

lorelei-rupp-imprivata commented 1 year ago

Thanks, I'll look into this and see what it says. It's just so odd that it thinks there is a change when there isn't.

lorelei-rupp-imprivata commented 1 year ago

OK, I was able to run this and compare the two outputs in IntelliJ.

Some things I noticed:

  1. The indents in triggers were off.
  2. These .tmpl templates that we use, which look like this:
    email:
    subject: Application {{.app.metadata.name}} sync status is 'Unknown'
    message: |
    {{if eq .serviceType "slack"}}:exclamation:{{end}} Application {{.app.metadata.name}} sync is 'Unknown'.
    Application details: {{.context.argocdUrl}}/applications/{{.app.metadata.name}}.
    {{if ne .serviceType "slack"}}
    {{range $c := .app.status.conditions}}
      * {{$c.message}}
    {{end}}
    {{end}}
    slack:
    .....

    In the state, variables like $c are just empty, like an empty space. These are loaded in here:

    "templates" = {
        "template.app-deployed"            = file("${path.module}/templates/app-deployed.tmpl")
        "template.app-health-degraded"     = file("${path.module}/templates/health-degraded.tmpl")
        "template.app-sync-failed"         = file("${path.module}/templates/sync-failed.tmpl")
        "template.app-sync-running"        = file("${path.module}/templates/sync-running.tmpl")
        "template.app-sync-status-unknown" = file("${path.module}/templates/status-unknown.tmpl")
        "template.app-sync-succeeded"      = file("${path.module}/templates/sync-succeeded.tmpl")
      }
      "triggers" = {
        "trigger.on-deployed"        = file("${path.module}/templates/trigger-on-deployed.tmpl")
        "trigger.on-health-degraded" = file("${path.module}/templates/trigger-on-health-degraded.tmpl")
        "trigger.on-sync-failed"     = file("${path.module}/templates/trigger-on-sync-failed.tmpl")
        "trigger.on-sync-running"    = file("${path.module}/templates/trigger-on-sync-running.tmpl")
      }

    I am not sold that these are the problem, though.

  3. What is different, though, are the passwords. For example, for https://github.com/argoproj/argo-helm/blob/main/charts/argo-cd/values.yaml#L531, the value from terraform console is the bcrypt hash ($2a...), but the value in state is not. I think this is the culprit: there are two passwords here where the diff shows the bcrypt hash vs. the redaction that I think Terraform does because the value is "sensitive".
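(A hedged aside, not from this thread: if the bcrypt hash is produced with Terraform's built-in bcrypt() function, that alone can explain a perpetual diff. bcrypt() generates a new random salt on every call, so the rendered value never matches what is already stored in state. This is easy to see in a console session:)

```
$ terraform console
> bcrypt("example") == bcrypt("example")
false
```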

lorelei-rupp-imprivata commented 11 months ago

I opened this ticket here https://github.com/hashicorp/terraform-provider-helm/issues/1295 for the perpetual diff issue

lorelei-rupp-imprivata commented 11 months ago

After solving our perpetual diff issue and moving the sensitive values to the set_sensitive block, we no longer see an issue here.
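For readers landing here, a minimal sketch of the set_sensitive approach mentioned above (the chart key and variable name are hypothetical, not taken from this thread):

```terraform
resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = var.namespace

  # Non-secret overrides stay in `values`.
  values = [
    yamlencode(local.chart_variables)
  ]

  # Secrets move into set_sensitive blocks: the value is redacted in
  # plans and no longer participates in the rendered-values diff.
  # The key and variable below are hypothetical examples.
  set_sensitive {
    name  = "configs.secret.argocdServerAdminPassword"
    value = var.admin_password_bcrypt
  }
}
```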