hashicorp / terraform-provider-kubernetes

Terraform Kubernetes provider
https://www.terraform.io/docs/providers/kubernetes/
Mozilla Public License 2.0
1.56k stars · 963 forks

kubernetes_manifest does not wait for dependencies #1782

Open iosifnicolae2 opened 1 year ago

iosifnicolae2 commented 1 year ago

The kubernetes_manifest resource does not wait for dependencies.

Terraform Version, Provider Version and Kubernetes Version

Terraform version:  v1.2.4
Kubernetes provider version: v2.6.1
Kubernetes version:  v1.21.5+k3s2

Affected Resource(s)

kubernetes_manifest

Terraform Configuration Files

resource "kubernetes_manifest" "longhorn_backups" {
  depends_on = [rancher2_app_v2.rancher_longhorn]
  manifest   = {
    "apiVersion" : "longhorn.io/v1beta1",
    "kind" : "RecurringJob",
    "metadata" : {
      "name" : "automatic-backups",
      "namespace" : "longhorn-system"
    },
    "spec" : {
      "concurrency" : 3,
      "cron" : var.longhorn_backup_cron,
      "retain" : 1,
      "task" : "backup"
      "groups" : ["default"]
    }
  }
}

Panic Output

The plugin returned an unexpected error from plugin.(*GRPCProvider).ReadResource: rpc error: code = Unknown desc = no matches for kind "RecurringJob" in version "longhorn.io/v1beta1"

Steps to Reproduce

  1. terraform apply

Expected Behavior

We expected the kubernetes_manifest resource to wait for its dependencies and only then check the status of the manifest.

Actual Behavior

We're getting an error because the CRD had not yet been deployed.

alexsomesan commented 1 year ago

Hello,

Thanks for raising this. Unfortunately, this is not a bug, but a known limitation. The limitation stems from the way Terraform itself arranges the operations of planning and applying. All resources that are part of the same operation are first planned for and only after all resources are finished planning do they begin to get applied.

Meanwhile, the kubernetes_manifest resource reaches out to the Kubernetes API during the planning phase to retrieve structural information (the schema) for the particular resource kind it needs to manage. The issue you are seeing is that the CRD's structural information isn't yet available on the API server during planning because, as mentioned above, all planning has to complete before any resources get applied. This means the CRD hasn't yet been created on the API server by the time planning for the CR resource tries to retrieve its structural info.

There is a simple work-around to this issue: split your configuration into two (or more) applies. The first apply should create all the CRDs, then you can move on to a second apply operation to create the CR objects and any other K8s resources you may need. This will ensure that the CRDs are available on the API server by the time Terraform needs to plan the CR resources.
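The split described above can be sketched as two root modules applied in sequence (the module layout, file names, and the yamldecode/file approach here are illustrative, not taken from the original report):

```hcl
# Stage 1 -- e.g. ./crds/main.tf -- apply this configuration first,
# so the CustomResourceDefinition exists on the API server.
resource "kubernetes_manifest" "recurringjob_crd" {
  manifest = yamldecode(file("${path.module}/recurringjob-crd.yaml"))
}

# Stage 2 -- e.g. ./workloads/main.tf -- apply only after stage 1 has
# completed, so the provider can fetch the CRD's schema at plan time.
resource "kubernetes_manifest" "longhorn_backups" {
  manifest = {
    apiVersion = "longhorn.io/v1beta1"
    kind       = "RecurringJob"
    metadata = {
      name      = "automatic-backups"
      namespace = "longhorn-system"
    }
    spec = {
      task = "backup"
    }
  }
}
```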

We will try to make this aspect more clear in the provider documentation.

Hope this is useful and unblocks your work.

steve-gray commented 1 year ago

What advantage does validating the manifest during the plan phase have? I'm at a loss as to the rationale behind this limitation/constraint and the downstream complexity it brings to people using this provider. The provider would work equally effectively if the manifest were validated at the apply stage, because you'd need all the variable values interpolated into the spec to even know whether they're valid, so it has to validate again later anyway (i.e. you're already on the wrong side of the airtight hatchway when it comes to spec validation).

CRDs are inherently a runtime consideration: they can be updated dynamically, on the fly, in the cluster, so statically resolving them during the plan phase offers nothing. Helm and other providers don't error in similar ways, and preventing people from deploying a cluster and its dependent resources in a single pass feels like a bad call.

I'm happy to do up a PR to move this check to the right place if folks buy into it.

pathob commented 1 year ago

@alexsomesan

I have to agree on everything that @steve-gray is saying.

The limitation stems from the way Terraform itself arranges the operations of planning and applying. All resources that are part of the same operation are first planned for and only after all resources are finished planning do they begin to get applied.

I don't see a contradiction here with what @steve-gray is suggesting. Why can't we plan a CR based on a CRD that doesn't exist yet? Terraform doesn't have to be that strict, and in practice it isn't in many cases. I mostly work with the azurerm provider, and how many times have I seen Terraform fail during the apply step for ridiculous reasons, like a case-sensitive string validation that was not done during the plan step. So why be so strict here?

Another approach, which would match the way Helm works for example, is to add a validate_during_plan flag (or similarly named), just as helm template does; see https://helm.sh/docs/helm/helm_template/

There is a simple work-around to this issue: split your configuration into two (or more) applies. The first apply should create all the CRDs, then you can move on to a second apply operation to create the CR objects and any other K8s resources you may need.

That's not as simple as it sounds. Many workflows just don't work like that. We are, for example, using Terragrunt and its mocking features to run ALL plans first and then ALL applies, so splitting this up into different modules wouldn't solve the issue.

I would like to see this discussion reopened. Thanks in advance.

kutysam commented 1 year ago

(Quoting @steve-gray's comment above in full.)

Assuming you are creating an Argo CD helm_release and then applying a manifest with:

    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"

it will not work: the helm_release needs to be applied first, and depends_on on the kubernetes_manifest resource does not help.
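A minimal sketch of this failing pattern (the chart coordinates, names, and manifest fields are illustrative, not from the comment above):

```hcl
resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm" # assumed chart repository
  chart      = "argo-cd"
  namespace  = "argocd"
}

# This fails at plan time: the Application CRD does not exist yet, so the
# provider cannot fetch its schema, and depends_on cannot help because all
# planning happens before any resource is applied.
resource "kubernetes_manifest" "app" {
  depends_on = [helm_release.argocd]
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name      = "example"
      namespace = "argocd"
    }
    spec = {}
  }
}
```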

observe-ryan commented 11 months ago

I agree with @steve-gray and @pathob here; failing because you can't validate that the manifest will apply correctly because the control plane isn't available is a bug. In my opinion, the situation where the api server cannot be contacted should not fail the plan or apply but rather put those resources into a "known after apply" stage.

At this point, a plan will "succeed" and an apply will handle any upcoming error as the resources are actually applied. If the appropriate depends_on isn't defined then the apply should fail with the appropriate error code.

online01993 commented 10 months ago

up, pls!

suryastef commented 10 months ago

This happens when your cluster doesn't have the required CRD applied, which makes Terraform fail when planning or applying the manifest.

There is a workaround: use the kubectl_manifest resource from gavinbunney/kubectl with a separate Kubernetes manifest (YAML) file; note that the manifest is then handled as a plain string rather than a typed structure.

Somehow the kubectl_manifest resource doesn't need the required CRD to be applied first, and depends_on is fully supported like on a regular Terraform resource.

Here is my tf file for reference:

resource "kubectl_manifest" "cnpg_db_tools" {
  yaml_body = file("manifest/cnpg-pg-cluster.yaml")
  depends_on = [
    kubernetes_secret.superuser_secret,
    kubernetes_secret.init_db_secret
  ]
}
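Since yaml_body is an ordinary string, the manifest can also be parameterized by rendering it with templatefile (a sketch; the template file name and the variable interpolated into it are hypothetical):

```hcl
resource "kubectl_manifest" "cnpg_db_tools" {
  # Render the YAML template with Terraform values before handing it
  # to the provider as a plain string.
  yaml_body = templatefile("${path.module}/manifest/cnpg-pg-cluster.yaml.tpl", {
    instances = var.db_instances # hypothetical variable used in the template
  })
}
```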

PietervdWerk commented 8 months ago

@suryastef The kubectl_manifest workaround works, though for a reason I cannot explain, the error then showed up after the resource was applied (even though the resources applied successfully). I added wait = true, which resolved the issue.

wait - Optional. Set this flag to wait or not for finalized to complete for deleted objects. Default false.

alexsomesan commented 8 months ago

There is a bit of misunderstanding in this conversation. The manifest isn't just getting validated during plan, at least not for value consistency. The one and only reason we need the CRD present at plan time is so that we can synthesise the type structure of the resource to conform to Terraform's wire and state storage formats, which are strongly typed. Not fully defining the type structure of the manifest during the first plan would result in broken-state situations further down the road as apply adds and removes attributes. This is dictated by the way Terraform itself is designed and is not something we as provider maintainers can change or avoid.

One of the strong points of Terraform compared to plain-text tools (e.g. Helm) is the ability to granularly vet and validate the planned changes before they are actually performed. Depending on the criticality of your use-case, this may or may not be important for you, but we have a lot of users with complex use-cases who value this feature, and we cannot degrade this experience as a whole.

As was mentioned above, there are community providers available that handle the manifest as a blob of text instead of a typed structure. The downsides are that you will not be able to see changes to individual attributes during planning and, more importantly, you will not be able to reference attribute values in expressions on other resources or outputs.
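For example, with the typed kubernetes_manifest resource, an individual attribute of the applied object can feed an output or another resource (a sketch reusing the configuration from the top of this issue; the provider exposes the applied state under the object attribute):

```hcl
output "backup_job_name" {
  # Reference one typed attribute of the applied manifest.
  value = kubernetes_manifest.longhorn_backups.object.metadata.name
}
```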

Weigh the trade-offs and choose what works best for you, but please consider the size and diversity of the Terraform user base before dismissing a design decision as "a bug".

rayrapetyan commented 4 months ago

Is there any other workaround than putting:

count = var.create_istio_vs == "true" ? 1 : 0

into every manifest and running terraform apply twice, like this:

    terraform apply
    terraform apply -var "create_istio_vs=true" -auto-approve

The workaround above works fine for me, but I'm just curious whether there are more.
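A fuller sketch of this two-pass pattern (the resource name and manifest fields are illustrative; it presumably works because with count = 0 on the first apply there is no instance to plan, so the provider never needs the CRD's schema):

```hcl
variable "create_istio_vs" {
  type    = string
  default = "false"
}

# First apply: count = 0, nothing is planned against the missing CRD.
# Second apply (-var "create_istio_vs=true"): the CRD now exists,
# so the provider can fetch its schema during planning.
resource "kubernetes_manifest" "istio_vs" {
  count = var.create_istio_vs == "true" ? 1 : 0
  manifest = {
    apiVersion = "networking.istio.io/v1beta1"
    kind       = "VirtualService"
    metadata = {
      name      = "example"
      namespace = "default"
    }
    spec = {
      hosts = ["example.local"]
    }
  }
}
```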