
Apply component workflow fails for component of terraform schematic #6313

Open shreyasHpandya opened 1 year ago

shreyasHpandya commented 1 year ago

Apply component workflow fails for a component of schematic type terraform

We have a custom component of schematic type terraform which creates a DynamoDB table. We create tables by applying Applications that use this component, and the Applications use the apply-component workflow step.

After a few seconds the workflow fails with the following error:

```
run step(provider=oam,do=component-apply): CollectHealthStatus: app=test, comp=<redacted>, check health error: no matches for kind "Configuration" in version "terraform.core.oam.dev/v1beta2"
```

This error is misleading, however: the Configuration object is created, the apply job runs successfully, and the DynamoDB table is created as expected. The Configuration object also reaches the Ready state after some time.

My guess is that the workflow step waits for a while to check whether the Configuration object is Ready and fails if it is not. Since the apply job takes a while to complete, the object does not become Ready within the duration the workflow is willing to wait.

KubeVela version: 1.9.6

Component definition:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: ComponentDefinition
metadata:
  annotations:
    definition.oam.dev/description: Terraform module which creates DynamoDB table
      on AWS
  creationTimestamp: null
  labels:
    type: terraform-aws
  name: tf-aws-dynamodb-table
  namespace: vela-system
spec:
  schematic:
    terraform:
      configuration: https://github.com/Guidewire/terraform-aws-dynamodb-table.git
      providerRef:
        name: aws
        namespace: default
      type: remote
  workload:
    definition:
      apiVersion: terraform.core.oam.dev/v1beta1
      kind: Configuration
status: {}
```

Sample application:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: test
spec:
  components:
    - name: test
      type: tf-aws-dynamodb-table
      properties:
        name: "test"
        hash_key: "id"
        ttl_enabled: true
        ttl_attribute_name: "ts"
        autoscaling_enabled: true
        stream_enabled: true
        stream_view_type: "NEW_AND_OLD_IMAGES"
        attributes:
          - name: "id"
            type: "N"
        replica_regions:
          - region_name: us-east-1
          - region_name: us-west-1
        tags:
          Key: "Val"
  policies:
    - name: apply-once
      type: apply-once
      properties:
        enable: true
  workflow:
    steps:
      - name: create-dynamodb
        type: apply-component
        properties:
          component: test
```

anoop2811 commented 1 year ago

@shreyasHpandya: can you look at the logs of the KubeVela pods in the vela-system namespace? Specifically, are there any jobs spun up to apply the Terraform configuration?

shreyasHpandya commented 1 year ago

My initial investigation leads me to believe the issue is in the apply-component workflow step.

> This error is misleading, however: the Configuration object is created, the apply job runs successfully, and the DynamoDB table is created as expected. The Configuration object also reaches the Ready state after some time.
>
> My guess is that the workflow step waits for a while to check whether the Configuration object is Ready and fails if it is not. Since the apply job takes a while to complete, the object does not become Ready within the duration the workflow is willing to wait.

I will look into the KubeVela code. @chivalryq can you help me with where in the code I might look for this issue?

chivalryq commented 1 year ago

It's unusual to see no matches for kind "Configuration" in version "terraform.core.oam.dev/v1beta2" reported here, because this GVK has been registered with the underlying client in vela-core. I'm sure the GVK terraform.core.oam.dev/v1beta2.Configuration has been registered.

The error is reported in this function:

https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/controller/core.oam.dev/v1beta1/application/apply.go#L262

Now this client is set here: https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/controller/core.oam.dev/v1beta1/application/apply.go#L80

It's the reconciler's client, and the reconciler's client comes from the manager. The manager's client initialization process is here:

https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/cmd/core/app/server.go#L135-L156

The scheme comes from common.Scheme. Here it is; we can see that Terraform v1beta2 has been registered:

https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/utils/common/common.go#L73-L97

@shreyasHpandya You can start from the trace above to check why the client can't recognize the GVK even though it has been registered. The client or scheme could be replaced somewhere, or this client could be re-assigned.
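
A note for anyone following the trace: the string no matches for kind ... in version ... is produced by the RESTMapper (a meta.NoKindMatchError), not by the scheme, so having the GVK registered in common.Scheme does not by itself rule this error out. Below is a minimal standalone sketch, not KubeVela code, showing where that error surfaces; the kubeconfig handling is purely illustrative.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative only: build a REST config from the default kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// The scheme maps GVKs to Go types in-process; the RESTMapper maps GVKs to
	// REST resources by querying API discovery on the cluster.
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	groupResources, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		panic(err)
	}
	mapper := restmapper.NewDiscoveryRESTMapper(groupResources)

	gvk := schema.GroupVersionKind{
		Group:   "terraform.core.oam.dev",
		Version: "v1beta2",
		Kind:    "Configuration",
	}
	// If the Configuration CRD is not served by the API server (or the
	// discovery data the mapper was built from is stale), this returns a
	// meta.NoKindMatchError, whose message is exactly
	// `no matches for kind "Configuration" in version "..."`.
	if _, err := mapper.RESTMapping(gvk.GroupKind(), gvk.Version); meta.IsNoMatchError(err) {
		fmt.Println("RESTMapper cannot resolve the GVK:", err)
	}
}
```

If the mapping resolves while the application controller still reports the error, stale discovery data inside vela-core becomes the prime suspect rather than scheme registration.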

bugbounce commented 8 months ago

Hello @chivalryq ,

Thanks for the tip. It was very helpful. Debug setup:

My understanding of the controller-runtime client Scheme:

The runtime.Scheme maintains an in-memory mapping of GVKs to Go types and vice versa, as well as maps of the known GVs and any unstructured types. It consults the RESTMapper to figure out the kube API endpoints and, at a high level, acts as an intermediary cache between the client and the kube-apiserver by maintaining the current mapping of all available GVKs. Sort of. I am still unclear on who is responsible for actually creating new CRDs from imported controllers in Kubernetes, and on when that should happen. For example, when the Terraform APIs are imported here:

https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/utils/common/common.go#L73-L97

and the terraform-controller init() is called:

https://github.com/kubevela/terraform-controller/blob/966471af19a07ffe94159e231899a0983a71c188/api/v1beta2/configuration_types.go#L191-L193

should we expect the GVK to be visible via kubectl? If not, when? Time delay doesn't look to be a factor.
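
One way to think about this (a sketch based on general client-go/apiextensions behaviour, not on the KubeVela code itself): the init() in configuration_types.go only registers the Go type into an in-process scheme, whereas kubectl visibility requires the CRD object to exist on the API server, which is typically created separately (e.g. by the terraform-controller Helm chart). The CRD name below follows the standard plural.group convention; the kubeconfig handling is illustrative.

```go
package main

import (
	"context"
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// 1) In-process registration: this is the kind of thing an imported API
	//    package's init() (via its SchemeBuilder) does. The API server never
	//    sees it, so it has no effect on `kubectl api-resources`. Here the
	//    apiextensions types stand in as an example.
	scheme := runtime.NewScheme()
	_ = apiextensionsv1.AddToScheme(scheme)

	// 2) Cluster-side registration: the CRD object must exist on the API server
	//    before kubectl, or any client's RESTMapper, can resolve the kind.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := apiextensionsclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	crd, err := cs.ApiextensionsV1().CustomResourceDefinitions().
		Get(context.TODO(), "configurations.terraform.core.oam.dev", metav1.GetOptions{})
	if err != nil {
		fmt.Println("CRD not installed on the cluster yet:", err)
		return
	}
	fmt.Println("CRD present on the cluster:", crd.Name)
}
```

In other words, importing the API package only changes what this one process can serialize; it never creates anything on the cluster.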

Observations:

bugbounce commented 8 months ago

On further investigation, the *runtime.Scheme doesn't seem to have anything to do with installing the terraform-controller CRDs. We will try ensuring enough of a delay between installing the CRDs and applying an Application that uses a terraform schematic.
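
Rather than relying on a fixed time delay, one option is to wait for the CRD's Established condition before applying the Application. This is a sketch using the standard apiextensions clientset, not something KubeVela does for you; the CRD name and kubeconfig handling are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForCRDEstablished polls until the named CRD reports the Established
// condition, i.e. the API server is actually serving the new kind.
func waitForCRDEstablished(ctx context.Context, cs apiextensionsclient.Interface, name string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			crd, err := cs.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, nil // not created yet; keep polling
			}
			for _, cond := range crd.Status.Conditions {
				if cond.Type == apiextensionsv1.Established && cond.Status == apiextensionsv1.ConditionTrue {
					return true, nil
				}
			}
			return false, nil
		})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := apiextensionsclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	if err := waitForCRDEstablished(context.Background(), cs, "configurations.terraform.core.oam.dev"); err != nil {
		panic(err)
	}
	fmt.Println("Configuration CRD is established; safe to apply the Application")
}
```

The kubectl equivalent would be `kubectl wait --for=condition=established crd/configurations.terraform.core.oam.dev`.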

anoop2811 commented 8 months ago

@chivalryq : do you have any ideas on this?

bugbounce commented 7 months ago

We were able to tentatively test and resolve this by ensuring that the terraform-controller CRDs are installed before vela-core boots up. Our best guess is that this is a controller-runtime issue; there appear to be existing issues around the controller-runtime cache when the CRDs are not installed at startup:

https://github.com/kubernetes-sigs/controller-runtime/issues/2456
https://github.com/kubernetes-sigs/controller-runtime/issues/2589
https://github.com/kubernetes-sigs/controller-runtime/issues/1759
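
A plausible reading of those issues (hedged, and not the exact code path inside vela-core): the REST mapping is resolved from cached discovery data, so a CRD installed after that cache is populated can keep producing no matches for kind until the mapper is refreshed or the process restarts. A self-contained sketch of that refresh mechanism using client-go's deferred discovery RESTMapper, with illustrative kubeconfig handling:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Discovery results are cached in memory; the mapper resolves kinds from
	// that cache rather than hitting the API server on every lookup.
	cached := memory.NewMemCacheClient(dc)
	mapper := restmapper.NewDeferredDiscoveryRESTMapper(cached)

	gk := schema.GroupKind{Group: "terraform.core.oam.dev", Kind: "Configuration"}
	if _, err := mapper.RESTMapping(gk, "v1beta2"); meta.IsNoMatchError(err) {
		// The CRD may have been installed after the cache was filled.
		// Resetting invalidates the cached discovery data, then we retry.
		mapper.Reset()
		if _, err := mapper.RESTMapping(gk, "v1beta2"); err != nil {
			fmt.Println("still unresolved after refresh:", err)
			return
		}
	}
	fmt.Println("GVK resolved")
}
```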

Additional log trace for anyone else running into this:

```
apply.go:412] "[Finished]: i-lrlc854n.apply-policies(finish apply policies)" application="default/test-bin-repo-app" controller="application" resource_version="160683" generation=1 publish_version="alpha1" duration="1.63µs" spanID="i-lrlc854n.apply-policies"
I0304 15:41:10.001743       1 generator.go:76] "[Finished]: i-lrlc854n.generate-task-runners(finish generate task runners)" application="default/test-bin-repo-app" controller="application" resource_version="160683" generation=1 publish_version="alpha1" duration="116.981µs" spanID="i-lrlc854n.generate-task-runners"
I0304 15:41:10.012291       1 assemble.go:69] "Successfully assemble a workload" workload="default/test-bin-repo-app" APIVersion="terraform.core.oam.dev/v1beta2" Kind="Configuration"
I0304 15:41:10.020192       1 apply.go:126] "skip update" name="test-bin-repo-app" resource="terraform.core.oam.dev/v1beta2, Kind=Configuration"
I0304 15:41:10.046566       1 apply.go:126] "skip update" name="test-bin-repo-app" resource="terraform.core.oam.dev/v1beta2, Kind=Configuration"
E0304 15:41:10.046801       1 task.go:252] "do steps" err="run step(provider=oam,do=component-apply): CollectHealthStatus: app=test-bin-repo-app, comp=test-bin-repo-app, check health error: no matches for kind \"Configuration\" in version \"terraform.core.oam.dev/v1beta2\"" application="default/test-bin-repo-app" controller="application" resource_version="160683" generation=1 publish_version="alpha1" step_name="test-bin-repo-app" step_type="builtin-apply-component" spanID="i-lrlc854n.execute application workflow.efrta1kpup"
```