shreyasHpandya opened this issue 1 year ago
@shreyasHpandya: can you look at the logs of the KubeVela pods in the vela-system namespace? Specifically, any jobs that are spun up to apply the Terraform configuration?
My initial investigation leads me to believe the issue is in the apply-component workflow step.
This error is misleading, however: the workflow is creating the Configuration object, running the apply job successfully, and creating the DynamoDB table as expected. The Configuration object also reaches the ready state after some time.
My guess is that the workflow step waits a while to see whether the Configuration object becomes ready and fails if it does not. Since the apply job takes a while to complete, the object does not reach the ready state within the duration the workflow expects.
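If the timeout theory holds, one thing worth trying is raising the step-level timeout. A minimal sketch, assuming KubeVela's workflow-step `timeout` field; `dynamodb-table` is a hypothetical component name:

```yaml
# Hedged sketch: the step-level timeout field is assumed from KubeVela's
# workflow documentation; "dynamodb-table" is a hypothetical component name.
spec:
  workflow:
    steps:
      - name: apply-table
        type: apply-component
        timeout: 10m
        properties:
          component: dynamodb-table
```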
I will look in the Kubevela code. @chivalryq can you help me with where I might look into the code for this issue?
It's uncommon to see no matches for kind "Configuration" in version "terraform.core.oam.dev/v1beta2" reported, because this GVK has been registered with the underlying client in vela-core. I'm sure the GVK terraform.core.oam.dev/v1beta2, Kind=Configuration has been registered.
The error is reported in this function.
Now, this client is set here: https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/controller/core.oam.dev/v1beta1/application/apply.go#L80
It is the reconciler's client, and the reconciler's client comes from the manager during the manager's client initialization.
The scheme comes from common.Scheme, where we can see that terraform v1beta2 has been registered.
@shreyasHpandya You can start from the trace above to check why the client can't recognize the GVK even though it has been registered. The client or scheme could be replaced somewhere, or this client could be reassigned.
Hello @chivalryq,
Thanks for the tip; it was very helpful. Debug setup: k3d cluster create, then make core-install and make def-install from the KubeVela repo, then run cmd/core/main.go.
My understanding of the controller-runtime client Scheme:
The runtime.Scheme maintains an in-memory mapping of GVKs to Go types and vice versa, as well as maps of known GVs and any unstructured types. It consults the RESTMapper to figure out the kube API endpoints and, at a high level, acts as an intermediary cache between the client and the kube-apiserver by maintaining the current mapping of all available GVKs. Sort of. I am still unclear on who is responsible for actually creating new CRDs from imported controllers in Kubernetes, and when that should happen. For example, in https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/utils/common/common.go#L73-L97, when the terraform APIs are imported and the terraform-controller init() is called, should we expect the GVK to be visible via kubectl? If not, when? Time delay doesn't look to be a factor.
Observations:
Once vela-core is executed, controllers for all built-in GVKs come up. The manager's scheme has registered both the terraform GVs, v1beta1 and v1beta2. The Configuration CRD is not yet applied and is not visible via kubectl.
Apply the ComponentDefinition and Application mentioned in the original report. The Reconciler's client and Scheme are the same as the manager's. The Scheme includes terraform.core.oam.dev/v1beta1 and terraform.core.oam.dev/v1beta2 in observedVersions, and a terraform.core.oam.dev/v1beta1, Kind=Configuration mapping in the gvkToType and typeToGVK maps. The terraform.core.oam.dev/v1beta2 GV also has some kinds listed. Both gvkToType and typeToGVK have hundreds of entries, so I might have missed the Configuration entry for terraform.core.oam.dev/v1beta2.
The reconciler seems to parse and properly generate the appfile. A NoKindMatchError for Configuration is thrown at https://github.com/kubevela/kubevela/blob/1a001e5b29da766f8272a5f3f99b215c3fb13a7a/pkg/utils/apply/apply.go#L286-L312
The workflow fails with the error run step(provider=oam,do=component-apply): Dispatch: pre-dispatch dryrun failed: Found 1 errors. [(cannot get object: no matches for kind "Configuration" in version "terraform.core.oam.dev/v1beta1")]. The error is slightly different if pre-dispatch checks are disabled.
In my local setup, the flow never seems to reach https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/controller/core.oam.dev/v1beta1/application/apply.go#L262
As far as I can see, the Scheme is consistent throughout; we don't seem to be interfering with its state anywhere. I'm not sure why this is intermittent in our prod deployments.
Any advice would be great. Thanks in advance.
On further investigation, the *runtime.Scheme doesn't seem to have anything to do with installing the terraform-controller CRDs. We will try ensuring a sufficient delay between installing the CRDs and applying a terraform-schematic Application.
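One way to enforce that ordering explicitly, rather than relying on a fixed delay, is to gate the Application apply on the CRD being established. A hedged sketch; the CRD name is assumed from terraform-controller's naming conventions:

```
# Block until the Configuration CRD is installed and established
# (CRD name assumed from terraform-controller's conventions).
kubectl wait --for=condition=Established \
  crd/configurations.terraform.core.oam.dev --timeout=120s
```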
@chivalryq : do you have any ideas on this?
We were able to tentatively resolve this by ensuring that the terraform-controller CRDs are installed before vela-core boots up. Our best guess is that this is a controller-runtime issue. There seem to be existing issues around the controller-runtime cache when CRDs are not installed:
https://github.com/kubernetes-sigs/controller-runtime/issues/2456
https://github.com/kubernetes-sigs/controller-runtime/issues/2589
https://github.com/kubernetes-sigs/controller-runtime/issues/1759
Additional log trace for anyone else running into this:
```
apply.go:412] "[Finished]: i-lrlc854n.apply-policies(finish apply policies)" application="default/test-bin-repo-app" controller="application" resource_version="160683" generation=1 publish_version="alpha1" duration="1.63µs" spanID="i-lrlc854n.apply-policies"
I0304 15:41:10.001743 1 generator.go:76] "[Finished]: i-lrlc854n.generate-task-runners(finish generate task runners)" application="default/test-bin-repo-app" controller="application" resource_version="160683" generation=1 publish_version="alpha1" duration="116.981µs" spanID="i-lrlc854n.generate-task-runners"
I0304 15:41:10.012291 1 assemble.go:69] "Successfully assemble a workload" workload="default/test-bin-repo-app" APIVersion="terraform.core.oam.dev/v1beta2" Kind="Configuration"
I0304 15:41:10.020192 1 apply.go:126] "skip update" name="test-bin-repo-app" resource="terraform.core.oam.dev/v1beta2, Kind=Configuration"
I0304 15:41:10.046566 1 apply.go:126] "skip update" name="test-bin-repo-app" resource="terraform.core.oam.dev/v1beta2, Kind=Configuration"
E0304 15:41:10.046801 1 task.go:252] "do steps" err="run step(provider=oam,do=component-apply): CollectHealthStatus: app=test-bin-repo-app, comp=test-bin-repo-app, check health error: no matches for kind \"Configuration\" in version \"terraform.core.oam.dev/v1beta2\"" application="default/test-bin-repo-app" controller="application" resource_version="160683" generation=1 publish_version="alpha1" step_name="test-bin-repo-app" step_type="builtin-apply-component" spanID="i-lrlc854n.execute application workflow.efrta1kpup"
```
Apply component workflow fails for a component of schematic type terraform
We have a custom component of schematic type terraform which creates a DynamoDB table. We create tables by applying applications that use this component, with the apply-component workflow step. After a few seconds the workflow fails with the following error:
run step(provider=oam,do=component-apply): CollectHealthStatus: app=test, comp=<redacted>, check health error: no matches for kind "Configuration" in version "terraform.core.oam.dev/v1beta2"
This error is misleading, however: the workflow is creating the Configuration object, running the apply job successfully, and creating the DynamoDB table as expected. The Configuration object also reaches the ready state after some time. My guess is that the workflow step waits a while to see whether the Configuration object becomes ready and fails if it does not. Since the apply job takes a while to complete, the object does not reach the ready state within the duration the workflow expects.
KubeVela Version: 1.9.6
Component definition
Sample application