opdev / opcap

Apache License 2.0
3 stars 15 forks source link

fixing CleanUp issues with Operand #347

Closed yashoza19 closed 1 year ago

yashoza19 commented 1 year ago

Signed-off-by: yoza yoza@redhat.com

Description of PR

Fixes #346

Changes (required)

Checklist (required)

acmenezes commented 1 year ago

/lgtm

acmenezes commented 1 year ago

/lgtm

madorn commented 1 year ago

Although this patch appears to work most of the time, I saw the following error..which after searching around appears to be a race condition...trying to see if i can replicate it again.

$ opcap check --audit-plan=OperatorInstall,OperandInstall --catalogsource=certified-operators --catalogsourcenamespace=openshift-marketplace --packages confluent-for-kubernetes

{"level":"error","ts":1674523424.1474535,"caller":"logger/logger.go:62","message":"CleanerName: github.com/opdev/opcap/internal/capability.operandCleanup.func1 ---- cleanup failed: Operation cannot be fulfilled on kafkas.platform.confluent.io \"kafka\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/platform.confluent.io/kafkas/opcap-confluent-for-kubernetes-ownnamespace/kafka, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 5d814f39-0030-4413-be60-2d6747b0fc11, UID in object meta: ","stacktrace":"github.com/opdev/opcap/internal/logger.Errorf\n\t/vagrant_data/opcap/internal/logger/logger.go:62\ngithub.com/opdev/opcap/internal/capability.cleanup\n\t/vagrant_data/opcap/internal/capability/auditor.go:165\ngithub.com/opdev/opcap/internal/capability.RunAudits\n\t/vagrant_data/opcap/internal/capability/auditor.go:230\ngithub.com/opdev/opcap/cmd.runAudits\n\t/vagrant_data/opcap/cmd/check.go:78\ngithub.com/opdev/opcap/cmd.checkRunE\n\t/vagrant_data/opcap/cmd/check.go:73\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/vagrant/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/vagrant/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/vagrant/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:968\ngithub.com/spf13/cobra.(*Command).ExecuteContext\n\t/home/vagrant/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:961\ngithub.com/opdev/opcap/cmd.Execute\n\t/vagrant_data/opcap/cmd/root.go:44\nmain.main\n\t/vagrant_data/opcap/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

madorn commented 1 year ago

I saw it again on hazelcast-platform-operator (contains multiple CRs):

{"level":"error","ts":1674525568.5580864,"caller":"logger/logger.go:62","message":"CleanerName: github.com/opdev/opcap/internal/capability.operandCleanup.func1 ---- cleanup failed: Operation cannot be fulfilled on cronhotbackups.hazelcast.com \"cronhotbackup-sample\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/hazelcast.com/cronhotbackups/opcap-hazelcast-platform-operator-ownnamespace/cronhotbackup-sample, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 19cab4ea-62e1-4c6c-8884-1bfa50f47988, UID in object meta: ","stacktrace":"github.com/opdev/opcap/internal/logger.Errorf\n\t/vagrant_data/opcap/internal/logger/logger.go:62\ngithub.com/opdev/opcap/internal/capability.cleanup\n\t/vagrant_data/opcap/internal/capability/auditor.go:165\ngithub.com/opdev/opcap/internal/capability.RunAudits\n\t/vagrant_data/opcap/internal/capability/auditor.go:230\ngithub.com/opdev/opcap/cmd.runAudits\n\t/vagrant_data/opcap/cmd/check.go:78\ngithub.com/opdev/opcap/cmd.checkRunE\n\t/vagrant_data/opcap/cmd/check.go:73\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/vagrant/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/vagrant/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/vagrant/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:968\ngithub.com/spf13/cobra.(*Command).ExecuteContext\n\t/home/vagrant/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:961\ngithub.com/opdev/opcap/cmd.Execute\n\t/vagrant_data/opcap/cmd/root.go:44\nmain.main\n\t/vagrant_data/opcap/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

yashoza19 commented 1 year ago

I saw it again on hazelcast-platform-operator (contains multiple CRs):

Thanks for checking @madorn. I did a complete run - and found more issues that I'm trying to resolve. Will update this PR with what I find.

yashoza19 commented 1 year ago

I tested the hazelcast-platform-operator with this PR, looks like it is deleting all other resource except this CR. Somehow even after deleting the finalizer, it keeps adding the finalizer unless the operator itself is not deleted. Here are the controller logs:

{"level":"info","ts":1674533308.0610213,"logger":"controllers.WanReplication","msg":"Adding finalizer","name":"wanreplication-sample","namespace":"opcap-hazelcast-platform-operator-ownnamespace/wanreplication-sample"}
{"level":"info","ts":1674533308.0675478,"logger":"wanreplication-resource","msg":"validate update","name":"wanreplication-sample"}
{"level":"error","ts":1674533308.0800261,"msg":"Reconciler error","controller":"wanreplication","controllerGroup":"hazelcast.com","controllerKind":"WanReplication","WanReplication":{"name":"wanreplication-sample","namespace":"opcap-hazelcast-platform-operator-ownnamespace"},"namespace":"opcap-hazelcast-platform-operator-ownnamespace","name":"wanreplication-sample","reconcileID":"60399d53-3e82-4194-8b45-01357591cb17","error":"Map.hazelcast.com \"map-sample\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/cont...

Need to investigate further into this.

acmenezes commented 1 year ago

@madorn , it looks it was just a logic error. We're skipping part of the deletion because some objects were being deleted as a consequence of others. It looked like secondary resources disappearing. The code wasn't expecting that. I think now Yash has it fixed.

mrhillsman commented 1 year ago

/approve

exe-prow-github-app[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: acmenezes, acornett21, mrhillsman

Associated issue: #346

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/opdev/opcap/blob/main/OWNERS)~~ [acmenezes,mrhillsman] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment