opdev / opcap

opcap fails to delete namespaces created for operators. #329

Closed acmenezes closed 1 year ago

acmenezes commented 1 year ago

Bug Description

opcap fails to delete namespaces between individual operator audits

Version and Command Invocation

Version: v0.2.0
Command: opcap check

Steps to Reproduce:

1) Run opcap check against a full-size cluster.

Expected Result

All resources created by opcap are deleted after each operator audit.

Actual Result

Multiple audits for individual operators throw the following error:

{"level":"error","ts":1670553181.705024,"caller":"logger/logger.go:62","message":"cleanup failed: could not delete namespace: opcap-infoscale-licensing-operator-allnamespaces: Internal error occurred: admission plugin \"ValidatingAdmissionWebhook\" failed to complete validation in 13s","stacktrace":"github.com/opdev/opcap/internal/logger.Errorf\n\t/home/alex/go/src/github.com/acmenezes/opcap/internal/logger/logger.go:62\ngithub.com/opdev/opcap/internal/capability.cleanup\n\t/home/alex/go/src/github.com/acmenezes/opcap/internal/capability/auditor.go:162\ngithub.com/opdev/opcap/internal/capability.RunAudits\n\t/home/alex/go/src/github.com/acmenezes/opcap/internal/capability/auditor.go:226\ngithub.com/opdev/opcap/cmd.runAudits\n\t/home/alex/go/src/github.com/acmenezes/opcap/cmd/check.go:76\ngithub.com/opdev/opcap/cmd.checkRunE\n\t/home/alex/go/src/github.com/acmenezes/opcap/cmd/check.go:71\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/alex/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/alex/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/alex/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:968\ngithub.com/spf13/cobra.(*Command).ExecuteContext\n\t/home/alex/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:961\ngithub.com/opdev/opcap/cmd.Execute\n\t/home/alex/go/src/github.com/acmenezes/opcap/cmd/root.go:44\nmain.main\n\t/home/alex/go/src/github.com/acmenezes/opcap/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

Additional Context

The cause may be timing-related, such as creating or deleting resources too quickly. It may also be related to finalizers that, for an unknown reason, are not being removed, which prevents the cluster from completing the delete operation.

madorn commented 1 year ago

We are relying on the deletion of the Namespace to clean up the lingering Operator CSV and the associated Operator controller Deployment. This can often leave the Namespace stuck in Terminating status when the Namespace controller attempts resource cleanup.

Per discussion with @acmenezes and @bcrochet, let's add an explicit deletion of the Operator CSV immediately after options.client.DeleteSubscription in operator_cleanup.go.
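
The ordering proposed above could look roughly like the sketch below. It is not the opcap code: the operatorClient interface, the DeleteCSV and DeleteNamespace signatures, and the fakeClient used to exercise it are all hypothetical stand-ins, and only the idea of deleting the CSV right after the Subscription comes from this thread.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// operatorClient is a hypothetical stand-in for the client used by the
// cleanup code; only the three calls needed for this sketch are modeled.
type operatorClient interface {
	DeleteSubscription(ctx context.Context, name, namespace string) error
	DeleteCSV(ctx context.Context, name, namespace string) error // assumed helper
	DeleteNamespace(ctx context.Context, name string) error
}

// cleanupOperator deletes the Subscription, then the CSV, then the Namespace.
// It stops at the first failure so the caller can see which step failed.
func cleanupOperator(ctx context.Context, c operatorClient, sub, csv, ns string) error {
	if err := c.DeleteSubscription(ctx, sub, ns); err != nil {
		return fmt.Errorf("could not delete subscription %s: %w", sub, err)
	}
	if err := c.DeleteCSV(ctx, csv, ns); err != nil {
		return fmt.Errorf("could not delete csv %s: %w", csv, err)
	}
	if err := c.DeleteNamespace(ctx, ns); err != nil {
		return fmt.Errorf("could not delete namespace %s: %w", ns, err)
	}
	return nil
}

// fakeClient records the order of calls; failOn optionally makes one step fail.
type fakeClient struct {
	calls  []string
	failOn string
}

func (f *fakeClient) record(step string) error {
	f.calls = append(f.calls, step)
	if step == f.failOn {
		return errors.New("simulated failure")
	}
	return nil
}

func (f *fakeClient) DeleteSubscription(_ context.Context, _, _ string) error {
	return f.record("subscription")
}

func (f *fakeClient) DeleteCSV(_ context.Context, _, _ string) error {
	return f.record("csv")
}

func (f *fakeClient) DeleteNamespace(_ context.Context, _ string) error {
	return f.record("namespace")
}

// demo runs the cleanup against the fake and returns the recorded call order.
func demo(failOn string) ([]string, error) {
	c := &fakeClient{failOn: failOn}
	err := cleanupOperator(context.Background(), c, "my-sub", "my-operator.v1.0.0", "opcap-my-operator")
	return c.calls, err
}

func main() {
	calls, err := demo("")
	fmt.Println(calls, err)
}
```

Stopping at the first error is only one option. A best-effort variant that attempts every step and joins the errors may suit cleanup better, since a failed CSV delete would otherwise leave the Namespace behind as well.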

acmenezes commented 1 year ago

Right, @madorn, I'll investigate that option, although it looks like an intermittent problem. I was able to run it in full this afternoon, with all namespaces deleted and all resources cleaned up correctly.

bcrochet commented 1 year ago

We should also check that operands are actually being deleted. Currently, the deletion is fire-and-forget. We could launch a goroutine for each custom resource and wait for either completion or a timeout.
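
A minimal sketch of that idea, using only the standard library: one goroutine per resource polls until the resource is gone or a shared timeout expires. The existsFunc hook, the function names, and the timings are assumptions for illustration. In opcap the check would presumably be a GET against the API server, which is not modeled here.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// existsFunc reports whether a resource is still present. It is a
// hypothetical hook standing in for a lookup against the cluster.
type existsFunc func(ctx context.Context, name string) (bool, error)

// waitDeleted polls until the resource is gone or ctx expires.
func waitDeleted(ctx context.Context, name string, exists existsFunc, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		present, err := exists(ctx, name)
		if err != nil {
			return fmt.Errorf("checking %s: %w", name, err)
		}
		if !present {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for %s to be deleted: %w", name, ctx.Err())
		case <-ticker.C:
		}
	}
}

// waitAllDeleted starts one goroutine per resource and returns an error for
// each name that was not confirmed deleted before the timeout.
func waitAllDeleted(names []string, exists existsFunc, timeout, interval time.Duration) map[string]error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed = make(map[string]error)
	)
	for _, name := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if err := waitDeleted(ctx, name, exists, interval); err != nil {
				mu.Lock()
				failed[name] = err
				mu.Unlock()
			}
		}(name)
	}
	wg.Wait()
	return failed
}

// demo simulates two operands: "fast" disappears on the third poll,
// while "stuck" never goes away and should hit the timeout.
func demo() map[string]error {
	var mu sync.Mutex
	polls := map[string]int{}
	exists := func(_ context.Context, name string) (bool, error) {
		mu.Lock()
		defer mu.Unlock()
		polls[name]++
		return name == "stuck" || polls[name] < 3, nil
	}
	return waitAllDeleted([]string{"fast", "stuck"}, exists, 200*time.Millisecond, 10*time.Millisecond)
}

func main() {
	for name, err := range demo() {
		fmt.Printf("%s: %v\n", name, err)
	}
}
```

Returning a per-resource error map lets the audit report which operands were left behind instead of failing the whole cleanup with a single message.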

madorn commented 1 year ago

Opened up #337 per @bcrochet's suggestion.