openshift / oadp-operator

OADP Operator

Test workloads may not clean up properly after each test #1244

Open kaovilai opened 10 months ago

kaovilai commented 10 months ago

https://github.com/openshift/oadp-operator/pull/1223

The ReportAfterEach that contained the cleanup code was refactored recently; maybe some bugs were introduced?

https://github.com/openshift/oadp-operator/pull/1223/files#diff-f7f90d1b403b9620176bb058724a225eacdbf0fb56df0c9f77f7cd4f3152f917L68

    var _ = ReportAfterEach(func(report SpecReport) {
        if report.State == types.SpecStateSkipped || report.State == types.SpecStatePending {
            // do not run if the test is skipped
            return
        }
        GinkgoWriter.Println("Report after each: state: ", report.State.String())
        if report.Failed() {
            // print namespace error events for app namespace
            if lastBRCase.ApplicationNamespace != "" {
                GinkgoWriter.Println("Printing app namespace events")
                PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, lastBRCase.ApplicationNamespace, lastInstallTime)
            }
            GinkgoWriter.Println("Printing oadp namespace events")
            PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, namespace, lastInstallTime)
            baseReportDir := artifact_dir + "/" + report.LeafNodeText
            err := os.MkdirAll(baseReportDir, 0755)
            Expect(err).NotTo(HaveOccurred())
            err = SavePodLogs(kubernetesClientForSuiteRun, namespace, baseReportDir)
            Expect(err).NotTo(HaveOccurred())
            err = SavePodLogs(kubernetesClientForSuiteRun, lastBRCase.ApplicationNamespace, baseReportDir)
            Expect(err).NotTo(HaveOccurred())
        }
        // remove app namespace if leftover (likely previously failed before reaching uninstall applications) to clear items such as PVCs which are immutable so that next test can create new ones
        err := dpaCR.Client.Delete(context.Background(), &corev1.Namespace{ObjectMeta: v1.ObjectMeta{
            Name:      lastBRCase.ApplicationNamespace,
            Namespace: lastBRCase.ApplicationNamespace,
        }}, &client.DeleteOptions{})
        if k8serror.IsNotFound(err) {
            err = nil
        }
        Expect(err).ToNot(HaveOccurred())

        err = dpaCR.Delete(runTimeClientForSuiteRun)
        Expect(err).ToNot(HaveOccurred())
        Eventually(IsNamespaceDeleted(kubernetesClientForSuiteRun, lastBRCase.ApplicationNamespace), timeoutMultiplier*time.Minute*2, time.Second*5).Should(BeTrue())
    })
kaovilai commented 10 months ago

I see this was moved into func tearDownBackupAndRestore(...).

mateusoliveira43 commented 10 months ago

I changed all ReportAfterEach to AfterEach

Example: https://github.com/openshift/oadp-operator/blob/master/tests/e2e/backup_restore_suite_test.go#L263-L65

kaovilai commented 10 months ago
    var _ = AfterEach(func(ctx SpecContext) {
        tearDownBackupAndRestore(lastBRCase, lastInstallTime, ctx.SpecReport())
    })

Being run here.

This should still not run on anything that is skipped.

        if report.State == types.SpecStateSkipped || report.State == types.SpecStatePending {
            // do not run if the test is skipped
            return
        }

I'm not seeing the SpecStateSkipped check added.
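
For reference, a minimal sketch of what a skip/pending guard could look like if it were carried over into the teardown helper (hypothetical code, not the repo's actual helper; the BackupRestoreCase type and the cleanup body are placeholders):

    package e2e

    import (
        "time"

        . "github.com/onsi/ginkgo/v2"
        "github.com/onsi/ginkgo/v2/types"
    )

    // BackupRestoreCase stands in for the suite's real test-case struct.
    type BackupRestoreCase struct {
        ApplicationNamespace string
    }

    // tearDownBackupAndRestore (hypothetical variant) bails out early when the
    // spec never ran, mirroring the old ReportAfterEach guard.
    func tearDownBackupAndRestore(brCase BackupRestoreCase, installTime time.Time, report types.SpecReport) {
        if report.State == types.SpecStateSkipped || report.State == types.SpecStatePending {
            // nothing was installed for a skipped or pending spec, so there is nothing to clean up
            return
        }
        GinkgoWriter.Println("entering after each, state:", report.State.String())
        // ... namespace, PVC and DPA cleanup would go here ...
    }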

mateusoliveira43 commented 10 months ago

I think this is not needed for AfterEach. The guard was only there because, per the docs for ReportAfterEach, "ReportAfterEach nodes are run for each spec, even if the spec is skipped or pending"; regular AfterEach nodes do not run for skipped or pending specs.

kaovilai commented 10 months ago

Not sure exactly what I spotted, but it was along the lines of failures where an existing workload deployment already existed and blocked new creation, somewhere in the logs.

kaovilai commented 10 months ago

I will run e2e later and check that cleanup still works, and will close this if so.

mateusoliveira43 commented 10 months ago

With the debug logs, it should be easy to catch missed AfterEach calls; they show up in the output as something like "entering after each".

mateusoliveira43 commented 10 months ago

example problem https://github.com/openshift/oadp-operator/pull/1239#issuecomment-1832048043

kaovilai commented 10 months ago

Example 2: https://github.com/openshift/oadp-operator/pull/1218#issuecomment-1832515977. There are 9 results for - "velero.io/backup-name": string(", which is part of the diff output showing an unexpected label on an existing workload that was not cleaned up.

kaovilai commented 10 months ago

The following backup name label diff should not occur on e2e-app creation if cleanup was done (i.e., the object in the cluster should not already exist).

diff found for key: metadata
    map[string]any{
        "creationTimestamp": string("2023-11-27T23:06:24Z"),
        "generation":        int64(1),
        "labels": map[string]any{
            "e2e-app":                string("true"),
  -         "velero.io/backup-name":  string("mongo-datamover-e2e-3ec695a1-8d79-11ee-8f5d-0a580a83a32f"),
  -         "velero.io/restore-name": string("mongo-datamover-e2e-3ec695aa-8d79-11ee-8f5d-0a580a83a32f"),
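
One way to surface leftovers earlier would be a precondition check before each e2e-app install that fails fast if the application namespace still exists from a previous spec. A minimal sketch, assuming a standard kubernetes.Interface clientset (the function name is illustrative, not existing suite code):

    package e2e

    import (
        "context"

        k8serror "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // appNamespaceIsGone reports whether the application namespace from a previous
    // spec has been fully deleted, so a new spec can safely recreate its workload.
    func appNamespaceIsGone(ctx context.Context, clientset kubernetes.Interface, ns string) (bool, error) {
        _, err := clientset.CoreV1().Namespaces().Get(ctx, ns, metav1.GetOptions{})
        if k8serror.IsNotFound(err) {
            return true, nil
        }
        // err == nil means the namespace (and possibly its immutable PVCs) is still around
        return false, err
    }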
mpryc commented 10 months ago

/label CI

openshift-ci[bot] commented 10 months ago

@mpryc: The label(s) /label CI cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, downstream-change-needed, rebase/manual, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to [this](https://github.com/openshift/oadp-operator/issues/1244#issuecomment-1833674087):

> /label CI

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-bot commented 7 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kaovilai commented 7 months ago

/lifecycle frozen

kaovilai commented 7 months ago

Looks like we're not cleaning up SCCs. This log kept popping up when installing apps:

updating resource: security.openshift.io/v1, Kind=SecurityContextConstraints; name: mongo-persistent-scc
sseago commented 7 months ago

@kaovilai yeah, those are cluster-scoped, so namespace removal won't get rid of them.
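
A hedged sketch of how a leftover cluster-scoped SCC could be removed explicitly during teardown (the function name is illustrative, and the security.openshift.io types are assumed to be registered in the client's scheme):

    package e2e

    import (
        "context"

        securityv1 "github.com/openshift/api/security/v1"
        k8serror "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // deleteLeftoverSCC deletes a SecurityContextConstraints object by name and
    // treats "not found" as success so repeated cleanup stays idempotent.
    func deleteLeftoverSCC(ctx context.Context, c client.Client, name string) error {
        scc := &securityv1.SecurityContextConstraints{ObjectMeta: metav1.ObjectMeta{Name: name}}
        err := c.Delete(ctx, scc, &client.DeleteOptions{})
        if k8serror.IsNotFound(err) {
            return nil
        }
        return err
    }

For example, deleteLeftoverSCC(ctx, dpaCR.Client, "mongo-persistent-scc") could be called from the same teardown path that deletes the application namespace, assuming that client has the security scheme registered.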