OLM CI Tracking

timflannagan commented 3 years ago

CI Improvements

[x] #2413
- assignee: @njhale
[ ] Automatically rebase/retest PRs when master changes
[ ] Aggregate test failure rates against master
[ ] Detect controller hotlooping
[ ] Detect memory leaks

Controller Improvements

[ ] #2410
- assignee: @exdx

Flakes

[x] "Install Plan creation with pre existing CRD owners PreExistingCRDOwnerIsReplaced"
- This test case was removed entirely in https://github.com/operator-framework/operator-lifecycle-manager/pull/2392
[x] "Subscription with starting CSV"
- Fixed in https://github.com/operator-framework/operator-lifecycle-manager/pull/2392
[x] "ClusterServiceVersion emits CSV requirement events"
- Fixed in https://github.com/operator-framework/operator-lifecycle-manager/pull/2385
[ ] #2417
- https://github.com/operator-framework/operator-lifecycle-manager/pull/2396 attempted to solve the issue but it's still popping up
[ ] #2409
- assignee: @exdx
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_operator-framework-olm/196/pull-ci-openshift-operator-framework-olm-master-e2e-aws-olm/1446125274558107648
- https://github.com/operator-framework/operator-lifecycle-manager/pull/2414/checks?check_run_id=3833989718
[x] #2405
- https://github.com/operator-framework/operator-lifecycle-manager/runs/3821548502?check_suite_focus=true#step:4:8419
- assignee: @awgreene
[ ] #2411
- assignee: @exdx
[x] #2412
- assignee: @exdx
[x] #2408
- https://github.com/operator-framework/operator-lifecycle-manager/runs/3817364227?check_suite_focus=true#step:4:9481
- assignee: @estroz
[ ] #2407
- https://github.com/operator-framework/operator-lifecycle-manager/runs/3821548502?check_suite_focus=true#step:4:8422
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_operator-framework-olm/196/pull-ci-openshift-operator-framework-olm-master-e2e-aws-olm/1446125274558107648
- assignee: @tylerslaton
[ ] "Install Plan when an InstallPlan is created with no valid OperatorGroup present should clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created"
[ ] "Subscription when A subscription is created for an operator that requires an API that is not available when the required API is made available the ResolutionFailed condition previously set in it's status that indicated the resolution error is cleared off"
- Note: this appears to be fairly reproducible locally when running make e2e-local E2E_SEED=1633621246 and focusing on the Describe("Subscription") top-level spec.
- The gRPC unknown state maps to state number 2: Got source event: grpc.SourceState{Key:registry.CatalogKey{Name:\"test-catalog-fwgtr\", Namespace:\"operators\"}, State:2}.
- When poking around that test-catalog-fwgtr CatalogSource locally, it's reporting a Ready .Status.LastObservedState, and no InstallPlan resource was able to be generated.
[ ] "Subscription when an entry in the middle of a channel does not provide a required GVK should create a Subscription for the latest entry providing the required GVK"
[ ] "ClusterServiceVersion when a csv exists specifying two replicas with one max unavailable remains in phase Succeeded when only one pod is available"
[ ] "Subscription can reconcile InstallPlan status"
[ ] "Operator Group cleanup csvs with bad owner operator groups"
[ ] "Catalog represents a store of bundles which OLM can use to install Operators adding catalog template adjusts image used"
[ ] "Catalog represents a store of bundles which OLM can use to install Operators [AfterEach] gRPC address catalog source"
[ ] #2440
[ ] "Subscription creation with dependencies required and provided in different versions of an operator in the same package"
[ ] "CRD Versions allows a CRD upgrade that doesn't cause data loss"
[ ] "Metrics are generated for OLM managed resources/a CSV is created/the OLM pod restarts"
[ ] "Subscription creation with pod config"
- https://github.com/operator-framework/operator-lifecycle-manager/pull/2414/checks?check_run_id=3833989718
[ ] "Subscription creation in case of transferring providedAPIs"
[ ] "Garbage collection for dependent resources when a bundle with a configmap is installed [BeforeEach] when the subscription is updated to a later CSV with a configmap with the same name but new data OLM should have upgraded associated configmap in place"
[ ] "Subscription creation manual approval"
[ ] "creation with pod config"
[ ] "static provider"
[ ] "should surface components in its status"
- https://github.com/operator-framework/operator-lifecycle-manager/runs/4159300857?check_suite_focus=true
[ ] "creation using existing CSV"
[x] #2441
[ ] "ClusterServiceVersion when a CustomResourceDefinition was installed alongside a ClusterServiceVersion can satisfy an associated ClusterServiceVersion's ownership requirement [AfterEach]"
- https://github.com/operator-framework/operator-lifecycle-manager/runs/4155819252?check_suite_focus=true
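A note for reading the State:2 value in the source event quoted in the list above. This is a minimal sketch, assuming the State field carries a grpc-go connectivity.State. The connState type below is a local stand-in (not OLM or grpc code) that mirrors the ordering and printed names of grpc-go's constants so it runs without the grpc module. Under that assumption, 2 would read as READY, which would line up with the Ready .Status.LastObservedState seen on the CatalogSource.

```go
package main

import "fmt"

// connState is a local stand-in that mirrors the ordering of grpc-go's
// connectivity.State constants. It is not the real type.
type connState int

const (
	idle             connState = iota // 0
	connecting                        // 1
	ready                             // 2
	transientFailure                  // 3
	shutdown                          // 4
)

// String returns the same names grpc-go prints for each state.
func (s connState) String() string {
	switch s {
	case idle:
		return "IDLE"
	case connecting:
		return "CONNECTING"
	case ready:
		return "READY"
	case transientFailure:
		return "TRANSIENT_FAILURE"
	case shutdown:
		return "SHUTDOWN"
	default:
		return "INVALID_STATE"
	}
}

func main() {
	// 2 is the State value from the quoted source event.
	fmt.Println(connState(2)) // prints "READY"
}
```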

Misc/Needs Home/Triage/etc.

[x] #2382
[ ] #2386
[ ] #2400
[ ] #2402
[x] #2393

timflannagan commented 3 years ago

Note: the "Garbage collection for dependent resources when a bundle with configmap and secret objects is installed when the CSV is deleted OLM ..." test blocks are increasingly reproducible. When poking around the "should have removed the old configmap and put the new configmap in place" test, it appears the catalog operator hotloops when processing a Subscription that previously failed resolution. There also appears to be contention from always attempting to remove that status condition by firing off blind Update calls.

timflannagan commented 3 years ago

Misc: the need for an automatic rebasing mechanism for open PRs once a new PR has been merged into master.

timflannagan commented 3 years ago

Misc: the need to update the test provisioner so it also attempts to gather testing artifacts before deleting the cluster.

timflannagan commented 3 years ago

Misc: seeing quite a few connection-refused errors in the catalog-operator logs when firing off ListBundles calls:

E1006 17:13:09.466730       1 queueinformer_operator.go:290] sync "operators" failed: [error using catalog test-catalog (in namespace operators): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.133.46:50051: connect: connection refused", error using catalog operatorhubio-catalog (in namespace operator-lifecycle-manager): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.33.109:50051: connect: connection refused"]

timflannagan commented 3 years ago

https://github.com/operator-framework/operator-lifecycle-manager/issues/2420 - another quality of life issue when running e2e locally.

exdx commented 3 years ago

There's occasionally a panic in the TestConnectionEvents series of unit tests when a 10-minute timeout occurs. This is seen in https://github.com/operator-framework/operator-lifecycle-manager/pull/2425/checks?check_run_id=3899261291

akihikokuroda commented 2 years ago

As of today (01/21/2022), I see the following e2e failures.

should have copied CSVs in all other Namespaces
should create a Subscription for the latest entry providing the required GVK
delete internal registry pod triggers recreation
can satisfy an associated ClusterServiceVersion's ownership requirement
should surface components in its status

In addition to these, I see some failures that are caused by the installplan creation wait timeout. They have the following in the test log.

waiting for catalog pod scoped-catsrc-hzt42 to be available (for sync) - TRANSIENT_FAILURE
catalog scoped-catsrc-hzt42 pod with address scoped-catsrc-hzt42.scoped-ns-cfw9r.svc:50051
03:47:22.1316:  (): nil
waiting for scoped-sub-wz8bw to have installplan ref
03:47:23.131:  (): nil
waiting for scoped-sub-wz8bw to have installplan ref
03:47:24.1319:  (): nil
waiting for scoped-sub-wz8bw to have installplan ref
03:47:25.1315:  (): nil
waiting for scoped-sub-wz8bw to have installplan ref

.........

waiting for scoped-sub-wz8bw to have installplan ref
03:52:21.1343: never got correct status: v1alpha1.SubscriptionStatus{CurrentCSV:"", InstalledCSV:"", Install:(*v1alpha1.InstallPlanReference)(nil), State:"", Reason:"", InstallPlanGeneration:0, InstallPlanRef:(*v1.ObjectReference)(nil), CatalogHealth:

I'll open issues for them later.

operator-framework / operator-lifecycle-manager

OLM CI Tracking #2401
