operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0
1.7k stars, 542 forks

OLM CI Tracking #2401

Open timflannagan opened 2 years ago

timflannagan commented 2 years ago

CI Improvements

Controller Improvements

Flakes

Misc/Needs Home/Triage/etc.

timflannagan commented 2 years ago

Note: failures in the "Garbage collection for dependent resources when a bundle with configmap and secret objects is installed when the CSV is deleted OLM ..." test blocks are increasingly reproducible. While poking around the "should have removed the old configmap and put the new configmap in place" test, I found what looks like hotlooping in the catalog operator when it processes a Subscription that previously failed resolution, plus contention from always trying to remove that status condition by firing off blind Update calls.
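To make the "blind Update" concern concrete, here is a minimal sketch of one way to avoid it: only write when the condition was actually present. This is not OLM's code. The `Condition` type, `removeCondition`, `syncStatus`, and the `update` callback are hypothetical stand-ins, and the condition name is just an illustrative string.

```go
package main

import "fmt"

// Condition is a hypothetical, simplified stand-in for a Subscription
// status condition; it is not the OLM API type.
type Condition struct {
	Type   string
	Status string
}

// removeCondition returns conds without any condition of the given type
// and reports whether anything was actually removed.
func removeCondition(conds []Condition, condType string) ([]Condition, bool) {
	out := make([]Condition, 0, len(conds))
	removed := false
	for _, c := range conds {
		if c.Type == condType {
			removed = true
			continue
		}
		out = append(out, c)
	}
	return out, removed
}

// syncStatus calls update only when the condition was present, so a
// no-op sync does not issue a write that could retrigger the sync.
func syncStatus(conds []Condition, condType string, update func([]Condition) error) ([]Condition, error) {
	next, removed := removeCondition(conds, condType)
	if !removed {
		return conds, nil // nothing changed: skip the write
	}
	if err := update(next); err != nil {
		return conds, err
	}
	return next, nil
}

func main() {
	writes := 0
	// update is a hypothetical stand-in for the API server Update call.
	update := func([]Condition) error { writes++; return nil }

	conds := []Condition{{Type: "ResolutionFailed", Status: "True"}}
	conds, _ = syncStatus(conds, "ResolutionFailed", update) // removes the condition
	conds, _ = syncStatus(conds, "ResolutionFailed", update) // no-op, no write
	fmt.Println(len(conds), writes)
}
```

Whether this matches the actual cause of the hotloop is an assumption. The sketch only shows the general "compare before writing" pattern.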

timflannagan commented 2 years ago

Misc: we need an automatic rebasing mechanism for open PRs once a new PR has been merged into master.

timflannagan commented 2 years ago

Misc: the test provisioner needs updating so that it also attempts to gather testing artifacts before deleting the cluster.

timflannagan commented 2 years ago

Misc: seeing quite a bit of connection-refused logs in the catalog-operator when firing off ListBundles calls:

E1006 17:13:09.466730       1 queueinformer_operator.go:290] sync "operators" failed: [error using catalog test-catalog (in namespace operators): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.133.46:50051: connect: connection refused", error using catalog operatorhubio-catalog (in namespace operator-lifecycle-manager): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.33.109:50051: connect: connection refused"]

timflannagan commented 2 years ago

https://github.com/operator-framework/operator-lifecycle-manager/issues/2420 - another quality of life issue when running e2e locally.

exdx commented 2 years ago

The TestConnectionEvents series of unit tests occasionally panics when a 10-minute timeout occurs. An example is in https://github.com/operator-framework/operator-lifecycle-manager/pull/2425/checks?check_run_id=3899261291

akihikokuroda commented 2 years ago

As of today (01/21/2022), I see the following e2e failures.

In addition to these, I see some failures caused by a timeout while waiting for installplan creation. The test log contains the following:

waiting for catalog pod scoped-catsrc-hzt42 to be available (for sync) - TRANSIENT_FAILURE
catalog scoped-catsrc-hzt42 pod with address scoped-catsrc-hzt42.scoped-ns-cfw9r.svc:50051
03:47:22.1316:  (): nil
waiting for scoped-sub-wz8bw to have installplan ref
03:47:23.131:  (): nil
waiting for scoped-sub-wz8bw to have installplan ref
03:47:24.1319:  (): nil
waiting for scoped-sub-wz8bw to have installplan ref
03:47:25.1315:  (): nil
waiting for scoped-sub-wz8bw to have installplan ref

.........

waiting for scoped-sub-wz8bw to have installplan ref
03:52:21.1343: never got correct status: v1alpha1.SubscriptionStatus{CurrentCSV:"", InstalledCSV:"", Install:(*v1alpha1.InstallPlanReference)(nil), State:"", Reason:"", InstallPlanGeneration:0, InstallPlanRef:(*v1.ObjectReference)(nil), CatalogHealth:

I'll open issues for them later.