operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0
1.72k stars 545 forks source link

e2e InstallPlan failed on installplan_e2e_test.go:3242 "should clear clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created" #2516

Open akihikokuroda opened 2 years ago

akihikokuroda commented 2 years ago

Bug Report

Running Suite: End-to-end
=========================
Random Seed: 1639156900
Will run 1 of 169 specs

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
Install Plan when an InstallPlan is created with no valid OperatorGroup present 
  should clear clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created
  /Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:3193
[BeforeEach] when an InstallPlan is created with no valid OperatorGroup present
  /Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:3167
[It] should clear clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created
  /Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:3193
installplan ip is in phase Installing
Waiting for OperatorGroup(og) to be synced with status.namespaces: []
Waiting for OperatorGroup(og) to be synced with status.namespaces: [ns-s2qdg]
installplan ip is in phase Installing
installplan ip is in phase Installing
.........
installplan ip is in phase Installing
installplan ip is in phase Installing
installplan ip is in phase Installing
[AfterEach] when an InstallPlan is created with no valid OperatorGroup present
  /Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:3245
[AfterEach] Install Plan
  /Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:55
cleaning up ephemeral test resources...
deleting test subscriptions...
deleting test installplans...
deleting test catalogsources...
deleting test crds...
deleting test csvs...
test resources deleted

• Failure [61.482 seconds]
Install Plan
/Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:54
  when an InstallPlan is created with no valid OperatorGroup present
  /Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:3159
    should clear clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created [It]
    /Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:3193

    Timed out after 60.004s.
    Expected
        <bool>: false
    to be true

    /Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:3242
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

Summarizing 1 Failure:

[Fail] Install Plan when an InstallPlan is created with no valid OperatorGroup present [It] should clear clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created 
/Users/akihikokuroda/development/go/src/operator-lifecycle-manager/test/e2e/installplan_e2e_test.go:3242

Ran 1 of 169 Specs in 61.594 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 168 Skipped
--- FAIL: TestEndToEnd (61.65s)
FAIL

Ginkgo ran 1 suite in 1m13.816506417s
Test Suite Failed
exit status 1

What did you do? While I debug https://github.com/operator-framework/operator-lifecycle-manager/issues/2509, this error sometimes happened.
This error happens more often by making the following code change:

diff --git a/pkg/lib/queueinformer/queueinformer_operator.go b/pkg/lib/queueinformer/queueinformer_operator.go
index e009fe221..0cad1738c 100644
--- a/pkg/lib/queueinformer/queueinformer_operator.go
+++ b/pkg/lib/queueinformer/queueinformer_operator.go
@@ -285,7 +285,7 @@ func (o *operator) processNextWorkItem(ctx context.Context, loop *QueueInformer)

        // Sync and requeue on error (throw out failed deletion syncs)
        err := loop.Sync(ctx, event)
-       if requeues := queue.NumRequeues(item); err != nil && requeues < 8 && event.Type() != kubestate.ResourceDeleted {
+       if requeues := queue.NumRequeues(item); err != nil && requeues < 5 && event.Type() != kubestate.ResourceDeleted {
                logger.WithField("requeues", requeues).Trace("requeuing with rate limiting")
                utilruntime.HandleError(errors.Wrap(err, fmt.Sprintf("sync %q failed", item)))
                queue.AddRateLimited(item)

What did you expect to see? No error

What did you see instead? Under which circumstances? Error above

Environment

Possible Solution

Additional context Add any other context about the problem here.

akihikokuroda commented 2 years ago

This issue happened again https://github.com/operator-framework/operator-lifecycle-manager/runs/4984740013?check_suite_focus=true#step:4:2252

Here is the logs from the CI e2e failure.

catalog-operator

2022-01-28T18:42:57.428597518Z stderr F time="2022-01-28T18:42:57Z" level=info msg=syncing id=JOTKl ip=ip namespace=ns-8n6hh phase=
2022-01-28T18:42:57.441645773Z stderr F time="2022-01-28T18:42:57Z" level=debug msg="handling object deletion" name=nginx-g7gnm-stable namespace=operators
2022-01-28T18:42:57.446831713Z stderr F time="2022-01-28T18:42:57Z" level=info msg="skip processing installplan without status - subscription sync responsible for initial status" id=JOTKl ip=ip namespace=ns-8n6hh phase=
2022-01-28T18:42:57.446851414Z stderr F time="2022-01-28T18:42:57Z" level=info msg=syncing id=w3cYK ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.497347821Z stderr F time="2022-01-28T18:42:57Z" level=info msg="attenuated service account query failed - no operator group found that is managing this namespace" id=w3cYK ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.539212564Z stderr F E0128 18:42:57.538439       1 queueinformer_operator.go:290] sync {"update" "ns-8n6hh/ip"} failed: attenuated service account query failed - no operator group found that is managing this namespace
2022-01-28T18:42:57.539232565Z stderr F time="2022-01-28T18:42:57Z" level=info msg=syncing id=/uMKJ ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.629912004Z stderr F time="2022-01-28T18:42:57Z" level=info msg="attenuated service account query failed - no operator group found that is managing this namespace" id=/uMKJ ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.681570688Z stderr F E0128 18:42:57.680792       1 queueinformer_operator.go:290] sync {"update" "ns-8n6hh/ip"} failed: attenuated service account query failed - no operator group found that is managing this namespace
2022-01-28T18:42:57.683869539Z stderr F time="2022-01-28T18:42:57Z" level=info msg=syncing id=CMRMm ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.696906293Z stderr F time="2022-01-28T18:42:57Z" level=info msg="attenuated service account query failed - no operator group found that is managing this namespace" id=CMRMm ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.731208939Z stderr F E0128 18:42:57.731093       1 queueinformer_operator.go:290] sync {"update" "ns-8n6hh/ip"} failed: attenuated service account query failed - no operator group found that is managing this namespace
2022-01-28T18:42:57.738284503Z stderr F time="2022-01-28T18:42:57Z" level=info msg=syncing id=j5Ja8 ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.742706492Z stderr F time="2022-01-28T18:42:57Z" level=info msg="attenuated service account query failed - no operator group found that is managing this namespace" id=j5Ja8 ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.755134407Z stderr F E0128 18:42:57.755044       1 queueinformer_operator.go:290] sync {"update" "ns-8n6hh/ip"} failed: attenuated service account query failed - no operator group found that is managing this namespace
2022-01-28T18:42:57.756312084Z stderr F time="2022-01-28T18:42:57Z" level=info msg=syncing id=eIMRH ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.7580943Z stderr F time="2022-01-28T18:42:57Z" level=info msg="attenuated service account query failed - no operator group found that is managing this namespace" id=eIMRH ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.762736504Z stderr F E0128 18:42:57.762653       1 queueinformer_operator.go:290] sync {"update" "ns-8n6hh/ip"} failed: attenuated service account query failed - no operator group found that is managing this namespace
2022-01-28T18:42:57.844320948Z stderr F time="2022-01-28T18:42:57Z" level=info msg=syncing id=DRdrQ ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.856958876Z stderr F time="2022-01-28T18:42:57Z" level=info msg="attenuated service account query failed - no operator group found that is managing this namespace" id=DRdrQ ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:57.861376565Z stderr F E0128 18:42:57.861281       1 queueinformer_operator.go:290] sync {"update" "ns-8n6hh/ip"} failed: attenuated service account query failed - no operator group found that is managing this namespace
2022-01-28T18:42:58.023882107Z stderr F time="2022-01-28T18:42:58Z" level=info msg=syncing id=NYaCC ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:58.031423601Z stderr F time="2022-01-28T18:42:58Z" level=info msg="attenuated service account query failed - no operator group found that is managing this namespace" id=NYaCC ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:58.036289819Z stderr F E0128 18:42:58.036165       1 queueinformer_operator.go:290] sync {"update" "ns-8n6hh/ip"} failed: attenuated service account query failed - no operator group found that is managing this namespace
2022-01-28T18:42:58.03798443Z stderr F time="2022-01-28T18:42:58Z" level=info msg=syncing id=DecTc ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:58.039750846Z stderr F time="2022-01-28T18:42:58Z" level=info msg="attenuated service account query failed - no operator group found that is managing this namespace" id=DecTc ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:58.048911345Z stderr F E0128 18:42:58.048806       1 queueinformer_operator.go:290] sync {"update" "ns-8n6hh/ip"} failed: attenuated service account query failed - no operator group found that is managing this namespace
2022-01-28T18:42:58.357696052Z stderr F time="2022-01-28T18:42:58Z" level=info msg=syncing id=tUTy0 ip=ip namespace=ns-8n6hh phase=Installing
2022-01-28T18:42:58.361576206Z stderr F time="2022-01-28T18:42:58Z" level=info msg="attenuated service account query failed - no operator group found that is managing this namespace" id=tUTy0 ip=ip namespace=ns-8n6hh phase=Installing

There was no InstallPlan reconcile retry after this.(18:42:58.361576206Z) because it exceeded the retry count here: https://github.com/operator-framework/operator-lifecycle-manager/blob/623572b7ade72694f5c9337cfe21edf61888cb9a/pkg/lib/queueinformer/queueinformer_operator.go#L288

OperatorGroup was created after that.(18:42:58.46908094Z) olm-operator

2022-01-28T18:42:58.46908094Z stderr F time="2022-01-28T18:42:58Z" level=debug msg="updated target namespaces" namespace=ns-8n6hh operatorGroup=og targetNamespaces="[ns-8n6hh]"
2022-01-28T18:42:58.469108442Z stderr F time="2022-01-28T18:42:58Z" level=debug msg="OperatorGroup namespaces change detected" namespace=ns-8n6hh operatorGroup=og
2022-01-28T18:42:58.469112743Z stderr F time="2022-01-28T18:42:58Z" level=debug msg="namespace change detected" namespace=ns-8n6hh operatorGroup=og targets="[ns-8n6hh]"
2022-01-28T18:42:58.479606329Z stderr F time="2022-01-28T18:42:58Z" level=debug msg="operatorgroup status updated" namespace=ns-8n6hh operatorGroup=og
2022-01-28T18:42:58.47962623Z stderr F time="2022-01-28T18:42:58Z" level=debug msg="Requeueing out of sync namespaces" namespace=ns-8n6hh operatorGroup=og
2022-01-28T18:42:58.479630631Z stderr F time="2022-01-28T18:42:58Z" level=debug msg=requeueing namespace=ns-8n6hh operatorGroup=og
2022-01-28T18:42:58.480876412Z stderr F time="2022-01-28T18:42:58Z" level=debug msg="updated target namespaces" namespace=ns-8n6hh operatorGroup=og targetNamespaces="[ns-8n6hh]"
2022-01-28T18:42:58.480888713Z stderr F time="2022-01-28T18:42:58Z" level=debug msg="check that operatorgroup has updated CSV anotations" namespace=ns-8n6hh operatorGroup=og
2022-01-28T18:42:58.480892513Z stderr F time="2022-01-28T18:42:58Z" level=debug msg="OperatorGroup CSV annotation completed" namespace=ns-8n6hh operatorGroup=og
2022-01-28T18:42:58.517834031Z stderr F time="2022-01-28T18:42:58Z" level=debug msg="operatorgroup clusterroles ensured" namespace=ns-8n6hh operatorGroup=og

So the InstallPlan stayed in the Installing phase forever.