operator-framework / helm-operator-plugins

Experimental refactoring of the operator-framework's helm operator
Apache License 2.0

Operator infinite loop on transient errors from kube API #378

Closed: porridge closed this issue 1 month ago

porridge commented 2 months ago

This change to the unit tests exposes the issue. With it applied, the test fails:

• [FAILED] [0.011 seconds]
Updater when an update is a change [It] should apply an update status function
/home/mowsiany/go/src/github.com/operator-framework/helm-operator-plugins/pkg/reconciler/internal/updater/updater_test.go:102

  [FAILED] HaveLen matcher expects a string/array/map/channel/slice.  Got:
      <nil>: nil
  In [It] at: /home/mowsiany/go/src/github.com/operator-framework/helm-operator-plugins/pkg/reconciler/internal/updater/updater_test.go:108 @ 08/20/24 07:39:09.843

When the API server returns a transient error, the updater never retries the update, provided that the update function correctly returns false when it has no effect.

This is because the Apply function reuses the same object on each update attempt. The first attempt already mutates that object, so on subsequent attempts the update functions report that nothing changed, and no further update is sent even though the first one failed.
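
To make the mechanism concrete, here is a minimal, self-contained model of the problem. It is a sketch, not the actual updater code: `object`, `flakyServer`, `retry`, `applyBuggy`, `applyFixed` and the finalizer name are all hypothetical, and the "fixed" variant shows only one possible remedy (starting each attempt from an unmodified copy), which is not necessarily the fix adopted in the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// object is a hypothetical stand-in for the custom resource.
type object struct {
	finalizers []string
}

// updateFunc mutates obj and reports whether it changed anything.
type updateFunc func(obj *object) bool

// removeFinalizer returns an updateFunc that drops the named finalizer.
func removeFinalizer(name string) updateFunc {
	return func(obj *object) bool {
		for i, f := range obj.finalizers {
			if f == name {
				obj.finalizers = append(obj.finalizers[:i], obj.finalizers[i+1:]...)
				return true
			}
		}
		return false
	}
}

// flakyServer rejects the first `failures` updates, then stores the object.
type flakyServer struct {
	failures int
	stored   object
}

func (s *flakyServer) update(obj *object) error {
	if s.failures > 0 {
		s.failures--
		return errors.New("500 Internal Server Error")
	}
	s.stored = object{finalizers: append([]string(nil), obj.finalizers...)}
	return nil
}

// retry calls fn up to `attempts` times, stopping at the first success.
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// applyBuggy runs the update functions on the same, already mutated object on
// every attempt. After a failed first attempt they report "no change", so the
// update is skipped and the retry loop returns success.
func applyBuggy(s *flakyServer, obj *object, funcs []updateFunc) error {
	return retry(3, func() error {
		needsUpdate := false
		for _, f := range funcs {
			needsUpdate = f(obj) || needsUpdate
		}
		if !needsUpdate {
			return nil
		}
		return s.update(obj)
	})
}

// applyFixed resets obj to an unmodified copy before each attempt, so the
// update functions report the change again after a failure.
func applyFixed(s *flakyServer, obj *object, funcs []updateFunc) error {
	pristine := append([]string(nil), obj.finalizers...)
	return retry(3, func() error {
		obj.finalizers = append([]string(nil), pristine...)
		needsUpdate := false
		for _, f := range funcs {
			needsUpdate = f(obj) || needsUpdate
		}
		if !needsUpdate {
			return nil
		}
		return s.update(obj)
	})
}

func main() {
	const fin = "example.com/cleanup" // hypothetical finalizer name
	funcs := []updateFunc{removeFinalizer(fin)}

	buggy := &flakyServer{failures: 1, stored: object{finalizers: []string{fin}}}
	err := applyBuggy(buggy, &object{finalizers: []string{fin}}, funcs)
	fmt.Println("buggy:", err, "finalizers on server:", len(buggy.stored.finalizers))

	fixed := &flakyServer{failures: 1, stored: object{finalizers: []string{fin}}}
	err = applyFixed(fixed, &object{finalizers: []string{fin}}, funcs)
	fmt.Println("fixed:", err, "finalizers on server:", len(fixed.stored.finalizers))
}
```

In this model the buggy variant returns a nil error while the server still holds the finalizer, whereas the fixed variant actually delivers the update on its second attempt.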

Worse, and arguably a separate issue, in the deletion case the code then waits forever for the deletion to happen. It never does, because the update functions run by doUninstall correctly return false on subsequent attempts, so the update is never resent.

In my case the transient error was a 500 from the API server, caused in turn by a misbehaving validation webhook.

porridge commented 2 months ago

For the record, here are the kube-apiserver log entries corresponding to requests from the controller. Note how, after the 500 response to the update, it goes into a loop of constantly issuing GETs. (Some details omitted for brevity.)

I0806 00:15:11.933824 verb="GET" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v1" resp=200
I0806 00:15:12.136666 verb="DELETE" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v1" resp=200
I0806 00:15:12.209415 verb="GET" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v2" resp=200
I0806 00:15:12.379657 verb="DELETE" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v2" resp=200
I0806 00:15:12.401052 verb="GET" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v3" resp=200
I0806 00:15:12.470125 verb="DELETE" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v3" resp=200
I0806 00:15:12.504535 verb="GET" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v4" resp=200
I0806 00:15:12.718316 verb="DELETE" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v4" resp=200
I0806 00:15:12.769334 verb="GET" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v5" resp=200
I0806 00:15:12.897561 verb="DELETE" URI="/api/v1/namespaces/kuttl-test-amazing-bonefish/secrets/sh.helm.release.v1.stackrox-central-services.v5" resp=200
I0806 00:15:12.950025 verb="PUT" URI="/apis/platform.stackrox.io/v1alpha1/namespaces/kuttl-test-amazing-bonefish/centrals/stackrox-central-services/status" resp=200
I0806 00:15:13.026250 verb="PUT" URI="/apis/platform.stackrox.io/v1alpha1/namespaces/kuttl-test-amazing-bonefish/centrals/stackrox-central-services" resp=500 statusStack=<
I0806 00:15:13.056250 verb="GET" URI="/apis/platform.stackrox.io/v1alpha1/namespaces/kuttl-test-amazing-bonefish/centrals/stackrox-central-services" resp=200
I0806 00:15:13.066183 verb="GET" URI="/apis/platform.stackrox.io/v1alpha1/namespaces/kuttl-test-amazing-bonefish/centrals/stackrox-central-services" resp=200
I0806 00:15:13.075118 verb="GET" URI="/apis/platform.stackrox.io/v1alpha1/namespaces/kuttl-test-amazing-bonefish/centrals/stackrox-central-services" resp=200
I0806 00:15:13.094776 verb="GET" URI="/apis/platform.stackrox.io/v1alpha1/namespaces/kuttl-test-amazing-bonefish/centrals/stackrox-central-services" resp=200
I0806 00:15:13.109019 verb="GET" URI="/apis/platform.stackrox.io/v1alpha1/namespaces/kuttl-test-amazing-bonefish/centrals/stackrox-central-services" resp=200
[these repeat until cluster turndown]