knative / operator

Combined operator for Knative.
Apache License 2.0

Knative Operator panics at boot #1652

Closed joshfrench closed 6 months ago

joshfrench commented 7 months ago

Describe the bug

Knative Operator pod crashes with a panic shortly after launch:

{"severity":"INFO","timestamp":"2023-12-06T20:57:08.867531412Z","logger":"knative-operator","caller":"leaderelection/context.go:158","message":"\"knative-operator-588d7bdd54-v8ddw_7a8eabba-b3b1-4e9f-8584-7d872c42479c\" has started leading \"knative-operator.knative.dev.operator.pkg.reconciler.knativeserving.reconciler.00-of-01\"","commit":"bd823f9-dirty","knative.dev/pod":"knative-operator-588d7bdd54-v8ddw"}
{"severity":"INFO","timestamp":"2023-12-06T20:57:08.877683974Z","logger":"knative-operator","caller":"knativeserving/knativeserving.go:83","message":"Deleting cluster-scoped resources","commit":"bd823f9-dirty","knative.dev/pod":"knative-operator-588d7bdd54-v8ddw","knative.dev/controller":"knative.dev.operator.pkg.reconciler.knativeserving.Reconciler","knative.dev/kind":"operator.knative.dev.KnativeServing","knative.dev/traceid":"92aa5db7-f984-4570-9556-e3f16e671eef","knative.dev/key":"knative-serving/knative-serving"}
panic: runtime error: index out of range [0] with length 0

goroutine 145 [running]:
knative.dev/operator/pkg/reconciler/common.FetchManifestFromArray({0x0?, 0xc000a64a80?, 0xc000a57588?})
        knative.dev/operator/pkg/reconciler/common/releases.go:237 +0x4f0
knative.dev/operator/pkg/reconciler/knativeserving.(*Reconciler).installed(0xc0000f8460, {0x204ee18, 0xc000a6a570}, {0x2066d50?, 0xc000a5c780})
        knative.dev/operator/pkg/reconciler/knativeserving/knativeserving.go:158 +0x96
knative.dev/operator/pkg/reconciler/knativeserving.(*Reconciler).FinalizeKind(0xc0000f8460, {0x204ee18, 0xc000a6a570}, 0xc0001b2230?)
        knative.dev/operator/pkg/reconciler/knativeserving/knativeserving.go:84 +0x35d
knative.dev/operator/pkg/client/injection/reconciler/operator/v1beta1/knativeserving.(*reconcilerImpl).Reconcile(0xc000335e00, {0x204ee18, 0xc000a6a540}, {0xc000a66120, 0x1f})
        knative.dev/operator/pkg/client/injection/reconciler/operator/v1beta1/knativeserving/reconciler.go:241 +0x3b8
knative.dev/pkg/controller.(*Impl).processNextWorkItem(0xc00010bc80)
        knative.dev/pkg@v0.0.0-20231103063133-e287426d1833/controller/controller.go:542 +0x4ad
knative.dev/pkg/controller.(*Impl).RunContext.func3()
        knative.dev/pkg@v0.0.0-20231103063133-e287426d1833/controller/controller.go:491 +0x59
created by knative.dev/pkg/controller.(*Impl).RunContext in goroutine 122
        knative.dev/pkg@v0.0.0-20231103063133-e287426d1833/controller/controller.go:489 +0x349

This is using an unmodified manifest from https://github.com/knative/operator/releases/download/knative-v1.12.1/operator.yaml.
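The panic message in the trace is what the Go runtime raises when code reads element 0 of an empty slice. The following is a minimal sketch of that failure mode only. `firstManifest` and `tryFirstManifest` are hypothetical names invented for illustration and are not the operator's code; the actual body of `common.FetchManifestFromArray` is not shown in this issue.

```go
package main

import "fmt"

// firstManifest is a hypothetical stand-in for code that reads the first
// entry of a manifest list without checking its length first.
func firstManifest(manifests []string) string {
	return manifests[0] // panics when len(manifests) == 0
}

// tryFirstManifest calls firstManifest and converts a panic into an error,
// so the failure can be inspected instead of crashing the process.
func tryFirstManifest(manifests []string) (result string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return firstManifest(manifests), nil
}

func main() {
	// An empty (nil) list triggers the same runtime error seen in the logs.
	_, err := tryFirstManifest(nil)
	fmt.Println(err)
	// prints: recovered: runtime error: index out of range [0] with length 0
}
```

Without the `recover`, the same access would terminate the process, which matches the crash-on-boot behavior described above.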

Expected behavior

Not a panic.

To Reproduce

Apply Knative Operator manifest. Wait.

Knative release version

I have witnessed this on v1.12.0 and v1.12.1.

atzawada commented 7 months ago

Seeing the same issue with version 1.11.10 as well:

{"severity":"INFO","timestamp":"2023-12-12T21:08:32.082738159Z","logger":"knative-operator","caller":"leaderelection/context.go:158","message":"\"knative-operator-6db6cdf8d5-4m8b7_7b06b719-fee2-43a3-8908-329fab17315b\" has started leading \"knative-operator.knative.dev.operator.pkg.reconciler.knativeserving.reconciler.00-of-01\"","commit":"cd1aafc-dirty","knative.dev/pod":"knative-operator-6db6cdf8d5-4m8b7"}
{"severity":"INFO","timestamp":"2023-12-12T21:08:32.102931158Z","logger":"knative-operator","caller":"knativeserving/knativeserving.go:83","message":"Deleting cluster-scoped resources","commit":"cd1aafc-dirty","knative.dev/pod":"knative-operator-6db6cdf8d5-4m8b7","knative.dev/controller":"knative.dev.operator.pkg.reconciler.knativeserving.Reconciler","knative.dev/kind":"operator.knative.dev.KnativeServing","knative.dev/traceid":"ed42b3fe-cddb-4ef7-9bc1-53160e625810","knative.dev/key":"knative-serving/knative-serving"}
panic: runtime error: index out of range [0] with length 0

goroutine 703 [running]:
knative.dev/operator/pkg/reconciler/common.FetchManifestFromArray({0x0?, 0x0?, 0x1b09c20?})
        knative.dev/operator/pkg/reconciler/common/releases.go:237 +0x4f0
knative.dev/operator/pkg/reconciler/knativeserving.(*Reconciler).installed(0xc000292000, {0x1ff4468, 0xc001a74c60}, {0x200bcd0?, 0xc001c8a000})
        knative.dev/operator/pkg/reconciler/knativeserving/knativeserving.go:158 +0x96
knative.dev/operator/pkg/reconciler/knativeserving.(*Reconciler).FinalizeKind(0xc000292000, {0x1ff4468, 0xc001a74c60}, 0xc0000dd1a0?)
        knative.dev/operator/pkg/reconciler/knativeserving/knativeserving.go:84 +0x35d
knative.dev/operator/pkg/client/injection/reconciler/operator/v1beta1/knativeserving.(*reconcilerImpl).Reconcile(0xc0003d4000, {0x1ff4468, 0xc001a74c30}, {0xc000e226e0, 0x1f})
        knative.dev/operator/pkg/client/injection/reconciler/operator/v1beta1/knativeserving/reconciler.go:241 +0x3b8
knative.dev/pkg/controller.(*Impl).processNextWorkItem(0xc000b96660)
        knative.dev/pkg@v0.0.0-20231023150739-56bfe0dd9626/controller/controller.go:542 +0x4ad
knative.dev/pkg/controller.(*Impl).RunContext.func3()
        knative.dev/pkg@v0.0.0-20231023150739-56bfe0dd9626/controller/controller.go:491 +0x59
created by knative.dev/pkg/controller.(*Impl).RunContext in goroutine 698
        knative.dev/pkg@v0.0.0-20231023150739-56bfe0dd9626/controller/controller.go:489 +0x349

joshfrench commented 7 months ago

I did not pinpoint the issue, but I have a hunch it was related to orphaned KnativeServing CRs left around from previous deployments, possibly in other namespaces.

ReToCode commented 7 months ago

I cannot reproduce this on an empty cluster, which supports the theory in https://github.com/knative/operator/issues/1652#issuecomment-1854565124. Can you provide instructions to reproduce the issue (maybe start by listing all existing Knative-related K8s objects)?

joshfrench commented 7 months ago

We moved past it by simply deleting all Knative-related resources and reinstalling. As another data point, we are using Helm to manage these charts, and Helm notoriously does not remove CRDs when you delete a release. So there was a path to end up with a valid KnativeServing even after the operator was deleted. IIRC the impacted workflow was something like:

  1. Install operator
  2. Install KnativeServing CR
  3. Delete operator
  4. Reinstall operator (maybe in a different namespace than the surviving KnativeServing?)

houshengbo commented 6 months ago

@joshfrench @ReToCode I opened a PR to fix this issue. It makes sure that the list is not empty before calling common.FetchManifestFromArray.

houshengbo commented 6 months ago

Normally, status.manifests should not be empty at all with the operator. However, it looks like we have to check.