opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator
Apache License 2.0
393 stars 207 forks source link

Changing the storage class of a component leads to a operator crash #587

Open tomsiewert opened 1 year ago

tomsiewert commented 1 year ago

If the storage class of an already deployed component gets changed, the operator will crash with a panic. The behaviour is kind of expected as the storage class is immutable in a StatefulSet, but I'd expect that the CRD validation prohibits me to change the storage class.

Panic:

1.6920103233611114e+09  INFO    Observed a panic in reconciler: runtime error: index out of range [0] with length 0 {"controller": "opensearchcluster", "controllerGroup": "opensearch.opster.io", "controllerKind": "OpenSearchCluster", "OpenSearchCluster": {"name":"egh-opensearch1","namespace":"opensearch"}, "namespace": "opensearch", "name": "egh-opensearch1", "reconcileID": "8afc127a-c94e-4ab8-a8ae-eaadfd55f81c"}
panic: runtime error: index out of range [0] with length 0 [recovered]
    panic: runtime error: index out of range [0] with length 0

goroutine 440 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:118 +0x1f4
panic({0x18df660, 0xc000e1e000})
    /usr/local/go/src/runtime/panic.go:884 +0x212
opensearch.opster.io/pkg/reconcilers.(*ClusterReconciler).reconcileNodeStatefulSet(0xc00001f2d0, {{0xc0009c4126, 0x6}, 0x3, {0xc0009c412c, 0x4}, {0xc000c5a7e0, 0xc000c5a810}, {0xc0009c4130, 0xd}, ...}, ...)
    /workspace/pkg/reconcilers/cluster.go:287 +0x1cf9
opensearch.opster.io/pkg/reconcilers.(*ClusterReconciler).Reconcile(0xc00001f2d0)
    /workspace/pkg/reconcilers/cluster.go:109 +0xc50
opensearch.opster.io/controllers.(*OpenSearchClusterReconciler).reconcilePhaseRunning(0xc0003ccd20, {0x1c892d8?, 0xc000c5a5d0})
    /workspace/controllers/opensearchController.go:320 +0x725
opensearch.opster.io/controllers.(*OpenSearchClusterReconciler).Reconcile(0xc0003ccd20, {0x1c892d8?, 0xc000c5a5d0}, {{{0xc0009c40a0, 0xa}, {0xc0009c4090, 0xf}}})
    /workspace/controllers/opensearchController.go:141 +0x7a8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1c89230?, {0x1c892d8?, 0xc000c5a5d0?}, {{{0xc0009c40a0?, 0x1915a60?}, {0xc0009c4090?, 0xc0002b8de8?}}})
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:121 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000c1c3c0, {0x1c89230, 0xc00092bac0}, {0x1845cc0?, 0xc0003a8080?})
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:320 +0x33c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000c1c3c0, {0x1c89230, 0xc00092bac0})
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:230 +0x333

Minimal config where I can reproduce that:

---
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: test-cluster
spec:
  general:
    serviceName: test-cluster
    version: 2.8.0
  dashboards:
    enable: true
    version: 2.8.0
    replicas: 1
    resources:
      requests:
         memory: "512Mi"
         cpu: "200m"
      limits:
         memory: "512Mi"
         cpu: "200m"
  nodePools:
    - component: master
      replicas: 3
      diskSize: "10Gi"
      jvm: -Xmx2G -Xms2G
      resources:
        requests:
          memory: "4Gi"
          cpu: "1"
        limits:
          memory: "4Gi"
          cpu: "1"
      roles:
        - "cluster_manager"
        - "data"
      persistence:
        pvc:
          storageClass: firststorageever
          accessModes:
            - ReadWriteOnce

After a successful deployment, change the sc from firststorageever to another sc in the cluster.

swoehrl-mw commented 1 year ago

Hi @tomsiewert. Thanks for reporting this. You are right that this error should be caught and should not crash the operator.

Currently the operator does not support validation during aply (using a validatingwebhook), as such a short term fix would be to catch this error during reconciliation and report it using an event + status. Longer term proper validation should be implemented.