opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator

[BUG] OpenSearch operator panics and crashes when adding an OpenSearchISMPolicy #801

Closed nilushancosta closed 1 month ago

nilushancosta commented 1 month ago

What is the bug?

When an OpenSearchISMPolicy is added while the OpenSearch cluster is still being created, the controller panics, resulting in a container crash.

2024-05-06T18:19:54.202Z    INFO    Reconciling OpensearchISMPolicy {"controller": "opensearchismpolicy", "controllerGroup": "opensearch.opster.io", "controllerKind": "OpenSearchISMPolicy", "OpenSearchISMPolicy": {"name":"sample-policy","namespace":"test"}, "namespace": "test", "name": "sample-policy", "reconcileID": "adc1b967-662a-42d0-9c17-95e048ad0ad6", "tenant": {"name":"sample-policy","namespace":"test"}}
2024-05-06T18:19:54.279Z    DEBUG   events  error creating opensearch client    {"type": "Warning", "object": {"kind":"OpenSearchISMPolicy","namespace":"test","name":"sample-policy","uid":"abab26b9-2ca0-4882-a167-4cf37994dcb9","apiVersion":"opensearch.opster.io/v1","resourceVersion":"463314"}, "reason": "OpensearchError"}
2024-05-06T18:19:54.284Z    INFO    Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference    {"controller": "opensearchismpolicy", "controllerGroup": "opensearch.opster.io", "controllerKind": "OpenSearchISMPolicy", "OpenSearchISMPolicy": {"name":"sample-policy","namespace":"test"}, "namespace": "test", "name": "sample-policy", "reconcileID": "adc1b967-662a-42d0-9c17-95e048ad0ad6"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x11f2d64]

goroutine 442 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:115 +0x1a4
panic({0x141dec0?, 0x27073d0?})
    /usr/local/go/src/runtime/panic.go:770 +0x124
github.com/Opster/opensearch-k8s-operator/opensearch-operator/opensearch-gateway/services.(*OsClusterClient).GetISMConfig(0x0, {0x18fcd30, 0x4000e77dd0}, {0x4000c5a410?, 0x0?})
    /workspace/opensearch-gateway/services/os_client.go:314 +0x44
github.com/Opster/opensearch-k8s-operator/opensearch-operator/opensearch-gateway/services.PolicyExists({0x18fcd30?, 0x4000e77dd0?}, 0x4001436700?, {0x4000c5a410?, 0x7?})
    /workspace/opensearch-gateway/services/os_ism_service.go:31 +0x4c
github.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*IsmPolicyReconciler).Reconcile(0x40008d5d00)
    /workspace/pkg/reconcilers/ismpolicy.go:159 +0x72c
github.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpensearchISMPolicyReconciler).Reconcile(0x400051abe0, {0x18fcd30, 0x4000e77dd0}, {{{0x4001558638, 0x4}, {0x4001558640, 0xd}}})
    /workspace/controllers/opensearchism_controller.go:53 +0x2ec
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x18fcd30?, {0x18fcd30?, 0x4000e77dd0?}, {{{0x4001558638?, 0x1348fc0?}, {0x4001558640?, 0x4000677e08?}}})
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118 +0x8c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x400028a640, {0x18fcd68, 0x400051b630}, {0x149b600, 0x40002689e0})
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314 +0x294
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x400028a640, {0x18fcd68, 0x400051b630})
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265 +0x198
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226 +0x74
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 129
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:222 +0x404

The operator pod will crash several times and then continue running.
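From the stack trace, the panic comes from `PolicyExists` calling `GetISMConfig` on a nil `OsClusterClient` (note the `0x0` receiver at `os_client.go:314`): client creation fails while the cluster is still starting (the "error creating opensearch client" event just above), but the reconciler carries on with the nil client. Below is a minimal Go sketch of that failure mode, using hypothetical stand-in types; the struct fields and signatures are illustrative assumptions, not the operator's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for the operator's OsClusterClient; the real type lives in
// opensearch-gateway/services and looks different.
type OsClusterClient struct {
	endpoint string
}

// GetISMConfig dereferences its receiver, so calling it on a nil *OsClusterClient
// panics with "invalid memory address or nil pointer dereference".
func (c *OsClusterClient) GetISMConfig(policyID string) string {
	return fmt.Sprintf("GET %s/_plugins/_ism/policies/%s", c.endpoint, policyID)
}

// newClient simulates client creation failing because the cluster is not reachable yet.
func newClient() (*OsClusterClient, error) {
	return nil, errors.New("error creating opensearch client: cluster not reachable")
}

func main() {
	client, err := newClient()
	if err != nil {
		// Without a guard like this, the nil client is used anyway and the
		// next call reproduces the crash seen in the operator logs.
		fmt.Println("client creation failed, should stop and retry later:", err)
		return
	}
	fmt.Println(client.GetISMConfig("sample-policy"))
}
```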

How can one reproduce the bug?

  1. Install the operator

    helm install opensearch-operator opensearch-operator/opensearch-operator --version 2.6.0 -n test
  2. Create an OpenSearch cluster using kubectl apply. This is the cluster definition I used

    apiVersion: opensearch.opster.io/v1
    kind: OpenSearchCluster
    metadata:
      name: my-first-cluster
      namespace: test
    spec:
      general:
        serviceName: my-first-cluster
        version: 2.11.1
      dashboards:
        enable: false
        version: 2.11.1
        replicas: 0
      nodePools:
        - component: nodes
          replicas: 3
          diskSize: "5Gi"
          nodeSelector:
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          roles:
            - "cluster_manager"
            - "data"
  3. Apply the following ISM policy using kubectl apply

    apiVersion: opensearch.opster.io/v1
    kind: OpenSearchISMPolicy
    metadata:
      name: sample-policy
      namespace: test
    spec:
      opensearchCluster:
        name: my-first-cluster
      description: Sample policy
      policyId: sample-policy
      defaultState: hot
      states:
        - name: hot
          actions:
            - replicaCount:
                numberOfReplicas: 4
          transitions:
            - stateName: warm
              conditions:
                minIndexAge: "10d"
        - name: warm
          actions:
            - replicaCount:
                numberOfReplicas: 2
          transitions:
            - stateName: delete
              conditions:
                minIndexAge: "30d"
        - name: delete
          actions:
            - delete: {}

    At this point, the operator pod exits with the panic shown above.

What is the expected behavior?

Expected the ISM policy to be added without any issue.

What is your host/environment?

- Kubernetes 1.25
- OpenSearch 2.11.1
- OpenSearch operator 2.6.0

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

If I do step 2 above, wait for the OpenSearch cluster to finish being created (i.e. the 3 nodes reach a running state and the cluster health is green), and then do step 3 (add the ISM policy), the panic does not happen. But if I do step 3 immediately after step 2, the operator panics and crashes several times.

However, when using deployment pipelines, we cannot control the delay between applying resources.

swoehrl-mw commented 1 month ago

Hi @nilushancosta. Thanks for reporting this. This is clearly a bug and the operator should just wait if the cluster is not yet correctly reachable.
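A minimal sketch of that behaviour, assuming a hypothetical `newOpenSearchClient` helper: when client creation fails because the cluster is not reachable yet, the reconciler returns a requeue result so controller-runtime retries later instead of dereferencing a nil client. This is illustrative only, not the operator's actual reconciler code.

```go
package reconcilers

import (
	"context"
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

type ismPolicyReconciler struct{}

// newOpenSearchClient is a hypothetical stand-in for the operator's client creation,
// which can fail while the cluster pods are still starting up.
func newOpenSearchClient(ctx context.Context) (*struct{}, error) {
	return nil, errors.New("cluster not reachable yet")
}

func (r *ismPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	osClient, err := newOpenSearchClient(ctx)
	if err != nil {
		// Instead of continuing with a nil client (the panic in this issue),
		// back off and let controller-runtime retry once the cluster is reachable.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}

	_ = osClient // proceed with the ISM policy reconciliation using the real client
	return ctrl.Result{}, nil
}
```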