opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator

[BUG] Increased diskSize is not reflected in the pods filesystem #723

Open dmantas opened 8 months ago

dmantas commented 8 months ago

What is the bug?

When we increase diskSize in the OpensearchCluster spec, the PVCs are resized accordingly. However, the filesystem inside the pods is not resized. Or, to be precise, it might be resized for some of the pods but not others.

We use OpenStack with Cinder volumes (storage class: csi-cinder-sc-delete).

How can one reproduce the bug?

In an already deployed cluster with a nodePool whose diskSize is 5Gi, increase diskSize to 6Gi and apply the manifest. Then increase it to 7Gi and apply the manifest again, so the final nodePools config looks like:

    nodePools:
    - affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: opster.io/opensearch-nodepool
                operator: In
                values:
                - data-nodes
            topologyKey: kubernetes.io/hostname
      component: data-nodes
      diskSize: 7Gi
      replicas: 3
      resources:
        limits:
          cpu: 500m
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 2Gi
      roles:
      - data
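
Each change was applied by re-applying the manifest; a minimal sketch, assuming the spec is saved in a file called opensearch-cluster.yaml (the filename is just an example):

    # after bumping diskSize in the OpensearchCluster spec, re-apply it
    kubectl apply -f opensearch-cluster.yaml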

Check the PVCs; they are 7Gi in size:

data-caas-opensearch-data-nodes-0         Bound    pvc-058d4eda-2800-42ae-9859-f3518db541a9   7Gi        RWO            csi-cinder-sc-delete   5d21h
data-caas-opensearch-data-nodes-1         Bound    pvc-af24413f-2eaf-4743-b234-2fc27c682cbf   7Gi        RWO            csi-cinder-sc-delete   5d20h
data-caas-opensearch-data-nodes-2         Bound    pvc-f270c56a-8ea6-4228-9177-ef4e554dceff   7Gi        RWO            csi-cinder-sc-delete   5d20h

Check the filesystem inside each pod:

Pod-0:

/dev/vdh        5.9G  5.5M  5.9G   1% /usr/share/opensearch/data

Pod-1:

/dev/vdg        5.9G  5.1M  5.9G   1% /usr/share/opensearch/data

Pod-2:

/dev/vdi        4.9G  1.8M  4.9G   1% /usr/share/opensearch/data
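
For reference, the checks above can be reproduced with commands along these lines (pod and mount names are taken from the output above; the namespace is a placeholder):

    # PVC sizes as seen by Kubernetes
    kubectl get pvc -n <namespace>
    # filesystem size as seen from inside a data pod
    kubectl exec -n <namespace> caas-opensearch-data-nodes-0 -- df -h /usr/share/opensearch/data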

So we can see that somehow 2 out of 3 pods picked up the first resize to 6Gi, but not the second resize to 7Gi. The 3rd pod looks as if no resizing took place at all.

I tried deleting the pods, but it didn't help.

What is the expected behavior?

The PVCs should be 7Gi, and when checking the filesystem from inside each pod (df -h), the filesystem should also show 7Gi.

What is your host/environment?

Operator version 2.5.1 (also tested with 2.4.0).

prudhvigodithi commented 6 months ago

[Triage] Thanks @dmantas. The steps implemented as part of the design in https://github.com/opensearch-project/opensearch-k8s-operator/issues/112#issuecomment-1107854259 should result in the new size being reflected inside the pod as well. Can you check whether there are any errors in the operator pod? Thanks. Adding @dbason @swoehrl-mw @jochenkressin @pchmielnik @salyh @bbarani
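
To check, something like the following should surface any errors around the resize; the deployment name below assumes a default install and may differ in your setup:

    # tail the operator controller logs and look for errors during reconciliation
    kubectl logs -n <operator-namespace> deploy/opensearch-operator-controller-manager --tail=200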

dmantas commented 6 months ago

Hi @prudhvigodithi, I followed these steps and they worked, thank you. However, after deleting the StatefulSet it was immediately recreated by the Operator, so I had to temporarily delete the Operator to prevent this. Is there a way to temporarily stop the Operator from reconciling the deployment without deleting it completely? E.g. in a similar situation the Prometheus Operator offers a pause flag for this.
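
(For what it's worth, scaling the Operator deployment down to zero replicas might be a way to pause it without uninstalling; the deployment name and namespace below are assumptions based on a default install:)

    # stop the operator from reconciling, do the manual resize steps, then scale it back up
    kubectl scale deployment opensearch-operator-controller-manager -n <operator-namespace> --replicas=0
    kubectl scale deployment opensearch-operator-controller-manager -n <operator-namespace> --replicas=1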

Other than this manual method, I understand it's not possible to resize volumes automatically, i.e. just by modifying the OpensearchCluster manifest, correct? I did a test today and noticed these errors in the Operator logs:

{"level":"dpanic","ts":"2024-04-02T11:51:56.608Z","msg":"odd number of arguments passed as key-value pairs for logging","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"7cea47f7-e5e4-430c-9d30-bc066f133eeb","ignored key":"9Gi","stacktrace":"github.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*ClusterReconciler).maybeUpdateVolumes\n\t/workspace/pkg/reconcilers/cluster.go:490\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*ClusterReconciler).reconcileNodeStatefulSet\n\t/workspace/pkg/reconcilers/cluster.go:301\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*ClusterReconciler).Reconcile\n\t/workspace/pkg/reconcilers/cluster.go:116\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpenSearchClusterReconciler).reconcilePhaseRunning\n\t/workspace/controllers/opensearchController.go:319\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpenSearchClusterReconciler).Reconcile\n\t/workspace/controllers/opensearchController.go:141\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"}
{"level":"info","ts":"2024-04-02T11:51:56.608Z","msg":"Disk sizes differ for nodePool %s, Current: %s, Desired: %s","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"7cea47f7-e5e4-430c-9d30-bc066f133eeb","data-nodes":"8Gi"}
{"level":"info","ts":"2024-04-02T11:51:56.608Z","msg":"Deleting statefulset while orphaning pods caas-opensearch-data-nodes","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"7cea47f7-e5e4-430c-9d30-bc066f133eeb"}
{"level":"info","ts":"2024-04-02T11:51:56.928Z","msg":"object  is being deleted, backing off","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"7cea47f7-e5e4-430c-9d30-bc066f133eeb","name":"caas-opensearch-data-nodes","namespace":"opensearch-operator-cluster","apiVersion":"apps/v1","kind":"StatefulSet"}
{"level":"error","ts":"2024-04-02T11:51:57.050Z","msg":"Reconciler error","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"7cea47f7-e5e4-430c-9d30-bc066f133eeb","error":"StatefulSet.apps \"caas-opensearch-data-nodes\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226"}
{"level":"info","ts":"2024-04-02T11:51:57.051Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"80ef6069-6650-4532-8636-f3756e9a9f02","cluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"}}
{"level":"info","ts":"2024-04-02T11:51:57.089Z","msg":"Generating certificates","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"80ef6069-6650-4532-8636-f3756e9a9f02","interface":"transport"}
{"level":"info","ts":"2024-04-02T11:51:57.089Z","msg":"Generating certificates","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"80ef6069-6650-4532-8636-f3756e9a9f02","interface":"http"}
{"level":"info","ts":"2024-04-02T11:51:57.210Z","msg":"resource created","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"80ef6069-6650-4532-8636-f3756e9a9f02","name":"caas-opensearch-data-nodes","namespace":"opensearch-operator-cluster","apiVersion":"apps/v1","kind":"StatefulSet"}
{"level":"info","ts":"2024-04-02T11:52:00.592Z","msg":"Starting rolling restart of the StatefulSet caas-opensearch-data-nodes","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"80ef6069-6650-4532-8636-f3756e9a9f02","reconciler":"restart"}
{"level":"info","ts":"2024-04-02T11:52:00.596Z","msg":"Preparing to restart pod caas-opensearch-data-nodes-0","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"caas-opensearch","namespace":"opensearch-operator-cluster"},"namespace":"opensearch-operator-cluster","name":"caas-opensearch","reconcileID":"80ef6069-6650-4532-8636-f3756e9a9f02","reconciler":"restart"}

So it seems the Operator "knows" how to do the resize, but it doesn't work. It also seems that only the first pod in the StatefulSet gets restarted, and even for that one the filesystem inside the pod isn't increased, although all PVCs are. Is there something not working correctly in this implementation?
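
For completeness, a couple of standard Kubernetes checks that can help narrow down where the expansion stalls (names are taken from the listings above):

    # the storage class must allow volume expansion at all
    kubectl get storageclass csi-cinder-sc-delete -o jsonpath='{.allowVolumeExpansion}'
    # a PVC waiting for the kubelet to grow the filesystem shows a FileSystemResizePending condition
    kubectl describe pvc data-caas-opensearch-data-nodes-0 -n opensearch-operator-cluster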

Thanks again for your help; at least I now have a consistent way to do this.