opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator
Apache License 2.0

[BUG] Rolling restart make all data node pool restart at same time #738

Open alantang888 opened 8 months ago

alantang888 commented 8 months ago

What is the bug?

I have set up allocation awareness across different AWS AZs: I created 3 data node pools, each with its AZ name in node.attr.az, e.g. node pool data-a with us-east-1a, node pool data-b with us-east-1b, node pool data-d with us-east-1d...

When I change some cluster config, all data node pools restart at the same time, which causes the cluster status to turn red.

How can one reproduce the bug?

Create a cluster with 3 data node pools, each in a different AZ with 1 replica, and set node.attr.az on each. When the cluster status is green, make a change to the cluster that triggers a rolling restart.
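For reference, a minimal reproducer probably only needs the awareness-related settings. The following is a trimmed sketch (AZ and pool names assumed, mirroring the full config further down in this issue), not a complete manifest:

```yaml
# Sketch only: the cluster-wide awareness attribute plus a per-pool node.attr value.
# Repeat the data pool for each AZ (data-b/us-east-1b, data-d/us-east-1d).
general:
  additionalConfig:
    cluster.routing.allocation.awareness.attributes: az
nodePools:
  - component: data-a
    replicas: 1
    roles:
    - data
    additionalConfig:
      node.attr.az: us-east-1a
```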

What is the expected behavior?

The docs mention: "The Operator will then perform a rolling upgrade and restart the nodes one-by-one, waiting after each node for the cluster to stabilize and have a green cluster status." So I expect those node pools to restart one by one, e.g.: data-a triggers a restart, and the cluster status turns yellow. When the cluster is back to green, the operator restarts the next pod in data-a (if any). After that it restarts data-b, and so on until all data node pools are restarted. (The restart order of the node pools is not important.)
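The expected one-pod-at-a-time behavior described above can be sketched as a simple loop. This is a hypothetical illustration with made-up helper names, not the operator's actual code:

```python
def rolling_restart(node_pools, restart_pod, wait_for_green):
    """Restart pods one at a time, waiting for green health between each.

    node_pools: mapping of pool name -> list of pod names (hypothetical shape).
    restart_pod: callable that restarts a single pod.
    wait_for_green: callable that blocks until cluster health is green.
    Returns the order in which pods were restarted, for illustration.
    """
    order = []
    for pool, pods in node_pools.items():
        for pod in pods:
            restart_pod(pod)   # restart exactly one pod...
            wait_for_green()   # ...then wait for green before touching the next
            order.append(pod)
    return order
```

The bug report is that the operator instead restarts pods from all data pools concurrently, so with one shard replica per AZ the cluster loses all copies of some shards and turns red.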

What is your host/environment?

- EKS 1.28
- OpenSearch Operator 2.5.1
- OpenSearch 2.8.0

Do you have any screenshots?

(Two screenshots attached: "Screenshot 2024-02-23 at 11 44 01" and "Screenshot 2024-02-23 at 11 40 30".)

Do you have any additional context?

This is my cluster config (dashboards section removed, which should not be related):

apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch-logging
spec:
  general:
    httpPort: 9200
    version: 2.8.0
    serviceName: opensearch-logging
    setVMMaxMapCount: true
    pluginsList: ["repository-s3"]
    additionalConfig:
      cluster.routing.allocation.awareness.attributes: az
    monitoring:
      enable: true
      scrapeInterval: 30s
      monitoringUserSecret: metrics-exporter
      tlsConfig: # Optional, use this to override the tlsConfig of the generated ServiceMonitor, only the following provided options can be set currently
        insecureSkipVerify: true
    keystore:
    - secret:
        name: opensearch-aws-access
      keyMappings:
        # Renames key AWS_ACCESS_KEY_ID in secret to s3.client.default.access_key in keystore
        AWS_ACCESS_KEY_ID: s3.client.default.access_key
        AWS_SECRET_ACCESS_KEY: s3.client.default.secret_key

  nodePools:
  - component: cluster-manager
    additionalConfig:
      prometheus.indices: "false"
    diskSize: 10Gi
    jvm: -Xmx1024M -Xms1024M
    replicas: 3
    roles:
    - cluster_manager
    resources:
      requests:
        memory: "2Gi"
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "500m"
    nodeSelector:
      node_pool: infra
    tolerations:
    - effect: NoSchedule
      key: role
      operator: Equal
      value: infra
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: az
          labelSelector:
            matchLabels:
              # BEWARE: it will change to `opensearch.org/` in the future. https://github.com/opensearch-project/opensearch-k8s-operator/issues/664
              opster.io/opensearch-cluster: opensearch-logging
              opster.io/opensearch-nodepool: master
  # When modifying the data nodes, normally only this pool needs to be edited,
  # unless you need to modify label-related settings.
  - &data-node
    component: data-a
    additionalConfig:
      node.attr.az: us-east-1a
      prometheus.indices: "false"
    diskSize: 1000Gi
    jvm: -Xms6g -Xmx6g
    replicas: 1
    roles:
    - data
    resources:
      requests:
        cpu: "2"
        memory: 10Gi
      limits:
        cpu: "4"
        memory: 12Gi
    nodeSelector:
      node_pool: infra
      az: us-east-1a
    tolerations:
    - effect: NoSchedule
      key: role
      operator: Equal
      value: infra
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: opensearch-logging
              opster.io/opensearch-nodepool: data-a
  - <<: *data-node
    component: data-b
    additionalConfig:
      node.attr.az: us-east-1b
      prometheus.indices: "false"
    nodeSelector:
      node_pool: infra
      az: us-east-1b
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: opensearch-logging
              opster.io/opensearch-nodepool: data-b
  - <<: *data-node
    component: data-d
    additionalConfig:
      node.attr.az: us-east-1d
      prometheus.indices: "false"
    nodeSelector:
      node_pool: infra
      az: us-east-1d
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: opensearch-logging
              opster.io/opensearch-nodepool: data-d
  - component: client
    additionalConfig:
      prometheus.indices: "false"
    persistence:
      emptyDir: {}
    jvm: -Xms3g -Xmx3g
    replicas: 2
    # An empty roles list does not work, so set it to ingest,
    # even though we are not using ingest pipelines :P
    roles:
    - ingest
    resources:
      requests:
        cpu: 1500m
        memory: 6Gi
    nodeSelector:
      node_pool: infra
    tolerations:
    - effect: NoSchedule
      key: role
      operator: Equal
      value: infra
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: opensearch-logging
              opster.io/opensearch-nodepool: client
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: az
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          opster.io/opensearch-cluster: opensearch-logging
          opster.io/opensearch-nodepool: client
salyh commented 7 months ago

Thank you for your contribution. Sorry for replying late, but is there any chance you can provide logs for this issue or a minimalistic reproducer (because trying to reproduce this in AWS might be hard)?

salyh commented 7 months ago

cc @idanl21 @dbason @swoehrl-mw @prudhvigodithi @jochenkressin @pchmielnik

alantang888 commented 7 months ago

I just built a lab environment. After the cluster was green and all nodes were running, I modified spec.general.additionalConfig (in this case indices.query.bool.max_clause_count) to trigger a rolling restart of the cluster.

I started capturing logs before applying the change and stopped when all nodes had restarted and the cluster had returned to green. opensearch-operator.txt