opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator

Operator terminates nodes via restart after k8s node removed via spot request termination. #312

Open mleklund opened 2 years ago

mleklund commented 2 years ago

I have been toying with OpenSearch on spot nodes. When a spot k8s node gets reclaimed, the OpenSearch node gets rescheduled as expected. What is unusual is that the controller then issues a rolling restart, and any OpenSearch nodes with an index lower than the rescheduled node are also restarted. I expected the OpenSearch node to be rescheduled and go about its normal business instead.

This happened in rapid succession:

content-data-3                                            1/1     Terminating   0          107m
content-data-0                                            1/1     Terminating   0          95m
content-data-1                                            1/1     Terminating   0          93m
content-data-2                                            1/1     Terminating   0          92m

Here, the k8s node (on EKS) hosting content-data-3 was the one issued the interruption request.

I have tried turning off smartScaler and setting a PodDisruptionBudget, with no success.
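
For context, a PodDisruptionBudget of the kind described here might look like the sketch below. The selector labels match the node pool labels used in the cluster spec further down this thread; the name and the maxUnavailable value are assumptions, not necessarily what was actually applied:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: content-data-pdb        # hypothetical name
  namespace: opensearch
spec:
  maxUnavailable: 1             # assumed value: allow at most one data pod to be disrupted at a time
  selector:
    matchLabels:
      opster.io/opensearch-cluster: content
      opster.io/opensearch-nodepool: data

Note that a PDB only constrains voluntary evictions (such as node drains going through the Eviction API); it does not block pod deletions performed directly by a controller or by a StatefulSet rolling update, which would be consistent with it not helping here.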

prudhvigodithi commented 2 years ago

Hey @mleklund can you share your cluster yaml file?

idanl21 commented 2 years ago

Hey @mleklund, can you please share your cluster yaml file? Also, are you running all of the nodes on the same spot instance?

mleklund commented 2 years ago

No, the pods were all on separate nodes and even in separate AZs.

Nothing earth-shattering here:

apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: content
  namespace: opensearch
spec:
  general:
    version: 2.2.1
    httpPort: 9200
    vendor: opensearch
    serviceName: content
    pluginsList: ["repository-s3", "https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.2.1.0/prometheus-exporter-2.2.1.0.zip"]
    setVMMaxMapCount: true
  dashboards:
    version: 2.2.1
    enable: false
    tls:
      enable: true
      generate: true
    opensearchCredentialsSecret:
      name: content-credentials 
    replicas: 1
    resources:
      requests:
         memory: "1Gi"
         cpu: "500m"
  confMgmt:
    smartScaler: false
  security:
    config:  # Everything related to the securityconfig
      securityConfigSecret:
        name: content-securityconfig 
      adminCredentialsSecret:
        name: content-credentials 
    # TLS is required; if you do not let the operator generate certificates, it uses the pre-defined certs from the docker image.
    tls:
      transport:
        generate: true  # Have the operator generate and sign certificates
        perNode: false
      http:
        generate: true
  nodePools:
    - component: manager
      replicas: 3
      diskSize: "1Gi"
      resources:
        requests:
          memory: "1Gi"
          cpu: "550m"
      roles:
        - cluster_manager
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: content
              opster.io/opensearch-nodepool: manager
        - maxSkew: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: content
              opster.io/opensearch-nodepool: manager
    - component: data
#      jvm: -Xmx8G -Xms8G
      replicas: 6
      diskSize: "5Gi"
#     diskSize: "20Gi"
      resources:
        requests:
          memory: "1Gi"
          cpu: "550m"
#            memory: "12Gi"
#            cpu: "3.5"
      roles:
        - "data"
      nodeSelector:
        nodeType: opensearch
      tolerations:
        - key: "nodeType"
          operator: "Equal"
          value: "opensearch"
          effect: "NoSchedule"
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: content
              opster.io/opensearch-nodepool: data
        - maxSkew: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: content
              opster.io/opensearch-nodepool: data

After the new node joins the cluster and the pod is assigned to it, the StatefulSet appears to scale to zero and then scale back up.

rhys-evans commented 1 year ago

Hi

I have been able to recreate this just by deleting a pod using kubectl.

Sorry, this may or may not be helpful.

Thanks

beejaygee commented 1 year ago

I'm also encountering this issue. I'm using Karpenter, and when a spot termination request is initiated, the following is logged by the Karpenter controller:

2023-02-22T09:14:24.297Z INFO controller.termination cordoned node {"commit": "beb0a64-dirty", "node": "ip-10-0-166-9.ap-southeast-2.compute.internal"}
2023-02-22T09:14:24.300Z INFO controller.interruption deleted node from interruption message {"commit": "beb0a64-dirty", "queue": "KarpenterInterruptionQueue", "messageKind": "SpotInterruptionKind", "node": "ip-10-0-166-9.ap-southeast-2.compute.internal", "action": "CordonAndDrain"}
2023-02-22T09:14:55.606Z INFO controller.node added TTL to empty node {"commit": "beb0a64-dirty", "node": "ip-10-0-151-29.ap-southeast-2.compute.internal"}
2023-02-22T09:14:55.687Z INFO controller.node removed emptiness TTL from node {"commit": "beb0a64-dirty", "node": "ip-10-0-151-29.ap-southeast-2.compute.internal"}
2023-02-22T09:14:55.993Z INFO controller.node added TTL to empty node {"commit": "beb0a64-dirty", "node": "ip-10-0-105-181.ap-southeast-2.compute.internal"}
2023-02-22T09:14:56.785Z INFO controller.termination deleted node {"commit": "beb0a64-dirty", "node": "ip-10-0-166-9.ap-southeast-2.compute.internal"}
2023-02-22T09:15:00.502Z INFO controller.node removed emptiness TTL from node {"commit": "beb0a64-dirty", "node": "ip-10-0-105-181.ap-southeast-2.compute.internal"}
2023-02-22T09:15:01.543Z INFO controller.provisioner pod default/wazuh-indexer-nodes-2 has a preferred Anti-Affinity which can prevent consolidation {"commit": "beb0a64-dirty"}
2023-02-22T09:15:01.585Z INFO controller.provisioner found provisionable pod(s) {"commit": "beb0a64-dirty", "pods": 1}
2023-02-22T09:15:01.585Z INFO controller.provisioner computed new node(s) to fit pod(s) {"commit": "beb0a64-dirty", "nodes": 1, "pods": 1}
2023-02-22T09:15:01.586Z INFO controller.provisioner launching machine with 1 pods requesting {"cpu":"655m","memory":"2168Mi","pods":"4"} from types c6g.16xlarge, r6gd.2xlarge, m6gd.xlarge, m6gd.12xlarge, c6gn.xlarge and 53 other(s) {"commit": "beb0a64-dirty", "provisioner": "spot-provisioner-arm64"}
2023-02-22T09:15:06.485Z INFO controller.provisioner.cloudprovider launched new instance {"commit": "beb0a64-dirty", "provisioner": "spot-provisioner-arm64", "id": "i-04d723133149ebfdf", "hostname": "ip-10-0-168-220.ap-southeast-2.compute.internal", "instance-type": "m6g.medium", "zone": "ap-southeast-2c", "capacity-type": "spot"}
2023-02-22T09:17:52.392Z DEBUG controller.aws deleted launch template {"commit": "beb0a64-dirty"}

You can see from the logs that Karpenter detects the spot termination request, cordons the node, and deletes the interruption message. You can also see that my other nodes get marked as 'empty' and the emptiness TTL is set, because all of the OpenSearch pods in my StatefulSet are terminated and those nodes become empty. Immediately after the pods terminate they restart, so the emptiness TTL is removed because the nodes are no longer empty. Visually checking my pod status also confirms that they terminate and launch again. I'm not sure why this is happening.

Nerodon commented 1 year ago

It appears to be fixed on main, following some logic added by #567.

Essentially, the current code for 2.3.1 starts a rolling restart when you delete any pod of a given StatefulSet even if the revision has not changed, which happens when draining nodes (in my case, during AKS Kubernetes or OS image upgrades).

This would cause all pods with ordinal numbers lower than the deleted pod's to also get terminated simultaneously.

Because the StatefulSet revision does not change when simply deleting a pod or draining a node, a rolling restart should not be performed. This is the new behavior after #567, specifically here: https://github.com/Opster/opensearch-k8s-operator/blob/19c9924624eb4dd1067959f8f757179605e51eeb/opensearch-operator/pkg/helpers/helpers.go#L427

But this could probably be enhanced further by not triggering any rolling restart events at all when StatefulSet revisions don't change, as sketched below.
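
The signal involved is exposed by Kubernetes itself: a StatefulSet reports currentRevision and updateRevision in its status, and each of its pods carries a controller-revision-hash label. When a pod is merely deleted or its node drained, the two revisions stay equal, so a check along these lines can tell a real spec change apart from a simple pod replacement. The sketch below is only an illustration of that idea, not the operator's actual code; the StatefulSet name is inferred from the pod names earlier in the thread and the hash values are made up:

# Hypothetical excerpt of the StatefulSet backing the content-data pool
status:
  currentRevision: content-data-6d5f9c8b7d   # revision the existing pods were created from
  updateRevision: content-data-6d5f9c8b7d    # revision a newly created pod would use
  # currentRevision == updateRevision means there is no pending spec change,
  # so a deleted pod is simply recreated and no rolling restart is needed

# Each pod created from that revision carries a matching label
metadata:
  labels:
    controller-revision-hash: content-data-6d5f9c8b7d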