mleklund opened this issue 2 years ago
Hey @mleklund, can you please share your cluster yaml file? Also, are you running all of the nodes on the same spot?
No, the pods were all on separate nodes, and even in separate AZs.
Nothing earth-shattering here:
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: content
  namespace: opensearch
spec:
  general:
    version: 2.2.1
    httpPort: 9200
    vendor: opensearch
    serviceName: content
    pluginsList: ["repository-s3", "https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.2.1.0/prometheus-exporter-2.2.1.0.zip"]
    setVMMaxMapCount: true
  dashboards:
    version: 2.2.1
    enable: false
    tls:
      enable: true
      generate: true
    opensearchCredentialsSecret:
      name: content-credentials
    replicas: 1
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
  confMgmt:
    smartScaler: false
  security:
    config: # Everything related to the securityconfig
      securityConfigSecret:
        name: content-securityconfig
      adminCredentialsSecret:
        name: content-credentials
    # TLS is required; if you do not let the operator generate certificates, it uses the pre-defined certs from the docker image.
    tls:
      transport:
        generate: true # Have the operator generate and sign certificates
        perNode: false
      http:
        generate: true
  nodePools:
    - component: manager
      replicas: 3
      diskSize: "1Gi"
      resources:
        requests:
          memory: "1Gi"
          cpu: "550m"
      roles:
        - cluster_manager
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: content
              opster.io/opensearch-nodepool: manager
        - maxSkew: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: content
              opster.io/opensearch-nodepool: manager
    - component: data
      # jvm: -Xmx8G -Xms8G
      replicas: 6
      diskSize: "5Gi"
      # diskSize: "20Gi"
      resources:
        requests:
          memory: "1Gi"
          cpu: "550m"
          # memory: "12Gi"
          # cpu: "3.5"
      roles:
        - "data"
      nodeSelector:
        nodeType: opensearch
      tolerations:
        - key: "nodeType"
          operator: "Equal"
          value: "opensearch"
          effect: "NoSchedule"
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: content
              opster.io/opensearch-nodepool: data
        - maxSkew: 1
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              opster.io/opensearch-cluster: content
              opster.io/opensearch-nodepool: data
After the new node joins the cluster and the pod is assigned to it, the StatefulSet appears to scale to zero and then scale back up.
Hi,
I have been able to recreate this just by deleting a pod using kubectl. Sorry, this may or may not be helpful.
Thanks
I'm also encountering this issue. I'm using Karpenter, and when a spot termination request is initiated, the following is logged by the Karpenter controller:
2023-02-22T09:14:24.297Z INFO controller.termination cordoned node {"commit": "beb0a64-dirty", "node": "ip-10-0-166-9.ap-southeast-2.compute.internal"}
2023-02-22T09:14:24.300Z INFO controller.interruption deleted node from interruption message {"commit": "beb0a64-dirty", "queue": "KarpenterInterruptionQueue", "messageKind": "SpotInterruptionKind", "node": "ip-10-0-166-9.ap-southeast-2.compute.internal", "action": "CordonAndDrain"}
2023-02-22T09:14:55.606Z INFO controller.node added TTL to empty node {"commit": "beb0a64-dirty", "node": "ip-10-0-151-29.ap-southeast-2.compute.internal"}
2023-02-22T09:14:55.687Z INFO controller.node removed emptiness TTL from node {"commit": "beb0a64-dirty", "node": "ip-10-0-151-29.ap-southeast-2.compute.internal"}
2023-02-22T09:14:55.993Z INFO controller.node added TTL to empty node {"commit": "beb0a64-dirty", "node": "ip-10-0-105-181.ap-southeast-2.compute.internal"}
2023-02-22T09:14:56.785Z INFO controller.termination deleted node {"commit": "beb0a64-dirty", "node": "ip-10-0-166-9.ap-southeast-2.compute.internal"}
2023-02-22T09:15:00.502Z INFO controller.node removed emptiness TTL from node {"commit": "beb0a64-dirty", "node": "ip-10-0-105-181.ap-southeast-2.compute.internal"}
2023-02-22T09:15:01.543Z INFO controller.provisioner pod default/wazuh-indexer-nodes-2 has a preferred Anti-Affinity which can prevent consolidation {"commit": "beb0a64-dirty"}
2023-02-22T09:15:01.585Z INFO controller.provisioner found provisionable pod(s) {"commit": "beb0a64-dirty", "pods": 1}
2023-02-22T09:15:01.585Z INFO controller.provisioner computed new node(s) to fit pod(s) {"commit": "beb0a64-dirty", "nodes": 1, "pods": 1}
2023-02-22T09:15:01.586Z INFO controller.provisioner launching machine with 1 pods requesting {"cpu":"655m","memory":"2168Mi","pods":"4"} from types c6g.16xlarge, r6gd.2xlarge, m6gd.xlarge, m6gd.12xlarge, c6gn.xlarge and 53 other(s) {"commit": "beb0a64-dirty", "provisioner": "spot-provisioner-arm64"}
2023-02-22T09:15:06.485Z INFO controller.provisioner.cloudprovider launched new instance {"commit": "beb0a64-dirty", "provisioner": "spot-provisioner-arm64", "id": "i-04d723133149ebfdf", "hostname": "ip-10-0-168-220.ap-southeast-2.compute.internal", "instance-type": "m6g.medium", "zone": "ap-southeast-2c", "capacity-type": "spot"}
2023-02-22T09:17:52.392Z DEBUG controller.aws deleted launch template {"commit": "beb0a64-dirty"}
You can see from the logs that Karpenter detects the spot termination request, cordons the node, and deletes the interruption message. You can also see my other nodes get marked as 'empty' and the emptiness TTL is set, because all of the OpenSearch pods in my StatefulSet are terminated and the nodes become empty. Immediately after the pods terminate they restart, so the emptiness TTL is removed because the nodes are no longer empty. Checking my pod status directly also shows that they terminate and launch again. Not sure why this is happening.
It appears to be fixed on main, following the logic added by #567.
Essentially, the current code in 2.3.1 starts a rolling restart when you delete any pod of a given StatefulSet, even if the revision has not changed. This happens when draining nodes; in my case, during AKS Kubernetes or OS image upgrades.
This would cause all pods with ordinal numbers lower than the deleted pod to also be terminated simultaneously.
Because the StatefulSet revision does not change when simply deleting a pod or draining a node, a rolling restart should not be performed. That is the new behavior after #567, specifically here: https://github.com/Opster/opensearch-k8s-operator/blob/19c9924624eb4dd1067959f8f757179605e51eeb/opensearch-operator/pkg/helpers/helpers.go#L427
This could probably be enhanced further by not triggering any rolling restart events when the StatefulSet revision has not changed.
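For illustration only, here is a minimal Go sketch of the kind of guard #567 introduces; it is not the operator's actual code (that lives in pkg/helpers/helpers.go linked above), and the function name podNeedsRollingUpdate and the revision values are made up. The idea: a pod that was merely deleted or drained still carries the StatefulSet's current update revision in its controller-revision-hash label, so recreating it should not trigger a rolling restart.

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// podNeedsRollingUpdate reports whether a pod belonging to the given
// StatefulSet is running an outdated revision. Hypothetical helper, not the
// operator's real implementation: it only illustrates comparing the pod's
// controller revision against the StatefulSet's target revision.
func podNeedsRollingUpdate(sts *appsv1.StatefulSet, pod *corev1.Pod) bool {
	// Every StatefulSet pod is labelled with the controller revision it was
	// created from.
	podRevision := pod.Labels["controller-revision-hash"]

	// If the pod's revision already matches the StatefulSet's update
	// revision, the spec has not changed and no restart is needed.
	return podRevision != sts.Status.UpdateRevision
}

func main() {
	// Example revision value is invented for the demo.
	sts := &appsv1.StatefulSet{}
	sts.Status.UpdateRevision = "content-data-7c9f8d6b5"

	pod := &corev1.Pod{}
	pod.Labels = map[string]string{"controller-revision-hash": "content-data-7c9f8d6b5"}

	// A rescheduled pod with an unchanged revision: no rolling restart.
	fmt.Println(podNeedsRollingUpdate(sts, pod)) // prints "false"
}

Under this assumption, a pod deleted by kubectl or evicted by a node drain comes back with the same revision hash, so the guard returns false and only a genuine spec change (new update revision) would start a rolling restart.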
I have been toying with OpenSearch on spot nodes. When a spot k8s node gets recalled, the OpenSearch node gets rescheduled as expected. What is unusual is that a rolling restart gets issued by the controller, and any OpenSearch nodes with an index lower than the rescheduled node are also restarted. I expected the OpenSearch node to get rescheduled and go about its normal business instead.
This happened in rapid succession, where the k8s node (on EKS) containing content-data-3 was issued an interruption request. I have tried turning off smartScaler and setting a PodDisruptionBudget, with no success.