opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator
Apache License 2.0

Missing PersistenceVolume settings for bootstrap pod #897

Open evheniyt opened 2 weeks ago

evheniyt commented 2 weeks ago
I have also experienced unstable cluster bootstrap. I have fully recreated the cluster multiple times, and periodically it got stuck bootstrapping the second node.

Eventually, I found a correlation between this issue and the recreation of the bootstrap pod. We are using Karpenter, and sometimes during the bootstrap process it decides to move the bootstrap pod to another node. When that happens, cluster creation gets stuck with this error:

opensearch [2024-10-29T06:38:10,310][WARN ][o.o.c.c.Coordinator      ] [opensearch-primary-bootstrap-0] failed to validate incoming join request from node [{opensearch-primary-nodes-0}{9zZmg5EGRpidHf_0OwLUyA}{kV9e6qUTSsmvj1lUP-2QjA}{opensearch-primary-nodes-0}{10.152.42.19:9300}{dm}{shard_indexing_pressure_enabled=true}]
opensearch org.opensearch.transport.RemoteTransportException: [opensearch-primary-nodes-0][10.152.42.19:9300][internal:cluster/coordination/join/validate_compressed]
opensearch Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid 7QJiU55FRcWvBidZD_MF6A than local cluster uuid U4a0ix4hTwCvij0JF9qoEw, rejecting

I believe this is caused by the bootstrap pod not using a persistent disk: if it is restarted, it gets a new cluster UUID that does not match the UUID on node-0.
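
To illustrate the suspected mechanism, here is a toy Python model. It is not OpenSearch or operator code, only a sketch under the assumption that a node stores its cluster UUID in its data directory and rejects cluster state carrying a different one. The names `Disk`, `Node`, `restart` and `simulate` are made up for this example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


class CoordinationStateRejected(Exception):
    """Stand-in for OpenSearch's CoordinationStateRejectedException."""


@dataclass
class Disk:
    """Toy data directory; it only remembers the committed cluster UUID."""
    cluster_uuid: Optional[str] = None


@dataclass
class Node:
    name: str
    disk: Disk = field(default_factory=Disk)

    def bootstrap(self) -> str:
        # Assumption: a node with an empty data directory forms a new cluster.
        if self.disk.cluster_uuid is None:
            self.disk.cluster_uuid = uuid.uuid4().hex
        return self.disk.cluster_uuid

    def validate_join(self, cluster_uuid: str) -> None:
        # A node that already committed to a cluster rejects a different UUID.
        local = self.disk.cluster_uuid
        if local is not None and local != cluster_uuid:
            raise CoordinationStateRejected(
                f"cluster uuid {cluster_uuid} differs from local {local}"
            )
        self.disk.cluster_uuid = cluster_uuid


def restart(node: Node, persistent: bool) -> Node:
    # A rescheduled pod keeps its disk only if a persistent volume backs it.
    return Node(node.name, node.disk if persistent else Disk())


def simulate(persistent: bool) -> bool:
    """Return True if nodes-0 can still join after the bootstrap pod moves."""
    bootstrap = Node("bootstrap-0")
    nodes0 = Node("nodes-0")
    nodes0.validate_join(bootstrap.bootstrap())  # first join succeeds
    bootstrap = restart(bootstrap, persistent)
    try:
        nodes0.validate_join(bootstrap.bootstrap())
        return True
    except CoordinationStateRejected:
        return False
```

In this model `simulate(persistent=False)` ends in a rejection, which mirrors the error above, while `simulate(persistent=True)` lets the join proceed because the restarted bootstrap pod keeps the original UUID.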

Originally posted by @evheniyt in https://github.com/opensearch-project/opensearch-k8s-operator/issues/811#issuecomment-2443349818

evheniyt commented 2 weeks ago

@swoehrl-mw @prudhvigodithi I want to add support of PV for the bootstrap pod, WDYT?

swoehrl-mw commented 1 week ago

I want to add support of PV for the bootstrap pod, WDYT?

@evheniyt Fine for me. I think the bootstrap pod being restarted was not a scenario ever considered as it is only running for a few minutes. IMO there are no reasons against having a PV for the pod, but it should be cleaned up afterwards.