opensearch-project / helm-charts

:wheel_of_dharma: A community repository for Helm Charts of OpenSearch Project.
https://opensearch.org/docs/latest/opensearch/install/helm/

[BUG][OpenSearch] Cluster-Manager discovery not working #499

Closed felix185 closed 10 months ago

felix185 commented 10 months ago

Describe the bug
I tried to deploy a simple OpenSearch cluster with the provided Helm charts on our Kubernetes cluster (1.26) with Helm (3.13.1). As soon as I increase the number of replicas for the cluster-manager/master nodes from 1 to 2 or 3, the cluster does not start successfully. I had to make a few adjustments in the values.yaml (i.e. providing the URL of our private image registry and a corresponding image pull secret); all other values are left at their defaults.

To Reproduce
Steps to reproduce the behavior (a sketch of the values changes from steps 3 and 4 follows the list):

  1. clone this repository
  2. open values.yaml from charts/opensearch
  3. change global.registry to private registry
  4. add image pull secret to imagePullSecrets
  5. deploy via helm upgrade opensearch ./charts/opensearch --install
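
For reference, steps 3 to 5 can be sketched with a small values override file; the registry URL and secret name below are placeholders, and the keys follow the chart values named in the steps above:

```bash
# Minimal sketch of the values.yaml changes from steps 3-4
# (registry URL and secret name are placeholders).
cat > private-registry.yaml <<'EOF'
global:
  registry: registry.example.com        # step 3: private registry
imagePullSecrets:
  - name: my-registry-secret            # step 4: image pull secret
EOF

# step 5: deploy with the override applied
helm upgrade opensearch ./charts/opensearch --install -f private-registry.yaml
```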

Expected behavior
OpenSearch starts without errors, and the request to verify the OpenSearch installation as described here succeeds.
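
(The verification request from the linked docs is roughly the following; a minimal sketch assuming the chart's default service name, the bundled demo admin credentials, and the default HTTPS port, any of which may differ in other setups.)

```bash
# Forward the REST port of the chart's default service to localhost
# (service name, port, and demo credentials are assumptions).
kubectl port-forward svc/opensearch-cluster-master 9200:9200 &

# Verify the installation; -k skips verification of the demo TLS certificate
curl -k -u admin:admin https://localhost:9200
```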

Chart Name
OpenSearch


Host/Environment:

  * Kubernetes: 1.26
  * Helm: 3.13.1

Additional context
If I run the Helm chart with replicas set to 1, everything is set up as I would expect. But as soon as I increase the number of replicas to 2 or 3 (3 is the default from the initial clone), I get the following exception:

org.opensearch.discovery.ClusterManagerNotDiscoveredException: null
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.11.0.jar:2.11.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-23T13:39:57,962][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-cluster-master-0] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: ClusterManagerNotDiscoveredException[null]
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:184) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.11.0.jar:2.11.0]
    at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.11.0.jar:2.11.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) ~[opensearch-2.11.0.jar:2.11.0]
Caused by: org.opensearch.discovery.ClusterManagerNotDiscoveredException
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
uncaught exception in thread [main]
ClusterManagerNotDiscoveredException[null]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350)
    at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
    at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
    at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

With trace logging turned on, I'm also seeing the following:

[2023-11-24T08:11:40,318][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] startProbe(192.168.128.139:9300) not probing local node
[2023-11-24T08:11:40,319][TRACE][o.o.d.SeedHostsResolver  ] [opensearch-cluster-master-0] resolved host [opensearch-cluster-master-headless] to [192.168.128.139:9300, 192.168.129.236:9300]
[2023-11-24T08:11:40,319][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] probing resolved transport addresses [192.168.129.236:9300]
[2023-11-24T08:11:40,350][DEBUG][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] Peer{transportAddress=192.168.129.236:9300, discoveryNode=null, peersRequestInFlight=false} connection failed
org.opensearch.transport.ConnectTransportException: [][192.168.129.236:9300] connect_timeout[3s]
    at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1083) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-24T08:11:41,319][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] probing cluster-manager nodes from cluster state: nodes: 
   {opensearch-cluster-master-0}{6ln17rDKRuS80Z40Rmt8Og}{ibQZHrlmQ_SFaIIVNythJQ}{192.168.128.139}{192.168.128.139:9300}{dimr}{shard_indexing_pressure_enabled=true}, local

It seems like each pod can find itself as a cluster-manager, but as soon as it has to peer with/discover the other cluster-manager pods, it cannot find them. If I exec into a pod and try to curl the IP of one of the other pods, I also get a timeout.
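
For anyone hitting the same symptom, two quick checks under the same assumptions (service and pod names are the chart defaults visible in the logs; the peer IP is taken from the trace above):

```bash
# List the pod IPs that back the headless discovery service,
# i.e. what SeedHostsResolver should resolve
kubectl get endpoints opensearch-cluster-master-headless -o wide

# From one cluster-manager pod, probe a peer's transport port (9300);
# curl's telnet mode just opens a raw TCP connection
kubectl exec opensearch-cluster-master-0 -- \
  curl -sv -m 5 telnet://192.168.129.236:9300
```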

I'm working in a clean namespace, so there is no NetworkPolicy deployed. The only thing deployed apart from the Helm chart is the secret used to pull the images from the private registry.
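
To double-check that assumption, NetworkPolicies in the namespace can be listed directly (the namespace name is a placeholder):

```bash
# An empty result confirms no NetworkPolicy is filtering pod-to-pod traffic
kubectl get networkpolicy -n my-namespace
```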

felix185 commented 10 months ago

Closing this issue, as it has nothing to do with the Helm charts but with the network configuration of the managed Kubernetes cluster. Sorry for any inconvenience caused.