opensearch-project / helm-charts

:wheel_of_dharma: A community repository for Helm Charts of OpenSearch Project.
https://opensearch.org/docs/latest/opensearch/install/helm/

[BUG][OpenSearch] Cluster-Manager discovery not working #499

Closed felix185 closed 10 months ago

felix185 commented 10 months ago

Describe the bug
I tried to deploy a simple OpenSearch cluster with the provided Helm charts on our Kubernetes cluster (1.26) with Helm (3.13.1). As soon as I increase the number of replicas for the cluster-manager/master nodes from 1 to 2 or 3, the cluster does not start successfully. I had to make a few adjustments in the values.yaml (i.e. providing the URL of our private image registry and a corresponding image pull secret); all other values are left at their defaults.

To Reproduce
Steps to reproduce the behavior (a sketch of the values changes from steps 3 and 4 follows the list):

  1. clone this repository
  2. open values.yaml from charts/opensearch
  3. change global.registry to private registry
  4. add image pull secret to imagePullSecrets
  5. deploy via helm upgrade opensearch ./charts/opensearch --install
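
For reference, steps 3 to 5 can be sketched with a small values override file; the registry URL and secret name below are placeholders, and the keys follow the chart values named in the steps above:

```bash
# Minimal sketch of the values.yaml changes from steps 3-4
# (registry URL and secret name are placeholders).
cat > private-registry.yaml <<'EOF'
global:
  registry: registry.example.com        # step 3: private registry
imagePullSecrets:
  - name: my-registry-secret            # step 4: image pull secret
EOF

# step 5: deploy with the override applied
helm upgrade opensearch ./charts/opensearch --install -f private-registry.yaml
```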

Expected behavior
OpenSearch starts without errors, and the request to verify the OpenSearch installation as described here succeeds.
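
(The verification request from the linked docs is roughly the following; a minimal sketch assuming the chart's default service name, the bundled demo admin credentials, and the default HTTPS port, any of which may differ in other setups.)

```bash
# Forward the REST port of the chart's default service to localhost
# (service name, port, and demo credentials are assumptions).
kubectl port-forward svc/opensearch-cluster-master 9200:9200 &

# Verify the installation; -k skips verification of the demo TLS certificate
curl -k -u admin:admin https://localhost:9200
```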

Chart Name
OpenSearch


Host/Environment:

  * Kubernetes: 1.26
  * Helm: 3.13.1

Additional context
If I run the Helm chart with replicas set to 1, everything is set up as I would expect. But as soon as I increase the number of replicas to 2 or 3 (3 is the default from the initial clone), I get the following exception:

org.opensearch.discovery.ClusterManagerNotDiscoveredException: null
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.11.0.jar:2.11.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-23T13:39:57,962][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-cluster-master-0] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: ClusterManagerNotDiscoveredException[null]
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:184) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.11.0.jar:2.11.0]
    at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.11.0.jar:2.11.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) ~[opensearch-2.11.0.jar:2.11.0]
Caused by: org.opensearch.discovery.ClusterManagerNotDiscoveredException
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
uncaught exception in thread [main]
ClusterManagerNotDiscoveredException[null]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350)
    at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
    at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
    at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

With trace logging turned on, I'm also seeing the following:

[2023-11-24T08:11:40,318][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] startProbe(192.168.128.139:9300) not probing local node
[2023-11-24T08:11:40,319][TRACE][o.o.d.SeedHostsResolver  ] [opensearch-cluster-master-0] resolved host [opensearch-cluster-master-headless] to [192.168.128.139:9300, 192.168.129.236:9300]
[2023-11-24T08:11:40,319][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] probing resolved transport addresses [192.168.129.236:9300]
[2023-11-24T08:11:40,350][DEBUG][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] Peer{transportAddress=192.168.129.236:9300, discoveryNode=null, peersRequestInFlight=false} connection failed
org.opensearch.transport.ConnectTransportException: [][192.168.129.236:9300] connect_timeout[3s]
    at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1083) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-24T08:11:41,319][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] probing cluster-manager nodes from cluster state: nodes: 
   {opensearch-cluster-master-0}{6ln17rDKRuS80Z40Rmt8Og}{ibQZHrlmQ_SFaIIVNythJQ}{192.168.128.139}{192.168.128.139:9300}{dimr}{shard_indexing_pressure_enabled=true}, local

It seems like each pod can find itself as a cluster-manager, but as soon as it has to peer with/discover the other cluster-manager pods, it cannot find them. If I exec into a pod and try to curl the IP of one of the other pods, I also get a timeout.
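
For anyone hitting the same symptom, two quick checks under the same assumptions (service and pod names are the chart defaults visible in the logs; the peer IP is taken from the trace above):

```bash
# List the pod IPs that back the headless discovery service,
# i.e. what SeedHostsResolver should resolve
kubectl get endpoints opensearch-cluster-master-headless -o wide

# From one cluster-manager pod, probe a peer's transport port (9300);
# curl's telnet mode just opens a raw TCP connection
kubectl exec opensearch-cluster-master-0 -- \
  curl -sv -m 5 telnet://192.168.129.236:9300
```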

I'm working in a clean namespace, so there is no NetworkPolicy deployed. The only thing deployed apart from the Helm chart is the secret used to pull the images from the private registry.
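
To double-check that assumption, NetworkPolicies in the namespace can be listed directly (the namespace name is a placeholder):

```bash
# An empty result confirms no NetworkPolicy is filtering pod-to-pod traffic
kubectl get networkpolicy -n my-namespace
```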

felix185 commented 10 months ago

Closing this issue, as it has nothing to do with the Helm charts but with the network configuration of the managed Kubernetes cluster. Sorry for any inconvenience caused.