designermonkey opened 1 year ago
I neglected to mention that I changed all references of `opensearch-node1` to `node01`, and all references of `opensearch-node2` to `node02`.
The nodes are failing to form a cluster:
```
[2023-02-16T16:24:20,416][INFO ][o.o.c.c.JoinHelper ] [node01] failed to join {node02}{enQl9djoRA24TYJYOUGMnw}{L3OlmhvHRY2Y1HxH_wibhQ}{10.0.24.6}{10.0.24.6:9300}{dimr}{shard_indexing_pressure_enabled=true} with JoinRequest{sourceNode={node01}{Bqk8Khh8R5GjoDQaF-C-Cg}{brqDhviuRpKIrMGzzWX7Xw}{10.0.0.212}{10.0.0.212:9300}{dimr}{shard_indexing_pressure_enabled=true}, minimumTerm=70, optionalJoin=Optional[Join{term=70, lastAcceptedTerm=68, lastAcceptedVersion=18, sourceNode={node01}{Bqk8Khh8R5GjoDQaF-C-Cg}{brqDhviuRpKIrMGzzWX7Xw}{10.0.0.212}{10.0.0.212:9300}{dimr}{shard_indexing_pressure_enabled=true}, targetNode={node02}{enQl9djoRA24TYJYOUGMnw}{L3OlmhvHRY2Y1HxH_wibhQ}{10.0.24.6}{10.0.24.6:9300}{dimr}{shard_indexing_pressure_enabled=true}}]}
org.opensearch.transport.RemoteTransportException: [node02][10.0.24.6:9300][internal:cluster/coordination/join]
Caused by: org.opensearch.transport.ConnectTransportException: [node01][10.0.0.212:9300] connect_timeout[30s]
    at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1082) ~[opensearch-2.5.0.jar:2.5.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-2.5.0.jar:2.5.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
```
I haven't used Docker Swarm, but a brief investigation shows several configuration options, such as scaling and secure communications, that could conflict with the OpenSearch cluster model. Can you give a few more details about your configuration, compose file, etc.? It seems like we're trying multiple different ways of running multiple containers and having them talk to each other.
Here's a slightly modified compose file; I went right back to basics and made another discovery:
```yaml
---
version: '3'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.5.0
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_master_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data
    ports:
      - 9200:9200
      # - 9600:9600 # required for Performance Analyzer
    networks:
      - opensearch-net
  opensearch-node2:
    image: opensearchproject/opensearch:2.5.0
    container_name: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_master_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.5.0
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch-node1:9200","https://opensearch-node2:9200"]' # must be a string with no spaces when specified as an environment variable
    networks:
      - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:
```
I have discovered something exciting. If I don't bind the ports on `opensearch-node1`, it works fine, but if I uncomment the port binding, it is never able to join the cluster.
I've been experimenting to see if it is Docker networking that is causing the issue, but it isn't: other services in swarm mode can map ports perfectly well and can communicate with other containers on multiple swarm networks.
It's definitely something in the configuration of OpenSearch. I will do a little more investigating today, but I know little about how this all works.
It seems that the first node joins the docker swarm ingress network, while any others always join the service network. I have no idea what to do here.
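One workaround worth trying here (an untested sketch, not something confirmed in this thread) is host-mode port publishing, which bypasses the ingress routing mesh so that publishing a port no longer attaches the task to the ingress network. It requires the long port syntax (compose file format 3.2+):

```yaml
services:
  opensearch-node1:
    ports:
      - target: 9200      # container port
        published: 9200   # published directly on the swarm node, not via the routing mesh
        protocol: tcp
        mode: host        # bypass the ingress network entirely
```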
For reference" https://stackoverflow.com/questions/70141442/elasticsearch-cluster-doesnt-work-on-docker-swarm
After some experimentation, here are my findings:

- `network.publish_host: _eth1_` must be added (on one node).
- `network.publish_host: _eth0_` must be added (on the other node).

This ensures that the instances communicate on the same networks and therefore can see each other. This is highly unreliable, of course, and I'm wondering if there may be a better way of discovering the right network and port configuration inside the containers.
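In compose terms, that finding looks something like the snippet below (a sketch only; which interface is correct depends on how Docker orders the NICs inside each container, so the values here are illustrative):

```yaml
services:
  opensearch-node1:
    environment:
      # Illustrative: advertise the interface on the network shared with the other node
      - network.publish_host=_eth1_
  opensearch-node2:
    environment:
      - network.publish_host=_eth0_
```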
@bbarani can you please help on this?
Hi @designermonkey, we only test the docker-compose file with plain `docker-compose`; we did not test on Docker Swarm.
If you are doing a more complicated setup, you can try our Helm charts repo, which deploys to Kubernetes and is actively contributed to and tested by the community: https://github.com/opensearch-project/helm-charts
Adding @jeffh-aws to take a look at the possible options with Docker Swarm. Thanks.
@designermonkey I was having fun getting OpenSearch to work on Docker Swarm today. Your tip about the `network.publish_host` setting helped, though I ended up with `network.host: "_eth0_"`, where `eth0` was the NIC with the IP of the Docker overlay network I was putting my containers on. So thanks for posting this issue!
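For anyone trying to work out which NIC is the right one, a quick way (a sketch using the container name from the compose file above) is to ask Docker which networks the container is attached to and match the addresses against the overlay network's subnet:

```bash
# Show the networks this container is attached to and its IP on each;
# the interface whose address sits on the shared overlay network is the
# one to reference in network.host / network.publish_host.
docker inspect -f '{{json .NetworkSettings.Networks}}' opensearch-node1
```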
I also ran into issues when I had set the memory limit too low, and when my host VMs had `vm.max_map_count` set too low.
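For anyone hitting the same limit: the OpenSearch docs call for `vm.max_map_count` of at least 262144 on each host running a node. A quick sketch:

```bash
# Raise the kernel mmap-count limit on each swarm host running an OpenSearch node
sudo sysctl -w vm.max_map_count=262144
# Persist the setting across reboots
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf
```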
I have 2 of my instances connected. The third is hitting some weird java exception errors that seem to only happen on the specific worker node... A topic for a different location though.
@peterzhuamazon It would be awesome to get more support for Docker Swarm out there. I get that the big companies all use Kube. But Kube is not exactly friendly for smaller organizations.
I almost went into a whole spiel on this topic, but this really isn't the place for it. If anyone is interested, feel free to email me. :)
Let's move this to opensearch-devops.
[Triage] @CEHENKLE @elfisher Any thoughts about onboarding to docker swarm?
I'll throw my hand in here; I too am getting this same issue. I made a post on the forum about it with my compose file, as well as log outputs.
https://forum.opensearch.org/t/multi-node-docker-setup-not-working/15235/3
Looping in @pallavipr and @bbarani for comments on supporting Docker Swarm. Thanks.
I suppose there hasn't been any movement here? I'm seeing the exact same issue with docker compose locally, so I don't think it's related to swarm, nor is it fixed, at least. My config:
```yaml
services:
  opensearch-node1:
    image: opensearchproject/opensearch:latest
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      # - plugins.security.disabled=true
      # - cluster.routing.allocation.enable=all
      - 'DISABLE_INSTALL_DEMO_CONFIG=true'
      - 'DISABLE_SECURITY_PLUGIN=true'
      - 'OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m'
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=teSt!1
    volumes:
      - opensearch-data1:/usr/share/opensearch/data
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net
      - otel-net
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
  opensearch-node2:
    image: opensearchproject/opensearch:latest
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      # - plugins.security.disabled=true
      # - cluster.routing.allocation.enable=all
      - 'DISABLE_INSTALL_DEMO_CONFIG=true'
      - 'DISABLE_SECURITY_PLUGIN=true'
      - 'OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m'
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=teSt!1
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
  # opensearch-dashboard:
  #   image: opensearchproject/opensearch-dashboards:latest
  #   ports:
  #     - 5601:5601
  #   expose:
  #     - '5601'
  #   environment:
  #     DISABLE_SECURITY_DASHBOARDS_PLUGIN: 'true'
  #     OPENSEARCH_HOSTS: '["http://opensearch-node1:9200","http://opensearch-node2:9200"]'
  #   networks:
  #     - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:
```
Seeing errors such as:
```
opensearch-node2-1 | [2024-06-19T08:28:03,609][INFO ][o.o.c.c.JoinHelper ] [opensearch-node2] failed to join {opensearch-node1}{hNhhBK9MR4q5jugObtxpRw}{edTUBzEaR1qsvz0GM4gDBg}{172.18.0.2}{172.18.0.2:9300}{dimr}{shard_indexing_pressure_enabled=true} with JoinRequest{sourceNode={opensearch-node2}{p80y6oPlSuG2MKrQltpAzA}{gtBnsp2ASKS8LPyVmM_xFA}{172.19.0.3}{172.19.0.3:9300}{dimr}{shard_indexing_pressure_enabled=true}, minimumTerm=2, optionalJoin=Optional[Join{term=3, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={opensearch-node2}{p80y6oPlSuG2MKrQltpAzA}{gtBnsp2ASKS8LPyVmM_xFA}{172.19.0.3}{172.19.0.3:9300}{dimr}{shard_indexing_pressure_enabled=true}, targetNode={opensearch-node1}{hNhhBK9MR4q5jugObtxpRw}{edTUBzEaR1qsvz0GM4gDBg}{172.18.0.2}{172.18.0.2:9300}{dimr}{shard_indexing_pressure_enabled=true}}]}
```
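Note the two different subnets in that line: `opensearch-node1` advertises 172.18.0.2 while `opensearch-node2` advertises 172.19.0.3. In this compose file `opensearch-node1` is attached to two networks (`opensearch-net` and `otel-net`), so it may be advertising an address on a network node2 cannot reach, which would match the earlier findings in this thread. A minimal sketch of pinning the advertised address, assuming the shared NIC turns out to be `eth0` (verify inside the container first):

```yaml
services:
  opensearch-node1:
    environment:
      # Hypothetical: advertise only the address on the network shared with node2;
      # the shared NIC may be eth0 or eth1 depending on attachment order.
      - network.publish_host=_eth0_
```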
**Describe the bug**
I have proven that I can get the example docker-compose file to work locally, yet when I try the exact same file using Docker Swarm mode, it will not bring the cluster up. The first node always tries to connect to itself and fails.

**To Reproduce**
Steps to reproduce the behavior:

```
docker stack deploy --prune --with-registry-auth --compose-file docker-compose.yml
```

This continues forever.

**Expected behavior**
I would expect the nodes to join the cluster and elect a cluster manager node exactly as they do in docker compose.

**Plugins**
Nothing but default.

**Host/Environment (please complete the following information):**

**Additional context**
As far as I can tell, there is nothing wrong with the Docker networking.