designermonkey opened 1 year ago
I neglected to mention that I changed all references of `opensearch-node1` to `node01`, and all references of `opensearch-node2` to `node02`.
The nodes are failing to form a cluster:
```
[2023-02-16T16:24:20,416][INFO ][o.o.c.c.JoinHelper ] [node01] failed to join {node02}{enQl9djoRA24TYJYOUGMnw}{L3OlmhvHRY2Y1HxH_wibhQ}{10.0.24.6}{10.0.24.6:9300}{dimr}{shard_indexing_pressure_enabled=true} with JoinRequest{sourceNode={node01}{Bqk8Khh8R5GjoDQaF-C-Cg}{brqDhviuRpKIrMGzzWX7Xw}{10.0.0.212}{10.0.0.212:9300}{dimr}{shard_indexing_pressure_enabled=true}, minimumTerm=70, optionalJoin=Optional[Join{term=70, lastAcceptedTerm=68, lastAcceptedVersion=18, sourceNode={node01}{Bqk8Khh8R5GjoDQaF-C-Cg}{brqDhviuRpKIrMGzzWX7Xw}{10.0.0.212}{10.0.0.212:9300}{dimr}{shard_indexing_pressure_enabled=true}, targetNode={node02}{enQl9djoRA24TYJYOUGMnw}{L3OlmhvHRY2Y1HxH_wibhQ}{10.0.24.6}{10.0.24.6:9300}{dimr}{shard_indexing_pressure_enabled=true}}]}
org.opensearch.transport.RemoteTransportException: [node02][10.0.24.6:9300][internal:cluster/coordination/join]
Caused by: org.opensearch.transport.ConnectTransportException: [node01][10.0.0.212:9300] connect_timeout[30s]
    at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1082) ~[opensearch-2.5.0.jar:2.5.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-2.5.0.jar:2.5.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
```
I haven't used Docker Swarm, but a brief investigation shows several configuration options, such as scaling and secure communications, that could conflict with the OpenSearch cluster model. Can you give a few more details about your configuration, compose file, etc.? It seems like we're trying multiple different ways of running multiple containers and having them talk to each other.
Here's a slightly modified compose file; I went right back to basics and made another discovery:
```yaml
---
version: '3'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.5.0
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_master_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data
    ports:
      - 9200:9200
      # - 9600:9600 # required for Performance Analyzer
    networks:
      - opensearch-net
  opensearch-node2:
    image: opensearchproject/opensearch:2.5.0
    container_name: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_master_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.5.0
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch-node1:9200","https://opensearch-node2:9200"]' # must be a string with no spaces when specified as an environment variable
    networks:
      - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:
```
I have discovered something exciting. If I don't bind the ports on `opensearch-node1`, it works fine, but if I uncomment the port binding, it is never able to join the cluster.
I've been experimenting to see if it is Docker networking that is causing the issue, but it isn't: other services in swarm mode can map ports perfectly well and can communicate with other containers on multiple swarm networks.
It's definitely something in the configuration of OpenSearch. I will do a little more investigating today, but I know little about how this all works.
It seems that the first node joins the docker swarm ingress network, while any others always join the service network. I have no idea what to do here.
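One workaround worth trying here (an untested sketch, not something confirmed in this thread) is host-mode port publishing, which bypasses the ingress routing mesh so that publishing a port no longer attaches the task to the ingress network. It requires the long port syntax (compose file format 3.2+):

```yaml
services:
  opensearch-node1:
    ports:
      - target: 9200      # container port
        published: 9200   # published directly on the swarm node, not via the routing mesh
        protocol: tcp
        mode: host        # bypass the ingress network entirely
```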
For reference" https://stackoverflow.com/questions/70141442/elasticsearch-cluster-doesnt-work-on-docker-swarm
After some experimentation, here are my findings:

- `network.publish_host: _eth1_` must be added (on one node).
- `network.publish_host: _eth0_` must be added (on the other node).

This ensures that the instances communicate on the same networks and therefore can see each other. This is highly unreliable, of course, and I'm wondering if there may be a better way of discovering the right network and port configuration inside the containers.
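In compose terms, that finding looks something like the snippet below (a sketch only; which interface is correct depends on how Docker orders the NICs inside each container, so the values here are illustrative):

```yaml
services:
  opensearch-node1:
    environment:
      # Illustrative: advertise the interface on the network shared with the other node
      - network.publish_host=_eth1_
  opensearch-node2:
    environment:
      - network.publish_host=_eth0_
```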
@bbarani can you please help on this?
Hi @designermonkey, we only test the docker-compose file with plain `docker-compose`; we did not test on Docker Swarm.
If you are doing a more complicated setup, you can try our Helm charts repo, which deploys to Kubernetes and is actively contributed to and tested by the community: https://github.com/opensearch-project/helm-charts
Adding @jeffh-aws to take a look at the possible options with Docker Swarm. Thanks.
@designermonkey I was having fun getting OpenSearch to work on Docker Swarm today. Your tip about the `network.publish_host` setting helped, though I ended up with `network.host: "_eth0_"`, where `eth0` was the NIC with the IP of the Docker overlay network I was putting my containers on. So thanks for posting this issue!
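For anyone trying to work out which NIC is the right one, a quick way (a sketch using the container name from the compose file above) is to ask Docker which networks the container is attached to and match the addresses against the overlay network's subnet:

```bash
# Show the networks this container is attached to and its IP on each;
# the interface whose address sits on the shared overlay network is the
# one to reference in network.host / network.publish_host.
docker inspect -f '{{json .NetworkSettings.Networks}}' opensearch-node1
```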
I also ran into issues when I had set the memory limit too low, and when my host VMs had `vm.max_map_count` set too low.
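For anyone hitting the same limit: the OpenSearch docs call for `vm.max_map_count` of at least 262144 on each host running a node. A quick sketch:

```bash
# Raise the kernel mmap-count limit on each swarm host running an OpenSearch node
sudo sysctl -w vm.max_map_count=262144
# Persist the setting across reboots
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf
```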
I have 2 of my instances connected. The third is hitting some weird java exception errors that seem to only happen on the specific worker node... A topic for a different location though.
@peterzhuamazon It would be awesome to get more support for Docker Swarm out there. I get that the big companies all use Kube. But Kube is not exactly friendly for smaller organizations.
I almost went into a whole spiel on this topic, but this really isn't the place for it. If anyone is interested, feel free to email me. :)
Let's move this to opensearch-devops.
[Triage] @CEHENKLE @elfisher Any thoughts about onboarding to docker swarm?
I'll throw my hand in here; I too am getting this same issue. I made a post on the forum about it with my compose file, as well as log outputs.
https://forum.opensearch.org/t/multi-node-docker-setup-not-working/15235/3
Looping in @pallavipr and @bbarani for comments on supporting Docker Swarm. Thanks.
I suppose there hasn't been any movement here? I'm seeing the exact same issue with docker compose locally, so I don't think it's related to swarm, nor is it fixed, at least. My config:
```yaml
services:
  opensearch-node1:
    image: opensearchproject/opensearch:latest
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      # - plugins.security.disabled=true
      # - cluster.routing.allocation.enable=all
      - 'DISABLE_INSTALL_DEMO_CONFIG=true'
      - 'DISABLE_SECURITY_PLUGIN=true'
      - 'OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m'
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=teSt!1
    volumes:
      - opensearch-data1:/usr/share/opensearch/data
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net
      - otel-net
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
  opensearch-node2:
    image: opensearchproject/opensearch:latest
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      # - plugins.security.disabled=true
      # - cluster.routing.allocation.enable=all
      - 'DISABLE_INSTALL_DEMO_CONFIG=true'
      - 'DISABLE_SECURITY_PLUGIN=true'
      - 'OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m'
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=teSt!1
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
  # opensearch-dashboard:
  #   image: opensearchproject/opensearch-dashboards:latest
  #   ports:
  #     - 5601:5601
  #   expose:
  #     - '5601'
  #   environment:
  #     DISABLE_SECURITY_DASHBOARDS_PLUGIN: 'true'
  #     OPENSEARCH_HOSTS: '["http://opensearch-node1:9200","http://opensearch-node2:9200"]'
  #   networks:
  #     - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:
```
Seeing errors such as:
```
opensearch-node2-1 | [2024-06-19T08:28:03,609][INFO ][o.o.c.c.JoinHelper ] [opensearch-node2] failed to join {opensearch-node1}{hNhhBK9MR4q5jugObtxpRw}{edTUBzEaR1qsvz0GM4gDBg}{172.18.0.2}{172.18.0.2:9300}{dimr}{shard_indexing_pressure_enabled=true} with JoinRequest{sourceNode={opensearch-node2}{p80y6oPlSuG2MKrQltpAzA}{gtBnsp2ASKS8LPyVmM_xFA}{172.19.0.3}{172.19.0.3:9300}{dimr}{shard_indexing_pressure_enabled=true}, minimumTerm=2, optionalJoin=Optional[Join{term=3, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={opensearch-node2}{p80y6oPlSuG2MKrQltpAzA}{gtBnsp2ASKS8LPyVmM_xFA}{172.19.0.3}{172.19.0.3:9300}{dimr}{shard_indexing_pressure_enabled=true}, targetNode={opensearch-node1}{hNhhBK9MR4q5jugObtxpRw}{edTUBzEaR1qsvz0GM4gDBg}{172.18.0.2}{172.18.0.2:9300}{dimr}{shard_indexing_pressure_enabled=true}}]}
```
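Note the two different subnets in that line: `opensearch-node1` advertises 172.18.0.2 while `opensearch-node2` advertises 172.19.0.3. In this compose file `opensearch-node1` is attached to two networks (`opensearch-net` and `otel-net`), so it may be advertising an address on a network node2 cannot reach, which would match the earlier findings in this thread. A minimal sketch of pinning the advertised address, assuming the shared NIC turns out to be `eth0` (verify inside the container first):

```yaml
services:
  opensearch-node1:
    environment:
      # Hypothetical: advertise only the address on the network shared with node2;
      # the shared NIC may be eth0 or eth1 depending on attachment order.
      - network.publish_host=_eth0_
```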
**Describe the bug**
I have proven that I can get the example docker-compose file to work locally, yet when I try the exact same file using Docker Swarm mode, it will not bring the cluster up. The first node always tries to connect to itself and fails.

**To Reproduce**
Steps to reproduce the behavior:

```
docker stack deploy --prune --with-registry-auth --compose-file docker-compose.yml
```

This continues forever.

**Expected behavior**
I would expect the nodes to join the cluster and elect a cluster manager node exactly as they do in docker compose.

**Plugins**
Nothing but default.

**Host/Environment (please complete the following information):**

**Additional context**
As far as I can tell, there is nothing wrong with the Docker networking.