processone / docker-ejabberd

Set of ejabberd Docker images

Clustering nodes that are on different servers #116

Open etranger7 opened 2 months ago

etranger7 commented 2 months ago

I'm using this docker image and trying to cluster 2 nodes that are on different servers, therefore 2 different public IPs. Just for testing, I successfully clustered 2 docker containers that are on the same machine.

However, when I try to define a FQDN in ERLANG_NODE_ARG, I get an error that I don't know how to overcome.

This container starts without errors (I'm skipping unrelated lines):

services:
  ej1container:
    hostname: ej1container          # containername works here too
    environment:
      - ERLANG_NODE_ARG=ej1@ej1container

This setup gives me an error:

services:
  ej1container:
    hostname: ej1container          # containername works here too
    environment:
      - ERLANG_NODE_ARG=ej1@subdomain.domain.com

It looks like the container starts normally but when I do

docker exec ej1container ejabberdctl status

I get

Failed RPC connection to the node 'ej1@subdomain.domain.com': nodedown

I already pointed the A record of subdomain.domain.com to the public IP of the VPS where this is running.

There was a similar issue https://github.com/processone/docker-ejabberd/issues/106 but I don't see how the FQDN was integrated and what the solution was.

Any help would be much appreciated.

etranger7 commented 2 months ago

Update: While the main node is running on Server A as ej1@ej1container, I tried to add Server B to it to form a cluster and ran into these issues:

badlop commented 2 months ago

> - ERLANG_NODE_ARG=ej1@subdomain.domain.com

That environment variable is read by the ejabberdctl script, which passes it to the erl virtual machine as the argument -sname (or -name when the host part of the value contains a dot). As a result, the Erlang virtual machine names itself ej1@subdomain.domain.com.


> docker exec ej1container ejabberdctl status
> Failed RPC connection to the node 'ej1@subdomain.domain.com': nodedown

I get that same problem with a similar compose file:

```yaml
version: '3.7'

services:

  main:
    image: ghcr.io/badlop/ejabberd:dependabot
    container_name: ejabberd
    hostname: ej1container
    environment:
      - ERLANG_NODE_ARG=ejabberd@subdomain.domain.com
      - ERLANG_COOKIE=dummycookie123
```

The solution in my case is to add subdomain.domain.com to /etc/hosts inside the container. That way ejabberdctl is able to connect correctly to the running node and get the status.
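One way to get that /etc/hosts entry without editing the file by hand might be the compose `extra_hosts` option, which writes entries into the container's /etc/hosts at start. This is a minimal sketch, not a tested configuration: mapping the name to 127.0.0.1 is an assumption (the thread does not say which address was used), chosen so that ejabberdctl inside the container resolves the node's host to the container itself.

```yaml
services:
  main:
    image: ghcr.io/badlop/ejabberd:dependabot
    container_name: ejabberd
    hostname: ej1container
    extra_hosts:
      # Assumption: loopback is enough for ejabberdctl running inside this
      # same container. Other machines still need their own way to resolve
      # subdomain.domain.com.
      - "subdomain.domain.com:127.0.0.1"
    environment:
      - ERLANG_NODE_ARG=ejabberd@subdomain.domain.com
      - ERLANG_COOKIE=dummycookie123
```

Setting `hostname:` to the same FQDN is another option, since Docker adds the container's own hostname to /etc/hosts.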


> ERLANG_NODE_ARG=ej1@ej1container ejabberdctl join_cluster ej1@subdomain.domain.com
> System NOT running to use fully qualified hostnames

Right: you started the node with the Erlang short node name ej1@ej1container, so you cannot then join a long node name such as ej1@subdomain.domain.com. Short and long node names cannot be mixed.

Either use:

ERLANG_NODE_ARG=ej1@ej1container ejabberdctl join_cluster ej1@ej1container

If you use this on different machines, make sure the second one knows where to find ej1container (for example, by adding it to /etc/hosts).

Or use:

ERLANG_NODE_ARG=ej1@ej1container.domain.com ejabberdctl join_cluster ej1@ej1container.domain.com

In that case, make sure Erlang can resolve ej1container.domain.com.
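For the long-name variant, the joining container on the second server could be given a hosts entry for the first node. The fragment below is only a sketch under stated assumptions: the node name `ej2@ej2container.domain.com` and the address `203.0.113.10` (a documentation-range placeholder for the first server's public IP) are hypothetical, and unrelated lines are skipped. It addresses name resolution only; it does not by itself guarantee that the two Erlang nodes can reach each other over the network.

```yaml
services:
  ejabberd:
    image: ejabberd/ecs:24.07
    # Hypothetical FQDN for the second node; the host part contains a dot,
    # so the node is started with a long name.
    hostname: ej2container.domain.com
    extra_hosts:
      # Placeholder IP: replace with the address where the first node
      # is actually reachable.
      - "ej1container.domain.com:203.0.113.10"
    environment:
      - ERLANG_NODE_ARG=ej2@ej2container.domain.com
      # The cookie must be identical on every node of the cluster.
      - ERLANG_COOKIE=dummycookie123
```

With both nodes using long names, the join command from the reply above would then be run on the second node against ej1@ej1container.domain.com.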

etranger7 commented 2 months ago

Thank you for your reply @badlop . Here is what worked for me to move past the "Failed RPC connection to the node 'ej1@subdomain.domain.com': nodedown" message and get a positive STATUS message. In the docker compose file, I used

services:
  ejabberd:
    image: ejabberd/ecs:24.07
    container_name: ejabberd
    hostname: subdomain.domain.com
    environment:
      - CTL_ON_START=status
      - ERLANG_COOKIE=[removed]
      - ERLANG_NODE_ARG=ejabberd@subdomain.domain.com

However, when I try to connect to ejabberd@subdomain.domain.com that's on Server A, from Server B, I get

Error: error
Error: "This node cannot reach that node."

When I

docker exec ejabberd bin/ejabberdctl ping ejabberd@subdomain.domain.com

from Server B, I get pang.

When I ping Server A from Server B, I can reach it with no issues.

When I

docker exec -u root ejabberd ping subdomain.domain.com

from server B to Server A, again Server A is reachable.

I feel like I'm missing something here. Again, your help is much appreciated.

etranger7 commented 1 month ago

Hi @badlop , should I re-submit this issue under the issues of https://github.com/processone/ejabberd/ ? I'm wondering whether that's being more closely monitored and whether the issues with the containers should also be submitted there. Thanks.

badlop commented 1 month ago

This is a problem with the container image, so this seems a good place for the issue.

On the other hand, it may be a problem with Docker and Erlang clustering in general rather than with ejabberd specifically, so you may also want to search for related questions outside ejabberd venues.