processone / ejabberd

Robust, Ubiquitous and Massively Scalable Messaging Platform (XMPP, MQTT, SIP Server)
https://www.process-one.net/ejabberd/

ERLANG_NODE set to just a username does not work in recent OTP versions #4288

Closed mzealey closed 1 month ago

mzealey commented 1 month ago

Commit fa12301e085562962fc865e72ad9361ba41fcb7d added the new -sname undefined way of starting the ejabberdctl module on OTP 23+, but it appears to be causing issues for us.

Running the latest master on OTP 26, the ejabberdctl script cannot connect to the node if ERLANG_NODE=ejabberd, but it works if ERLANG_NODE=ejabberd@ejabberd. This is being run in a local container whose hostname is set to ejabberd.

It appears that in the failing case, -eval "net_kernel:connect_node('ejabberd')" is being run and failing silently.

A minimal reproducible test case has the following working:

/usr/local/bin/erl -sname undefined -setcookie test-shared-cookie-for-clustering -eval "net_kernel:connect_node('ejabberd@ejabberd')" -s ejabberd_ctl -extra ejabberd
Erlang/OTP 26 [erts-14.2.5.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit:ns]

Usage: ejabberdctl [--no-timeout] [--node nodename] [--version api_version] command [arguments]

Available commands in this ejabberd node:
  abort_delete_old_messages host                                      Abort currently running delete old offline messages operation
...

But the following (which is roughly what happens when ERLANG_NODE=ejabberd) fails:

/usr/local/bin/erl -sname undefined -setcookie test-shared-cookie-for-clustering -eval "net_kernel:connect_node('ejabberd')" -s ejabberd_ctl -extra ejabberd
Erlang/OTP 26 [erts-14.2.5.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit:ns]

=ERROR REPORT==== 2-Oct-2024::06:25:08.571910 ===

** Cannot get connection id for node ejabberd

Failed RPC connection to the node ejabberd@ejabberd: nodedown

The docs for the file clearly say that this no hostname possibility is allowed, and actually it's very useful to us:

# The next variable allows to explicitly specify erlang node for ejabberd
# It can be given in different formats:
# ERLANG_NODE=ejabberd
#   Lets erlang add hostname to the node (ejabberd uses short name in this case)

badlop commented 1 month ago

As described in https://www.erlang.org/doc/system/distributed.html#nodes

A node is an executing Erlang runtime system that has been given a name, using the command-line flag -name (long names) or -sname (short names). The format of the node name is an atom name@host. name is the name given by the user. host is the full host name if long names are used, or the first part of the host name if short names are used.

A smaller reproduction example:

```
$ erl -sname e3@localhost
(e3@localhost)1>
$ erl -sname undefined -eval "net_kernel:connect_node('e3@localhost')"
(nonode@nohost)1> q().
$ erl -sname undefined -eval "net_kernel:connect_node('e3@localhost')"
(39JC1CT72Q3PG@atenea)1>
User switch command (type h for help)
 --> r e3@localhost
 --> j
   1  {shell,start,[init]}
   2* {e3@localhost,shell,start,[]}
 --> q
```

```
$ erl -sname e3
(e3@atenea)1> node().
e3@atenea
$ erl -sname undefined -eval "net_kernel:connect_node('e3@atenea')"
(nonode@nohost)1>
User switch command (type h for help)
 --> r e3@atenea
 --> j
   1  {shell,start,[init]}
   2* {e3@atenea,shell,start,[]}
 --> q
$ erl -sname undefined -eval "net_kernel:connect_node('e3')"
=ERROR REPORT==== 2-Oct-2024::17:16:20.482158 ===
** Cannot get connection id for node e3
```

In summary, some OTP interfaces accept only the user part of the node name (for example the -name and -sname command-line arguments), but others require the full node name (for example the shell r command and the net_kernel:connect_node function).

> no hostname possibility is allowed, and actually it's very useful to us:

As explained by the ejabberdctl.cfg documentation, the Erlang documentation, and this experiment: the Erlang node name always contains the host part. Some tools accept only the user part and then add the host part themselves.

In other words, even if you configure only ERLANG_NODE=ejabberd, the actual node name is ejabberd@machinename, and that full node name must be provided when calling net_kernel:connect_node.

The obvious solution would be to check in ejabberdctl whether ERLANG_NODE has only the user part and, in that case, add the host part, so that all use cases work correctly.

> no hostname possibility is allowed, and actually it's very useful to us

Is it useful because it allows you to use the same configuration file on several machines that have different machine names? In that case, the obvious solution should work for you too, right?

Example patch:

diff --git a/ejabberdctl.template b/ejabberdctl.template
index 83ec7e1bd..21be6430f 100755
--- a/ejabberdctl.template
+++ b/ejabberdctl.template
@@ -66,6 +66,7 @@ done
 # shellcheck source=ejabberdctl.cfg.example
 [ -f "$EJABBERDCTL_CONFIG_PATH" ] && . "$EJABBERDCTL_CONFIG_PATH"
 [ -n "$ERLANG_NODE_ARG" ] && ERLANG_NODE="$ERLANG_NODE_ARG"
+[ "$ERLANG_NODE" = "${ERLANG_NODE%@*}" ] && ERLANG_NODE="$ERLANG_NODE@$(hostname -s)"
 [ "$ERLANG_NODE" = "${ERLANG_NODE%.*}" ] && S="-s"
 : "${SPOOL_DIR:="{{spool_dir}}"}"
 : "${EJABBERD_LOG_PATH:="$LOGS_DIR/ejabberd.log"}"
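
The check added by the patch can be tried in isolation. The following is a minimal sketch, not code from ejabberdctl: qualify_node is a hypothetical helper, and its second argument (the host part) exists only so the behaviour can be shown without depending on the local machine name. It assumes a POSIX shell and, when no host is passed, a hostname command that supports -s, as the patch does.

```shell
#!/bin/sh
# Sketch only: qualify_node is a hypothetical helper, not part of ejabberdctl.
# $1 = configured node name
# $2 = host part to append (optional; defaults to the output of `hostname -s`)
qualify_node() {
    node=$1
    host=${2:-$(hostname -s)}
    # "${node%@*}" removes the shortest trailing "@..." match. If the result
    # equals the original string, the name has no host part yet.
    if [ "$node" = "${node%@*}" ]; then
        node="$node@$host"
    fi
    printf '%s\n' "$node"
}

qualify_node ejabberd ejabberd                    # prints ejabberd@ejabberd
qualify_node ejabberd@ejabberd otherhost          # prints ejabberd@ejabberd
qualify_node ejabberd@xmpp.example.com otherhost  # prints ejabberd@xmpp.example.com
```

Names that already contain an @ pass through unchanged, so existing configurations with an explicit host part should not be affected by this kind of check.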

mzealey commented 1 month ago

Yes, exactly. This sounds like it would work. However, because I don't know the ejabberdctl script in much depth, I'm not sure whether modifying $ERLANG_NODE globally in the script like this would cause other issues later on (i.e. whether the change should instead be scoped to a new variable used specifically for net_kernel:connect_node), or whether it is fine to do globally.
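
For comparison, the scoped alternative raised here could look roughly like the sketch below. It is an illustration under stated assumptions, not how ejabberdctl is written: CONNECT_NODE and CONNECT_EVAL are hypothetical variable names, the default value ejabberd is only there to make the snippet self-contained, and hostname -s is assumed to be available as in the example patch.

```shell
#!/bin/sh
# Sketch only: CONNECT_NODE and CONNECT_EVAL are hypothetical names that do
# not exist in the real ejabberdctl script.
ERLANG_NODE=${ERLANG_NODE:-ejabberd}

case $ERLANG_NODE in
    *@*) CONNECT_NODE=$ERLANG_NODE ;;                   # host part already present
    *)   CONNECT_NODE="$ERLANG_NODE@$(hostname -s)" ;;  # append the short host name
esac

# Only the expression passed to erl -eval uses the qualified name;
# ERLANG_NODE itself keeps the value from the configuration.
CONNECT_EVAL="net_kernel:connect_node('$CONNECT_NODE')"

printf 'configured node: %s\n' "$ERLANG_NODE"
printf 'eval expression: %s\n' "$CONNECT_EVAL"
```

The trade-off is that every later use of the node name has to pick the right variable, whereas the global change in the example patch touches a single line.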