Peer discovery using inter-node TLS not working with 3.13.x

lukebakken commented 1 week ago

Describe the bug

Starting a 3-node cluster using classic peer discovery and inter-node TLS does not work with 3.13.x (tested with 3.13.0 and 3.13.3)

Reproduction steps

The following will attempt to create a 3.13 cluster with inter-node TLS:

git clone https://github.com/lukebakken/docker-rabbitmq-cluster.git
cd rabbitmq-docker-cluster
git checkout tls
make DOCKER_FRESH=true up

However, it will get stuck on peer discovery. Logs from each node are attached here: rmq0.log.txt rmq1.log.txt rmq2.log.txt

Also note that the logs show that multiple hidden nodes are started which doesn't seem right.

Hidden nodes appear to be started with the correct arguments:

rmq1-1  | 2024-06-22 16:53:21.794523+00:00 [info] <0.276.0> DB: virgin node -> run peer discovery
rmq1-1  | 2024-06-22 16:53:21.794597+00:00 [debug] <0.276.0> Peer discovery: backend returned the following configuration:
rmq1-1  | 2024-06-22 16:53:21.794597+00:00 [debug] <0.276.0>   {ok,{['rabbit@rmq2.local','rabbit@rmq1.local','rabbit@rmq1.local'],disc}}
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0> Peer discovery: peer node arguments: #{args =>
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                            ["-ssl_dist_optfile",
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                             "/etc/rabbitmq/inter_node_tls.config",
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                             "-proto_dist","inet_tls","-pa",
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                             "/opt/erlang/lib/erlang/lib/ssl-11.1.4/ebin",
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                             "-boot","start_sasl","-hidden"],
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                        name => "rabbit-130-58",
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                        host => "rmq1.local",
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                        connection => standard_io,
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                        wait_boot => infinity,
rmq1-1  | 2024-06-22 16:53:21.794763+00:00 [debug] <0.276.0>                                        longnames => true}
rmq1-1  | 2024-06-22 16:53:22.047061+00:00 [debug] <0.276.0> Peer discovery: using temporary hidden node 'rabbit-130-58@rmq1.local' to query discovered peers properties

Expected behavior

A 3-node cluster is formed.

Additional context

Using RabbitMQ 3.12 does work:

make DOCKER_FRESH=true RABBITMQ_DOCKER_TAG=rabbitmq:3.12-management up

Notes:

To clean out RabbitMQ data, use make clean, as it will set the necessary directory permissions first.

lukebakken commented 1 week ago

This is strange, because I do remember testing this with 3.13 when the peer discovery changes were being made.

Next step will be to reproduce this using the generic-unix package so @dumbbell can look into it.

This could just be my test environment 🤷‍♂️

dumbbell commented 1 week ago

Do you get additional log messages in the startup_* files that capture stdout & stderr? The temporary hidden node won't forward its logs messages to the parent node's logger. I'm wondering if they end up somewhere.

lukebakken commented 1 week ago

I will check that out in my docker environment.

I did post a follow-up comment the other day that I must not have saved. I created the following repo which will bring up a 3-node cluster using inter-node-TLS and rabbitmq/rabbitmq-server@main:

https://github.com/lukebakken/rabbitmq-server-11534

It worked fine the other day, but I will return to it today to make sure, as well as test with the 3.13.x and 3.12.x branches.

This issue may be just my docker project, or something else in my environment. Stay tuned.

lukebakken commented 1 week ago

Running everything on localhost using rabbitmq/rabbitmq-server main and v3.13.x branches work fine, but using docker and 3.13.3 images just doesn't work. If I use 3.12.14 images my cluster starts fine.

@dumbbell here are logs from starting up via docker compose: make-up.txt rmq0.log rmq1.log rmq2.log

lukebakken commented 1 week ago

AHA! @dumbbell I think these are the log files you're looking for: rabbit@rmq0.local.log rabbit@rmq1.local.log rabbit@rmq2.local.log

Looks like this is the situation:

2024-06-25 22:15:41.286649+00:00 [info] <0.276.0> Peer discovery: node 'rabbit@rmq1.local' selected for auto-clustering but its DB layer is not ready; waiting before retrying...
2024-06-25 22:15:42.622963+00:00 [info] <0.276.0> Peer discovery: node 'rabbit@rmq1.local' selected for auto-clustering but its DB layer is not ready; waiting before retrying...

...investigating...

lukebakken commented 1 week ago

If I use the "regular", non TLS-encrypted distributed Erlang version of my project, it works fine for clustering with 3.13.3 (https://github.com/lukebakken/docker-rabbitmq-cluster/tree/main). So, this issue must have something to do with the hidden peer node and proto_dist inet_tls 🤔

lukebakken commented 6 days ago

Logs from starting lukebakken/docker-rabbitmq-cluster using this docker image: rmq0.log rmq1.log rmq2.log

dumbbell commented 5 days ago

One thing to note is what the classic config peer discovery plugin returns when asked for the discovered nodes: it returns rmq2, rmq1 and the local node (rmq0, rmq1 or rmq2). This means rmq1 and rmq2 know about themselves twice, but never hear of rmq0.

I'm not sure this can explain the problem, but I'm curious to know why that is.

lukebakken commented 5 days ago

Aha, I don't think I would have noticed that!

> rg -A 1 'backend returned the following config'
rmq2.log
2732:rmq2-1  | 2024-06-26 18:22:45.313021+00:00 [debug] <0.276.0> Peer discovery: backend returned the following configuration:
2733-rmq2-1  | 2024-06-26 18:22:45.313021+00:00 [debug] <0.276.0>   {ok,{['rabbit@rmq2.local','rabbit@rmq1.local','rabbit@rmq2.local'],disc}}

rmq1.log
2390:rmq1-1  | 2024-06-26 18:20:43.817877+00:00 [debug] <0.276.0> Peer discovery: backend returned the following configuration:
2391-rmq1-1  | 2024-06-26 18:20:43.817877+00:00 [debug] <0.276.0>   {ok,{['rabbit@rmq2.local','rabbit@rmq1.local','rabbit@rmq1.local'],disc}}

rmq0.log
2390:rmq0-1  | 2024-06-26 18:20:43.746033+00:00 [debug] <0.276.0> Peer discovery: backend returned the following configuration:
2391-rmq0-1  | 2024-06-26 18:20:43.746033+00:00 [debug] <0.276.0>   {ok,{['rabbit@rmq2.local','rabbit@rmq1.local','rabbit@rmq0.local'],disc}}

...that's definitely an issue.

lukebakken commented 5 days ago

@dumbbell I owe you 🍻 the next time we meet up:

https://github.com/lukebakken/docker-rabbitmq-cluster/blob/69791a36b664ed448b970ca0f2ead1742e4822b4/rmq1/rabbitmq.conf#L12-L14

I don't think there are enough facepalms for this.

dumbbell commented 5 days ago

Event if there is a configuration error, RabbitMQ shouldn't leave you alone. You had no easy way to spot the problem. RabbitMQ should give some warnings and not block like this.

lukebakken commented 4 days ago

At least now I know what to look for in the debug output. Should we log a warning if a peer discovery backend returns duplicate identical node names?

dumbbell commented 4 days ago

Difficult to say. I don't know if a peer discovery backend could return duplicate without it to be a wrong value.

About the classic config one, we could log several a warning:

if there is a duplicate
if the current node is missing from the list

We could also return a fatal error if the current node is not in the list.

I don't know enough how they are used in production unfortunately.

rabbitmq / rabbitmq-server