Closed: lukebakken closed this issue 5 days ago
This is strange, because I do remember testing this with 3.13 when the peer discovery changes were being made.
Next step will be to reproduce this using the generic-unix package so @dumbbell can look into it.
This could just be my test environment 🤷‍♂️
Do you get additional log messages in the startup_* files that capture stdout & stderr? The temporary hidden node won't forward its log messages to the parent node's logger. I'm wondering if they end up somewhere.
I will check that out in my docker environment.
I did post a follow-up comment the other day that I must not have saved. I created the following repo, which will bring up a 3-node cluster using inter-node TLS and rabbitmq/rabbitmq-server@main: https://github.com/lukebakken/rabbitmq-server-11534
It worked fine the other day, but I will return to it today to make sure, as well as test with the 3.13.x and 3.12.x branches.
This issue may be just my docker project, or something else in my environment. Stay tuned.
Running everything on localhost using the rabbitmq/rabbitmq-server main and v3.13.x branches works fine, but using Docker and 3.13.3 images just doesn't work. If I use 3.12.14 images, my cluster starts fine.
@dumbbell here are logs from starting up via docker compose: make-up.txt rmq0.log rmq1.log rmq2.log
AHA! @dumbbell I think these are the log files you're looking for: rabbit@rmq0.local.log rabbit@rmq1.local.log rabbit@rmq2.local.log
Looks like this is the situation:
2024-06-25 22:15:41.286649+00:00 [info] <0.276.0> Peer discovery: node 'rabbit@rmq1.local' selected for auto-clustering but its DB layer is not ready; waiting before retrying...
2024-06-25 22:15:42.622963+00:00 [info] <0.276.0> Peer discovery: node 'rabbit@rmq1.local' selected for auto-clustering but its DB layer is not ready; waiting before retrying...
...investigating...
If I use the "regular", non-TLS-encrypted distributed Erlang version of my project, clustering works fine with 3.13.3 (https://github.com/lukebakken/docker-rabbitmq-cluster/tree/main). So, this issue must have something to do with the hidden peer node and proto_dist inet_tls 🤔
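For context on the proto_dist inet_tls part: inter-node TLS is enabled by passing extra flags to the Erlang runtime, typically via rabbitmq-env.conf as described in the RabbitMQ inter-node TLS guide. The file paths below are illustrative, not the reproduction repo's actual ones, and the point of interest here is that the temporary hidden node used by peer discovery presumably needs to inherit these same distribution flags:

```shell
# rabbitmq-env.conf -- illustrative inter-node TLS setup (example paths).
# -proto_dist inet_tls switches Erlang distribution to TLS;
# -ssl_dist_optfile points at the TLS options (certs, keys, verification).
# ERL_SSL_PATH is assumed to be set to the Erlang ssl ebin directory.
SERVER_ADDITIONAL_ERL_ARGS="-pa $ERL_SSL_PATH -proto_dist inet_tls -ssl_dist_optfile /etc/rabbitmq/inter_node_tls.config"
RABBITMQ_CTL_ERL_ARGS="-pa $ERL_SSL_PATH -proto_dist inet_tls -ssl_dist_optfile /etc/rabbitmq/inter_node_tls.config"
```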
Logs from starting lukebakken/docker-rabbitmq-cluster using this docker image: rmq0.log rmq1.log rmq2.log
One thing to note is what the classic config peer discovery plugin returns when asked for the discovered nodes: it returns rmq2, rmq1 and the local node (rmq0, rmq1 or rmq2). This means rmq1 and rmq2 know about themselves twice, but never hear of rmq0.
I'm not sure this can explain the problem, but I'm curious to know why that is.
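To illustrate the shape of the problem (a Python sketch of hypothetical behaviour, not RabbitMQ's actual Erlang code): if the configured node list omits rmq0 and the backend unconditionally appends the local node, then each node reports itself twice when it is already in the list, and rmq0 is never reported to its peers:

```python
# Hypothetical model of the discovery results described above.
# Assumption: the configured list in the reproduction repo omits rmq0.
configured = ["rabbit@rmq2.local", "rabbit@rmq1.local"]

def discovered(local_node, configured_nodes):
    # Assumption for illustration: the local node is appended
    # unconditionally, so it is duplicated if already configured.
    return configured_nodes + [local_node]

for node in ("rabbit@rmq0.local", "rabbit@rmq1.local", "rabbit@rmq2.local"):
    print(node, "->", discovered(node, configured))
```

This reproduces the lists seen in the logs below: rmq1 and rmq2 each appear twice in their own results, while rmq0's result is the only one containing rmq0 at all.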
Aha, I don't think I would have noticed that!
> rg -A 1 'backend returned the following config'
rmq2.log
2732:rmq2-1 | 2024-06-26 18:22:45.313021+00:00 [debug] <0.276.0> Peer discovery: backend returned the following configuration:
2733-rmq2-1 | 2024-06-26 18:22:45.313021+00:00 [debug] <0.276.0> {ok,{['rabbit@rmq2.local','rabbit@rmq1.local','rabbit@rmq2.local'],disc}}
rmq1.log
2390:rmq1-1 | 2024-06-26 18:20:43.817877+00:00 [debug] <0.276.0> Peer discovery: backend returned the following configuration:
2391-rmq1-1 | 2024-06-26 18:20:43.817877+00:00 [debug] <0.276.0> {ok,{['rabbit@rmq2.local','rabbit@rmq1.local','rabbit@rmq1.local'],disc}}
rmq0.log
2390:rmq0-1 | 2024-06-26 18:20:43.746033+00:00 [debug] <0.276.0> Peer discovery: backend returned the following configuration:
2391-rmq0-1 | 2024-06-26 18:20:43.746033+00:00 [debug] <0.276.0> {ok,{['rabbit@rmq2.local','rabbit@rmq1.local','rabbit@rmq0.local'],disc}}
...that's definitely an issue.
@dumbbell I owe you 🍻 the next time we meet up:
I don't think there are enough facepalms for this.
Even if there is a configuration error, RabbitMQ shouldn't leave you on your own. You had no easy way to spot the problem. RabbitMQ should emit some warnings instead of blocking like this.
At least now I know what to look for in the debug output. Should we log a warning if a peer discovery backend returns duplicate identical node names?
Difficult to say. I don't know whether a peer discovery backend could return duplicates without that being a wrong value.
For the classic config backend, we could log several warnings:
We could also return a fatal error if the current node is not in the list.
Unfortunately, I don't know enough about how the backends are used in production.
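The checks discussed above could look roughly like this (a Python sketch for illustration only; RabbitMQ is written in Erlang and every name here is made up):

```python
def validate_discovered_nodes(local_node, nodes):
    """Warn on duplicate node names; fail hard if the local node is absent."""
    warnings = []
    duplicates = sorted({n for n in nodes if nodes.count(n) > 1})
    if duplicates:
        warnings.append(
            f"peer discovery backend returned duplicate node names: {duplicates}")
    if local_node not in nodes:
        # The fatal-error case suggested above: the list cannot be valid
        # if the node evaluating it is not part of it.
        raise ValueError(
            f"local node {local_node} missing from discovered nodes {nodes}")
    return warnings

# The broken case from the logs: rmq1 sees itself twice.
print(validate_discovered_nodes(
    "rabbit@rmq1.local",
    ["rabbit@rmq2.local", "rabbit@rmq1.local", "rabbit@rmq1.local"]))
```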
Describe the bug
Starting a 3-node cluster using classic peer discovery and inter-node TLS does not work with 3.13.x (tested with 3.13.0 and 3.13.3).
Reproduction steps
The following will attempt to create a 3.13 cluster with inter-node TLS:
However, it will get stuck on peer discovery. Logs from each node are attached here: rmq0.log.txt rmq1.log.txt rmq2.log.txt
Also note that the logs show multiple hidden nodes being started, which doesn't seem right.
Hidden nodes appear to be started with the correct arguments:
Expected behavior
A 3-node cluster is formed.
Additional context
Using RabbitMQ 3.12 does work:
Notes:
make clean, as it will set the necessary directory permissions first.