rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

`khepri_db`: `function_clause` in `rabbit_federation_exchange_link_sup_sup` on network disconnect #12274

Open lukebakken opened 1 week ago

lukebakken commented 1 week ago

Describe the bug

Disconnecting one node of a 3-node Khepri-enabled cluster from the network eventually results in an unexpected function_clause error:

rmq0-function_clause-stack.txt

The same error is also raised by the rabbit_federation_queue_link_sup_sup process. My test project enables the rabbitmq_federation plugin but does not create any federation links.

Reproduction steps

Expected behavior

No error.

Additional context

This does not appear to affect the normal operation of PerfTest.

In addition, the following log lines appear:

rmq2-1       | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0>
rmq2-1       | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0> ** Cannot get connection id for node 'rabbit@rmq2.local'
rmq2-1       | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0>
rmq1-1       | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0>
rmq1-1       | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0> ** Cannot get connection id for node 'rabbit@rmq1.local'
rmq1-1       | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0>

These log lines originate in OTP itself:

lbakken@shostakovich ~/development/erlang/otp (master =)
$ git grep -i 'cannot get connection'
lib/kernel/src/net_kernel.erl:1051:            error_logger:error_msg("~n** Cannot get connection id for node ~w~n",
lib/kernel/src/net_kernel.erl:1156:                error_logger:error_msg("~n** Cannot get connection id for node ~w~n",
lib/kernel/src/net_kernel.erl:1545:                    error_logger:error_msg("~n** Cannot get connection id for node ~w~n",

What's odd is that each error message is logged by the very node it refers to 🤔

the-mikedavis commented 1 week ago

The rabbit_db_msup module and its callers will need some updates to handle potential timeouts when interacting with Khepri, similar to the changes in https://github.com/rabbitmq/rabbitmq-server/pull/11785

The changes will probably be trickier for this module: the commands don't come from a user, so it's not a simple matter of bubbling up and returning an error.
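
To illustrate why a system-originated command needs a local policy instead of an error returned to a user, here is a minimal, language-agnostic sketch in Python (the real module is Erlang). Everything in it is an assumption made for illustration: `StoreTimeout`, `run_with_retries`, and the fixed-delay retry policy are hypothetical and are not taken from rabbit_db_msup, Khepri, or the linked PR. It shows only one possible approach: retry a bounded number of times, then hand the caller an explicit timeout result that it can match on.

```python
import time


class StoreTimeout(Exception):
    """Hypothetical stand-in for a timed-out metadata store command."""


def run_with_retries(operation, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    Returns ("ok", value) on success and ("error", "timeout") once every
    attempt has timed out, so the caller always gets a value it can match on.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return ("ok", operation())
        except StoreTimeout:
            if attempt < attempts:
                sleep(delay_s)  # fixed pause; a real policy might back off
    return ("error", "timeout")
```

The caller still has to decide what to do with `("error", "timeout")`, for example log it and try again later, since there is no user request to fail.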

mkuratczyk commented 1 week ago

I just hit the same error with rabbit_shovel_dyn_worker_sup_sup, which makes sense, since it's also a mirrored supervisor.