rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

Feature flags detection sometimes triggers `erpc,noconnection` #8346

Closed lukebakken closed 2 months ago

lukebakken commented 1 year ago

Describe the bug

Logs

```
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: on node `rabbit@rabbit2`:
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: exception error: {erpc,noconnection}
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in function erpc:call/5 (erpc.erl, line 710)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_ff_controller:rpc_call/5 (rabbit_ff_controller.erl, line 1123)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from lists:foreach_1/2 (lists.erl, line 1442)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_feature_flags:check_node_compatibility_v1/2 (rabbit_feature_flags.erl, line 1599)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_mnesia:check_rabbit_consistency/2 (rabbit_mnesia.erl, line 1017)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_mnesia:check_consistency/5 (rabbit_mnesia.erl, line 948)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from rabbit_mnesia:check_cluster_consistency/2 (rabbit_mnesia.erl, line 746)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0> Feature flags: in call from lists:foldl/3 (lists.erl, line 1350)
2023-05-24 01:39:55.227067-07:00 [error] <0.231.0>
2023-05-24 01:39:55.243345-07:00 [error] <0.277.0> Mnesia(rabbit@rabbit3): ** ERROR ** Mnesia on rabbit@rabbit3 could not connect to node(s) [rabbit@rabbit2]
```

Reproduction steps

See above.

Expected behavior

No erpc error - either it is re-tried, or it is not tried until disterl is definitely up and running.

Additional context

Observed in the following situations:

michaelklishin commented 1 year ago

I think the expected behavior should be "the operation is retried N times" :)
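
As a rough illustration of "retried N times", here is an operator-side sketch in POSIX shell. It is not how the server would do it (a real fix would wrap the `erpc` call inside the Erlang code); the `retry` function, its argument order, and the attempt and delay values are all hypothetical.

```shell
# Hypothetical helper, not part of RabbitMQ.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have failed.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Success on any attempt ends the loop immediately.
        if "$@"; then
            return 0
        fi
        # Give up once the attempt budget is exhausted.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (assumes the peer is named rabbit@rabbit2, as in the logs above):
#   retry 5 2 rabbitmq-diagnostics -n rabbit@rabbit2 ping
```

A fixed delay keeps the sketch short; a production retry would more likely use a backoff and distinguish transient connection errors from permanent ones.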

kepakiano commented 11 months ago

We stumbled over this through user error in #10100. As requested, here are the step-by-step instructions to get the same error message. Bear in mind that this happened to me only because I forgot the "rabbit@" prefix when calling `join_cluster`:

```
$ docker network create test_network
1947438e01b9cced503ba3044be1afb1f5a6225fb64d265257b3547b947cad64
$ docker run -d --network test_network --name rabbit1 --privileged -v $(pwd)/cookie:/var/lib/rabbitmq/.erlang.cookie pivotalrabbitmq/rabbitmq:main-otp-max-bazel
b29a66ec3350cb7ee60975d3a1b8c0bd7918313f30833be76a113d0ea0c78590
$ docker container ls
CONTAINER ID        IMAGE                                         COMMAND                  CREATED             STATUS              PORTS                                                                                                                      NAMES
b29a66ec3350        pivotalrabbitmq/rabbitmq:main-otp-max-bazel   "docker-entrypoint.s…"   38 seconds ago      Up 36 seconds       1883/tcp, 4369/tcp, 5551-5552/tcp, 5671-5672/tcp, 8883/tcp, 15670-15676/tcp, 15691-15692/tcp, 25672/tcp, 61613-61614/tcp   rabbit1
$ docker exec -it b2 /bin/bash
root@b29a66ec3350:/# rabbitmqctl join_cluster this_node_does_not_exist
Clustering node rabbit@b29a66ec3350 with this_node_does_not_exist

13:03:53.487 [error] Feature flags: error while running:
Feature flags:   rabbit_ff_controller:running_nodes[]
Feature flags: on node `this_node_does_not_exist@b29a66ec3350`:
Feature flags:   exception error: {erpc,noconnection}
Feature flags:     in function  erpc:call/5 (erpc.erl, line 710)
Feature flags:     in call from rabbit_ff_controller:rpc_call/5 (rabbit_ff_controller.erl, line 1377)
Feature flags:     in call from rabbit_ff_controller:list_nodes_clustered_with/1 (rabbit_ff_controller.erl, line 477)
Feature flags:     in call from rabbit_ff_controller:check_node_compatibility_task/2 (rabbit_ff_controller.erl, line 389)
Feature flags:     in call from rabbit_db_cluster:can_join/1 (rabbit_db_cluster.erl, line 65)
Feature flags:     in call from rabbit_db_cluster:join/2 (rabbit_db_cluster.erl, line 97)
Feature flags:     in call from erpc:execute_call/4 (erpc.erl, line 589)

Error:
{:aborted_feature_flags_compat_check, {:error, {:erpc, :noconnection}}}
root@b29a66ec3350:/# 
```

michaelklishin commented 11 months ago

It's not clear to me from this log what exactly logs this message: the node, or the shell where `rabbitmqctl join_cluster this_node_does_not_exist` is executed?

In any case, `join_cluster` should bail early if it cannot contact the node it is supposed to join.
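
Until something like that exists in `join_cluster` itself, the specific mistake in the reproduction above (a target without the `name@host` form) can be caught before the command runs. A minimal sketch, assuming a hypothetical wrapper; `is_full_node_name` and `join_cluster_checked` are illustrative names, not RabbitMQ commands, and the check only looks at the shape of the name, not whether the node is reachable.

```shell
# Hypothetical helper: returns 0 when the argument looks like name@host,
# 1 otherwise.
is_full_node_name() {
    case "$1" in
        *@*@*) return 1 ;;  # more than one "@"
        ?*@?*) return 0 ;;  # non-empty name and non-empty host
        *)     return 1 ;;  # no "@", or an empty side
    esac
}

# Hypothetical wrapper: refuse to call join_cluster with a bare name.
join_cluster_checked() {
    if ! is_full_node_name "$1"; then
        echo "refusing to join '$1': expected a name like rabbit@hostname" >&2
        return 1
    fi
    rabbitmqctl join_cluster "$1"
}
```

With this wrapper, `join_cluster_checked this_node_does_not_exist` would fail immediately with a readable message instead of surfacing `{erpc,noconnection}`.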

CarvalhoRod commented 2 months ago

I don't know if you checked the log on the node that is running when you try to connect, but it's worth checking.

The problem may be your /var/lib/rabbitmq/.erlang.cookie: it must have the same value on all nodes in the cluster.

michaelklishin commented 2 months ago

@CarvalhoRod thank you for chiming in but this is RabbitMQ 101 and @lukebakken is a core team engineer. You can be sure such basics were accounted for.

That said, with https://github.com/rabbitmq/rabbitmq-server/pull/8411 this can probably be closed. If we get more details or observe failure scenarios that are specific to the code rather than the setup, we can always file a new issue.

michaelklishin commented 2 months ago

Setting the milestone to 3.13.7 because that's the most recent 3.13.x release at the time of writing.

michaelklishin commented 2 months ago

Note that the relevant PR was reverted in https://github.com/rabbitmq/rabbitmq-server/pull/11507, so I will unset the milestone to reduce confusion.