rabbitmq / rabbitmq-federation

RabbitMQ Federation plugin
https://www.rabbitmq.com/
Other
40 stars 21 forks source link

Cluster node won't start after reboot when setup federated exchange queue #111

Closed rerime closed 4 years ago

rerime commented 4 years ago

Setup federation between two clusters

# On cluster A
# federation user
  rabbitmqctl add_vhost vhost
  rabbitmqctl add_user fed_user fed_password
  rabbitmqctl set_permissions --vhost vhost fed_user '^fed' '^fed' '^fed'    
# upstream fed_a_b
  rabbitmqctl set_parameter --vhost=vhost federation-upstream fed_a_b "{\"uri\":\"amqp://fed_user:fed_password@host_a/vhost\"}"
# a -> b policies
  rabbitmqctl set_policy --vhost=vhost fed_a_b_exchange_policy 'fed\.exchange\.a_b' '{"federation-upstream":"fed_a_b"}'
  rabbitmqctl set_policy --vhost=vhost fed_b_a_exchange_policy 'fed\.exchange\.b_a' '{"federation-upstream":"fed_a_b"}'
  rabbitmqctl set_policy --vhost=vhost fed_a_b_events_policy   '^fed\.queue\.events\.a_b' '{"federation-upstream":"fed_a_b"}'

# On cluster B
# federation user
  rabbitmqctl add_vhost vhost
  rabbitmqctl add_user fed_user fed_password
  rabbitmqctl set_permissions --vhost vhost fed_user '^fed' '^fed' '^fed'
# upstream fed_b_a
  rabbitmqctl set_parameter --vhost=vhost federation-upstream fed_b_a "{\"uri\":\"amqp://fed_user:fed_password@host_b/vhost\"}"
# b -> a policies
  rabbitmqctl set_policy --vhost=vhost fed_b_a_exchange_policy 'fed\.exchange\.b_a' '{"federation-upstream":"fed_b_a"}'
  rabbitmqctl set_policy --vhost=vhost fed_a_b_exchange_policy 'fed\.exchange\.a_b' '{"federation-upstream":"fed_b_a"}'
  rabbitmqctl set_policy --vhost=vhost fed_b_a_events_policy   '^fed\.queue\.events\.b_a' '{"federation-upstream":"fed_b_a"}'

Add queue on hq-0 cluster A

Screenshot 2020-08-04 at 18 34 25

Reboot node ha-0 on cluster A where federated queue is located, node won't start

Log

``` Starting broker...2020-08-04 14:56:21.327 [info] <0.269.0> node : rabbit@rabbitmq.local home dir : /var/lib/rabbitmq config file(s) : /etc/rabbitmq/advanced.config : /etc/rabbitmq/rabbitmq.conf cookie hash : NrIYU1tF9/JIWz5dzdEPqQ== log(s) : database dir : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq.local 2020-08-04 14:56:21.339 [info] <0.269.0> Running boot step pre_boot defined by app rabbit 2020-08-04 14:56:21.339 [info] <0.269.0> Running boot step rabbit_core_metrics defined by app rabbit 2020-08-04 14:56:21.339 [info] <0.269.0> Running boot step rabbit_alarm defined by app rabbit 2020-08-04 14:56:21.348 [info] <0.405.0> Memory high watermark set to 244 MiB (256000000 bytes) of 5961 MiB (6250688512 bytes) total 2020-08-04 14:56:21.359 [info] <0.407.0> Enabling free disk space monitoring 2020-08-04 14:56:21.359 [info] <0.407.0> Disk free limit set to 50MB 2020-08-04 14:56:21.363 [info] <0.269.0> Running boot step code_server_cache defined by app rabbit 2020-08-04 14:56:21.364 [info] <0.269.0> Running boot step file_handle_cache defined by app rabbit 2020-08-04 14:56:21.365 [info] <0.414.0> Limiting to approx 1048479 file handles (943629 sockets) 2020-08-04 14:56:21.365 [info] <0.415.0> FHC read buffering: OFF 2020-08-04 14:56:21.365 [info] <0.415.0> FHC write buffering: ON 2020-08-04 14:56:21.366 [info] <0.269.0> Running boot step worker_pool defined by app rabbit 2020-08-04 14:56:21.366 [info] <0.400.0> Will use 2 processes for default worker pool 2020-08-04 14:56:21.366 [info] <0.400.0> Starting worker pool 'worker_pool' with 2 processes in it 2020-08-04 14:56:21.367 [info] <0.269.0> Running boot step database defined by app rabbit 2020-08-04 14:56:21.696 [info] <0.269.0> Waiting for Mnesia tables for 30000 ms, 9 retries left 2020-08-04 14:56:21.830 [info] <0.269.0> Successfully synced tables from a peer 2020-08-04 14:56:21.834 [info] <0.269.0> Waiting for Mnesia tables for 30000 ms, 9 retries left 2020-08-04 14:56:21.834 [info] <0.269.0> Successfully synced tables from a peer 2020-08-04 14:56:21.835 [info] <0.269.0> Waiting for Mnesia tables for 30000 ms, 9 retries left 2020-08-04 14:56:21.835 [info] <0.269.0> Successfully synced tables from a peer 2020-08-04 14:56:21.865 [info] <0.269.0> Waiting for Mnesia tables for 30000 ms, 9 retries left 2020-08-04 14:56:21.865 [info] <0.269.0> Successfully synced tables from a peer 2020-08-04 14:56:21.878 [info] <0.269.0> Will register with peer discovery backend rabbit_peer_discovery_k8s 2020-08-04 14:56:21.951 [info] <0.269.0> Running boot step database_sync defined by app rabbit 2020-08-04 14:56:21.951 [info] <0.269.0> Running boot step feature_flags defined by app rabbit 2020-08-04 14:56:21.951 [info] <0.269.0> Running boot step codec_correctness_check defined by app rabbit 2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step external_infrastructure defined by app rabbit 2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step rabbit_registry defined by app rabbit 2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step rabbit_auth_mechanism_cr_demo defined by app rabbit 2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step rabbit_queue_location_random defined by app rabbit 2020-08-04 14:56:21.952 [info] <0.269.0> Running boot step rabbit_event defined by app rabbit 2020-08-04 14:56:21.953 [info] <0.269.0> Running boot step rabbit_auth_mechanism_amqplain defined by app rabbit 2020-08-04 14:56:21.953 [info] <0.269.0> Running boot step rabbit_auth_mechanism_plain defined by app rabbit 2020-08-04 14:56:21.953 [info] <0.269.0> Running boot step rabbit_exchange_type_direct defined by app rabbit 2020-08-04 14:56:21.953 [info] <0.269.0> Running boot step rabbit_exchange_type_fanout defined by app rabbit 2020-08-04 14:56:21.954 [info] <0.269.0> Running boot step rabbit_exchange_type_headers defined by app rabbit 2020-08-04 14:56:21.954 [info] <0.269.0> Running boot step rabbit_exchange_type_topic defined by app rabbit 2020-08-04 14:56:21.954 [info] <0.269.0> Running boot step rabbit_mirror_queue_mode_all defined by app rabbit 2020-08-04 14:56:21.954 [info] <0.269.0> Running boot step rabbit_mirror_queue_mode_exactly defined by app rabbit 2020-08-04 14:56:21.955 [info] <0.269.0> Running boot step rabbit_mirror_queue_mode_nodes defined by app rabbit 2020-08-04 14:56:21.955 [info] <0.269.0> Running boot step rabbit_priority_queue defined by app rabbit 2020-08-04 14:56:21.955 [info] <0.269.0> Priority queues enabled, real BQ is rabbit_variable_queue 2020-08-04 14:56:21.955 [info] <0.269.0> Running boot step rabbit_queue_location_client_local defined by app rabbit 2020-08-04 14:56:21.955 [info] <0.269.0> Running boot step rabbit_queue_location_min_masters defined by app rabbit 2020-08-04 14:56:21.956 [info] <0.269.0> Running boot step kernel_ready defined by app rabbit 2020-08-04 14:56:21.956 [info] <0.269.0> Running boot step rabbit_sysmon_minder defined by app rabbit 2020-08-04 14:56:21.956 [info] <0.269.0> Running boot step rabbit_epmd_monitor defined by app rabbit 2020-08-04 14:56:21.957 [info] <0.517.0> epmd monitor knows us, inter-node communication (distribution) port: 25672 2020-08-04 14:56:21.958 [info] <0.269.0> Running boot step guid_generator defined by app rabbit 2020-08-04 14:56:21.959 [info] <0.269.0> Running boot step rabbit_node_monitor defined by app rabbit 2020-08-04 14:56:22.079 [info] <0.521.0> Starting rabbit_node_monitor 2020-08-04 14:56:22.079 [info] <0.269.0> Running boot step delegate_sup defined by app rabbit 2020-08-04 14:56:22.080 [info] <0.269.0> Running boot step rabbit_memory_monitor defined by app rabbit 2020-08-04 14:56:22.082 [info] <0.269.0> Running boot step core_initialized defined by app rabbit 2020-08-04 14:56:22.082 [info] <0.269.0> Running boot step upgrade_queues defined by app rabbit 2020-08-04 14:56:22.160 [info] <0.269.0> Running boot step rabbit_connection_tracking defined by app rabbit 2020-08-04 14:56:22.160 [info] <0.269.0> Running boot step rabbit_connection_tracking_handler defined by app rabbit 2020-08-04 14:56:22.160 [info] <0.269.0> Running boot step rabbit_exchange_parameters defined by app rabbit 2020-08-04 14:56:22.161 [info] <0.269.0> Running boot step rabbit_mirror_queue_misc defined by app rabbit 2020-08-04 14:56:22.162 [info] <0.269.0> Running boot step rabbit_policies defined by app rabbit 2020-08-04 14:56:22.163 [info] <0.269.0> Running boot step rabbit_policy defined by app rabbit 2020-08-04 14:56:22.164 [info] <0.269.0> Running boot step rabbit_queue_location_validator defined by app rabbit 2020-08-04 14:56:22.164 [info] <0.269.0> Running boot step rabbit_quorum_memory_manager defined by app rabbit 2020-08-04 14:56:22.164 [info] <0.269.0> Running boot step rabbit_vhost_limit defined by app rabbit 2020-08-04 14:56:22.164 [info] <0.269.0> Running boot step recovery defined by app rabbit 2020-08-04 14:56:22.299 [info] <0.555.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@rabbitmq.local/msg_stores/vhosts/5JIAVEDLEJHIGKZFVWKV87LER' for vhost 'vhost' exists 2020-08-04 14:56:22.303 [info] <0.555.0> Starting message stores for vhost 'vhost' 2020-08-04 14:56:22.303 [info] <0.559.0> Message store "5JIAVEDLEJHIGKZFVWKV87LER/msg_store_transient": using rabbit_msg_store_ets_index to provide index 2020-08-04 14:56:22.308 [info] <0.555.0> Started message store of type transient for vhost 'vhost' 2020-08-04 14:56:22.308 [info] <0.563.0> Message store "5JIAVEDLEJHIGKZFVWKV87LER/msg_store_persistent": using rabbit_msg_store_ets_index to provide index 2020-08-04 14:56:22.310 [warning] <0.563.0> Message store "5JIAVEDLEJHIGKZFVWKV87LER/msg_store_persistent": rebuilding indices from scratch 2020-08-04 14:56:22.312 [info] <0.555.0> Started message store of type persistent for vhost 'vhost' 2020-08-04 14:56:22.474 [error] <0.418.0> ** Generic server <0.418.0> terminating ** Last message in was {'$gen_cast',{submit_async,#Fun,<0.417.0>}} ** When Server state == undefined ** Reason for termination == ** {{case_clause,undefined},[{rabbit_federation_util,pgname,1,[{file,"src/rabbit_federation_util.erl"},{line,74}]},{rabbit_federation_exchange_link,x,1,[{file,"src/rabbit_federation_exchange_link.erl"},{line,265}]},{rabbit_federation_exchange_link,cast,2,[{file,"src/rabbit_federation_exchange_link.erl"},{line,254}]},{rabbit_federation_exchange,add_binding,3,[{file,"src/rabbit_federation_exchange.erl"},{line,62}]},{rabbit_exchange,'-callback/4-lc$^1/1-0-',5,[{file,"src/rabbit_exchange.erl"},{line,71}]},{rabbit_exchange,callback,4,[{file,"src/rabbit_exchange.erl"},{line,72}]},{rabbit_binding,x_callback,4,[{file,"src/rabbit_binding.erl"},{line,700}]},{rabbit_binding,'-recover_semi_durable_route/3-fun-0-',3,[{file,"src/rabbit_binding.erl"},{line,124}]}]} 2020-08-04 14:56:22.474 [error] <0.418.0> CRASH REPORT Process <0.418.0> with 0 neighbours exited with reason: no case clause matching undefined in rabbit_federation_util:pgname/1 line 74 in gen_server2:terminate/3 line 1183 2020-08-04 14:56:22.474 [error] <0.416.0> Supervisor worker_pool_sup had child 1 started with worker_pool_worker:start_link(worker_pool) at <0.418.0> exit with reason no case clause matching undefined in rabbit_federation_util:pgname/1 line 74 in context child_terminated ```

System

``` OS: Linux Uptime (seconds): 72270 RabbitMQ version: 3.8.5 Node name: rabbit@rmqrb-rabbitmq-ha-0 Erlang configuration: Erlang/OTP 23 [erts-11.0.3] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:64] Erlang processes: 981 used, 1048576 limit Scheduler run queue: 2 Cluster heartbeat timeout (net_ticktime): 60 Plugins Enabled plugin file: /etc/rabbitmq/enabled_plugins Enabled plugins: * rabbitmq_prometheus * rabbitmq_peer_discovery_k8s * rabbitmq_federation_management * rabbitmq_peer_discovery_common * rabbitmq_shovel_management * rabbitmq_shovel * amqp10_client * rabbitmq_federation * amqp10_common * rabbitmq_auth_backend_ldap * prometheus * rabbitmq_consistent_hash_exchange * rabbitmq_management * amqp_client * rabbitmq_web_dispatch * cowboy * cowlib * rabbitmq_management_agent ```

michaelklishin commented 4 years ago

Duplicate of #109.

michaelklishin commented 4 years ago

Something in your environment results in plugin's defaults not being loaded. You can explicitly set the undefined value of rabbitmq_federation.pgroup_name_cluster_id in advanced.config to false:

[
  {rabbitmq_federation, [
    {pgroup_name_cluster_id, false}
  ]}
].

or upgrade to 3.8.6 where all values other than true would be treated as false.

michaelklishin commented 4 years ago

Looking at the log I have a good guess as to what may be going on here. This seems to be a yet another "definition import before plugins are activated" problem that 3.8.6 addressed for different reasons. See https://github.com/rabbitmq/rabbitmq-server/issues/2384.

rerime commented 4 years ago

Same result with 3.8.6

michaelklishin commented 4 years ago

@rerime the exception in the provided log snippet states that rabbit_federation_util:pgname/1 runs into an exception where a case statement does not handle the effective configuration value, undefined. As of https://github.com/rabbitmq/rabbitmq-federation/pull/109/files and https://github.com/rabbitmq/rabbitmq-federation/commit/a6308bc7775985f7c899bd612825603ed5b7e305, the plugin will treat any value other than true loaded from its config file as a false.

So I find it hard to believe that this scenario is realistic with 3.8.6. I will try to reproduce using the provided steps above.

rerime commented 4 years ago

@michaelklishin Sorry for inconvenience. Check again, seems it was fixed.

michaelklishin commented 4 years ago

I cannot reproduce with 3.8.6 with the following setup:

rerime commented 4 years ago

@michaelklishin Thx for your support. We've fixed another bug with management.load_definitions = /path/to/definitions/file.json instead of load_definitions = /path/to/definitions/file.json We've missed in very useful docs https://www.rabbitmq.com/definitions.html /etc/definitions/definitions.json

{
  "global_parameters": [

  ],
  "users": [
    {
      "name": "management",
      "password": "xxx",
      "tags": "management"
    },
    {
      "name": "guest",
      "password": "xxx",
      "tags": "administrator"
    }
  ],
  "vhosts": [
    {
      "name": "/"
    }
  ],
  "permissions": [
    {
      "user": "guest",
      "vhost": "/",
      "configure": ".*",
      "read": ".*",
      "write": ".*"
    }
  ],
  "topic_permissions": [

  ],
  "parameters": [
    {"value":{"ack-mode":"on-confirm","trust-user-id":false,"uri":"amqp://localhost/"},
        "vhost":"/","component":"federation-upstream","name":"test"}
  ],
  "policies": [

  ],
  "queues": [

  ],
  "exchanges": [

  ],
  "bindings": [

  ]
}

Log

2020-08-07 09:37:19.903 [error] <0.522.0> BOOT FAILED
BOOT FAILED
2020-08-07 09:37:19.904 [error] <0.522.0> ===========
===========
2020-08-07 09:37:19.904 [error] <0.522.0> Exception during startup:
Exception during startup:
2020-08-07 09:37:19.905 [error] <0.522.0> 

2020-08-07 09:37:19.906 [error] <0.522.0>     rabbit:do_run_postlaunch_phase/0 line 905
    rabbit:do_run_postlaunch_phase/0 line 905
    lists:foreach/2 line 1338
    rabbit:'-do_run_postlaunch_phase/0-fun-12-'/1 line 907
2020-08-07 09:37:19.906 [error] <0.522.0>     lists:foreach/2 line 1338
2020-08-07 09:37:19.907 [error] <0.522.0>     rabbit:'-do_run_postlaunch_phase/0-fun-12-'/1 line 907
    rabbit_boot_steps:run_boot_steps/1 line 20
2020-08-07 09:37:19.907 [error] <0.522.0>     rabbit_boot_steps:run_boot_steps/1 line 20
2020-08-07 09:37:19.907 [error] <0.522.0>     rabbit_boot_steps:'-run_boot_steps/1-lc$^0/1-0-'/1 line 19
2020-08-07 09:37:19.907 [error] <0.522.0>     rabbit_boot_steps:run_step/2 line 46
2020-08-07 09:37:19.908 [error] <0.522.0>     rabbit_boot_steps:'-run_step/2-lc$^0/1-0-'/2 line 43
    rabbit_boot_steps:'-run_boot_steps/1-lc$^0/1-0-'/1 line 19
    rabbit_boot_steps:run_step/2 line 46
    rabbit_boot_steps:'-run_step/2-lc$^0/1-0-'/2 line 43
exit:{error,<<"Validation failed\n\ncomponent federation-upstream not found\n (//federation-upstream/test)">>}
2020-08-07 09:37:19.908 [error] <0.522.0> exit:{error,<<"Validation failed\n\ncomponent federation-upstream not found\n (//federation-upstream/test)">>}

2020-08-07 09:37:19.909 [error] <0.522.0> 
2020-08-07 09:37:21.124 [info] <0.522.0> Stopping RabbitMQ applications and their dependencies in the following order:
    rabbitmq_federation_management
    rabbitmq_shovel_management
    rabbitmq_management
    rabbitmq_shovel
    rabbitmq_federation
    amqp_client
    rabbitmq_prometheus
    rabbitmq_web_dispatch
    cowboy
    cowlib
    rabbitmq_management_agent
    rabbitmq_peer_discovery_k8s
    rabbitmq_peer_discovery_common
    rabbitmq_auth_backend_ldap
    rabbitmq_consistent_hash_exchange
    rabbit
    rabbitmq_prelaunch
    rabbit_common
    sysmon_handler
    os_mon
    ra
    mnesia
michaelklishin commented 4 years ago

@rerime which is https://github.com/rabbitmq/rabbitmq-server/issues/2384, so you want to upgrade to 3.8.6 either way.