rabbitmq / chef-cookbook

Development repository for Chef cookbook RabbitMQ
https://supermarket.chef.io/cookbooks/rabbitmq
Apache License 2.0

[reopen of #125 as long discussion ] Node not join to cluster #359

Open chrisduong opened 8 years ago

chrisduong commented 8 years ago

Hi,

I'm using rabbitmq cookbook v4.7.0 with the latest RabbitMQ version, 3.6.1. I noticed that the LWRP rabbitmq_cluster joins the node to the cluster only if the node is not already running in a cluster.

However, whenever the RabbitMQ server starts, it runs as a single-node cluster whose cluster name is the node's own name.

This is the cluster status when node2 first starts up:

[root@node2 ~]# rabbitmqctl cluster_status
Cluster status of node rabbit@node2 ...
[{nodes,[{disc,[rabbit@node2]}]},
 {running_nodes,[rabbit@node2]},
 {cluster_name,<<"rabbit@node2">>},
 {partitions,[]},
 {alarms,[{rabbit@node2,[]}]}]

This means the check joined_cluster?(var_node_name, var_cluster_status) always returns true, so Chef logs a warning and does not join the node to the cluster:

Chef::Log.warn("[rabbitmq_cluster] Node is already member of #{current_cluster_name(var_cluster_status)}. Joining cluster will be skipped.")
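To make the problem concrete, here is a minimal sketch of such a check. It is not the cookbook's actual code: the helper bodies are assumptions written for illustration, and only the name joined_cluster? and the cluster_status output are taken from this report. It shows that a freshly started node always finds its own name in running_nodes.

```ruby
# Hypothetical re-implementation for illustration only; the cookbook's
# real helpers may parse the status differently.

# Return the node names listed under running_nodes in the output of
# `rabbitmqctl cluster_status`.
def running_nodes(cluster_status)
  match = cluster_status.match(/\{running_nodes,\[([^\]]*)\]\}/)
  return [] unless match
  # Erlang may quote atoms with single quotes; strip them.
  match[1].split(',').map { |name| name.strip.delete("'") }
end

# True when node_name appears in the running_nodes list.
def joined_cluster?(node_name, cluster_status)
  running_nodes(cluster_status).include?(node_name)
end

# Status of node2 right after its first start, as quoted above.
STATUS = '[{nodes,[{disc,[rabbit@node2]}]}, ' \
         '{running_nodes,[rabbit@node2]}, ' \
         '{cluster_name,<<"rabbit@node2">>}, ' \
         '{partitions,[]}, {alarms,[{rabbit@node2,[]}]}]'

# Checking the node's own name always succeeds on a standalone node ...
puts joined_cluster?('rabbit@node2', STATUS) # => true
# ... while the node it should join is not there yet.
puts joined_cluster?('rabbit@node1', STATUS) # => false
```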

The LWRP only joins the first node in the array, so it makes more sense to check the cluster status from that node only (to prevent failures when joining) than to check the running_nodes reported by the joining node.

sadowskik commented 8 years ago

I'm just dealing with exactly the same issue.

If this check were omitted for every broker besides the first one, stop_app would be invoked unnecessarily on each run, making the other brokers unavailable.

When the node is already a member of the cluster, var_node_name_to_join is always present in the running_nodes list. The most straightforward solution would be to just change this line to: joined_cluster?(var_node_name_to_join, var_cluster_status).

Unfortunately, it's not that simple :) The cluster_status result is different when there is a partition, and I'm not really sure whether it's desirable to try rejoining the cluster whenever the first broker is partitioned from the rest of the cluster.

To briefly sum up, we have two paths to follow from here:

  1. Simple: check if the first broker is on the running_nodes list. If it is, the current broker has already been clustered
  2. More comprehensive: consider various partition scenarios
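
A rough sketch of the simple path, with a first hook for the comprehensive one. The helper names clustered_with? and partitions_reported? are hypothetical, the CLUSTERED sample status is made up for illustration, and the parsing assumes the Erlang-term output format quoted earlier in this thread.

```ruby
# Illustrative sketch only; not the cookbook's code.

# Return the node names listed under running_nodes.
def running_nodes(cluster_status)
  match = cluster_status.match(/\{running_nodes,\[([^\]]*)\]\}/)
  return [] unless match
  match[1].split(',').map { |name| name.strip.delete("'") }
end

# Path 1: the local node counts as clustered when the node it is meant
# to join shows up in the local running_nodes list.
def clustered_with?(node_name_to_join, cluster_status)
  running_nodes(cluster_status).include?(node_name_to_join)
end

# Starting point for path 2: report whether the partitions entry is
# present and non-empty. What to do in that case is left open here.
def partitions_reported?(cluster_status)
  cluster_status.include?('{partitions,') &&
    !cluster_status.match?(/\{partitions,\[\s*\]\}/)
end

# Standalone node2, as in the original report.
STANDALONE = '[{nodes,[{disc,[rabbit@node2]}]}, ' \
             '{running_nodes,[rabbit@node2]}, {partitions,[]}]'
# Made-up sample of node2 after it has joined node1.
CLUSTERED = '[{nodes,[{disc,[rabbit@node1,rabbit@node2]}]}, ' \
            '{running_nodes,[rabbit@node1,rabbit@node2]}, {partitions,[]}]'

puts clustered_with?('rabbit@node1', STANDALONE) # => false, join needed
puts clustered_with?('rabbit@node1', CLUSTERED)  # => true, skip the join
puts partitions_reported?(CLUSTERED)             # => false
```

With this check, an already clustered broker skips stop_app on later runs, while a standalone broker still proceeds to join.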

DavidKaz commented 8 years ago

@chrisduong @sadowskik consider this solution https://github.com/jjasghar/rabbitmq/issues/387

akadoya commented 8 years ago

I have created a pull request for this issue: https://github.com/jjasghar/rabbitmq/pull/380. At least this works for me.