Closed mauricioalarcon closed 9 years ago
I think the problem is that, when the package gets installed, the service starts up right away, before Chef has put the config and cookie into place. The docs say that automatically joining the cluster only happens the first time it starts, and only if the config is set up before it starts. After that, it requires "rabbitmqctl reset" to clear out the state and try to join the cluster.
I apologize, I was using an old version of the cookbook. Let me try out the new one that includes the reset step.
I have this issue as well on cookbook 3.2.2. Is there a way to delay the service start until after the configs are in place?
Otherwise I'm already writing a wrapper cookbook b/c our firm needs a more custom rabbitmq.config, so I'm rewinding this cookbook via the wrapper. I can include a reset call in my wrapper if need be I suppose.
Thanks for the reply. Indeed, there's something odd in the way the nodes join the cluster. When I spin up one node, wait for it to come up, and then launch another one, both nodes see each other without any issue.
I didn't do any "rabbitmqctl reset", so I guess the issue is just timing. But I'm eager to see your results with the reset step in place.
Cheers
In my case, after I updated the cookbook to the version that includes the node reset, I discovered that some of my nodes were running different versions, which prevented the automatic clustering after the node reset. I had installed some of them from the Ubuntu repo, and the later one from the RabbitMQ repo.
The docs say that automatically joining the cluster only happens the first time it starts, and only if the config is set up before it starts. After that, it requires "rabbitmqctl reset" to clear out the state and try to join the cluster. You have to add some logic for each node to join the cluster with rabbitmqctl join_cluster. I have done something like this in my own cookbook, which wraps this one. Please let me know how I can provide that piece. I don't want to do a pull request, since it's not an elegant solution and I would love some feedback.
To resolve this issue, I wrote a cookbook that runs rabbitmqctl join_cluster: https://supermarket.chef.io/cookbooks/rabbitmq-cluster
@sunggun-yu any chance we can roll this up into the main cookbook? it seems odd to call a wrapper cookbook to do something the main one should take care of.
+1, please include a way to join a cluster (or make the auto-cluster feature work as intended) in this cookbook.
+1
@sunggun-yu any thoughts?
I can work on integrating this @jjasghar
@jjasghar oh, I'm so sorry for checking your message so late. That's a good idea. If you don't mind, I'll merge it into the main cookbook.
@jjasghar, @cmluciano
I'm working on it now. I'm also changing some concepts in the rabbitmq cookbook. I'll report (or document) the changes later.
@jjasghar, @cmluciano The major changes I'm currently working on are listed here (more may come :-)
changing the attribute name master_node_name to cluster_name: with this, the RabbitMQ cluster could support a custom cluster name (like rabbit_dev)
@jjasghar, @cmluciano
Sorry for the confusion in my previous comment.
removing master/slave: there is actually no concept of master and slave in a RabbitMQ cluster. However, a node should join the first node in the cluster, so I'll keep master/slave.
changing the attribute name master_node_name to cluster_name (so that the RabbitMQ cluster could support a custom cluster name, like rabbit_dev): this did not work. I misunderstood this.
The code is committed in https://github.com/sunggun-yu/rabbitmq/tree/feature/cluster. I hope I can send a pull request soon, after I finish some more testing and documentation.
Thank you.
What is the targeted behaviour when the first node is not working (outage or maintenance) ? Can't we just point at any cluster endpoint when joining ?
On my wrapper I did the following resource :
rabbitmq_cluster "rabbit@#{node['rabbitmq']['cluster_disk_nodes'][0].split('@')[1]}" do
node_type 'slave'
cluster_node_type 'disc'
action :join
end
There's no master type resource at all. I'm always targeting the first node of my list, and it's working, even for the first node of the cluster or the last one when the first was shut down (I'll test it again today).
I've just tested this and it worked perfectly : 1) Create the first node (rabbit2). At the end of the chef run, cluster node array is :
["rabbit@rabbit2"]
So we can say that "rabbit2" should be the "master" at this point ...
2) Add a second node (rabbit3). It joins the cluster and then our array is :
["rabbit@rabbit2", "rabbit@rabbit3"]
3) Add the "rabbit1" node. It comes first because the chef search that fills my cluster nodes array is sorted. So ... rabbit1 should become the new "master" and everything should blow up ... but no !
Rabbit2 is still the RMQ master and my array is now :
["rabbit@rabbit1", "rabbit@rabbit2", "rabbit@rabbit3"]
Conclusion: I think we can really get rid of master/slave information and work with any node of the cluster :-/ Am I wrong ?
Edit : the array changed after adding node1, so next time rabbit2 & 3 will converge they will trigger some resources and maybe restart / reload ... I know :)
Last post :
3bis) reconverge 2 & 3 nodes
4) stop rabbitmq on rabbit1
5) add rabbit4 node.
[2015-03-02T09:51:47+00:00] ERROR: rabbitmq-cluster[rabbit@rabbit1] (itprod-rabbitmq::default line 43) had an error: Mixlib::ShellOut::ShellCommandFailed: execute[rabbitmqctl stop_app && rabbitmqctl join_cluster --ram rabbit@rabbit1 && rabbitmqctl start_app] (/var/chef/cache/cookbooks/rabbitmq-cluster/providers/default.rb line 60) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '2'
---- Begin output of rabbitmqctl stop_app && rabbitmqctl join_cluster --ram rabbit@rabbit1 && rabbitmqctl start_app ----
STDOUT: Stopping node rabbit@rabbit4 ...
Clustering node rabbit@rabbit4 with rabbit@rabbit1 ...
STDERR: Error: unable to connect to nodes [rabbit@rabbit1]: nodedown
DIAGNOSTICS
===========
attempted to contact: [rabbit@rabbit1]
rabbit@rabbit1:
* connected to epmd (port 4369) on rabbit1
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbit1
* suggestion: start the node
current node details:
- node name: 'rabbitmqctl-13196@rabbit4'
- home dir: /var/lib/rabbitmq
- cookie hash: sSZxjY3Kv/yi/dXqo2scmw==
---- End output of rabbitmqctl stop_app && rabbitmqctl join_cluster --ram rabbit@rabbit1 && rabbitmqctl start_app ----
Ran rabbitmqctl stop_app && rabbitmqctl join_cluster --ram rabbit@rabbit1 && rabbitmqctl start_app returned 2
[2015-03-02T09:51:47+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)
It fails :'( because rabbit1 is the first node of the array and is currently down. There must be a way to make it work in any case. Including the "master" concept is nonsense and forces us to do weird things, when it doesn't simply freeze or break the cluster.
Edit : ATM I think i'll test the RMQ port of each node of my array to determine which one can be joined, before i'll try to join the cluster
The solution would probably be to pass an array of cluster nodes to the LWRP and let it choose which node to run its join operation against (the LWRP would test & validate the node)?
This is by far the smartest solution IMO. @sunggun-yu : What do you think about that strategy ?
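A minimal Ruby sketch of the "pass an array and let the LWRP pick a reachable node" idea, for discussion only. The helper names (host_of, port_open?, first_joinable_node) are hypothetical and not part of any cookbook, and probing the AMQP port 5672 is an assumption about what "test the RMQ port" would mean in practice. Note that the failure log above shows epmd answering on 4369 while the node itself was down, so the epmd port alone would not be a reliable signal.

```ruby
require 'socket'

# Hypothetical helper: host part of an Erlang node name such as "rabbit@rabbit1".
def host_of(node_name)
  node_name.split('@', 2).last
end

# Assumption: a node counts as joinable when its AMQP port accepts a TCP
# connection within the timeout. Any connection error means "not joinable".
def port_open?(host, port = 5672, timeout = 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue StandardError
  false
end

# Return the first node in the list that is not ourselves and answers the
# probe, or nil when no candidate is reachable. The probe is injectable so the
# selection logic can be exercised without a network.
def first_joinable_node(cluster_nodes, self_name, probe: method(:port_open?))
  cluster_nodes.find do |name|
    name != self_name && probe.call(host_of(name))
  end
end
```

With rabbit1 down, first_joinable_node(["rabbit@rabbit1", "rabbit@rabbit2", "rabbit@rabbit3"], "rabbit@rabbit4") would skip rabbit1 and return the next node that answers. Whether silently falling back to another node is desirable is exactly the open question in this thread.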
@BarthV
Thank you for your testing and the details. I totally agree with removing the master/slave concept from the clustering cookbook. (However, it was hard to remove.)
I'll test your scenario and describe the scenarios I have covered so far, along with my opinion.
@BarthV
Using an array for the node list is a good idea. However, I think there are pros and cons. Also, I can't agree with choosing another node in the list when the first node is not responding.
We can use a first-node concept instead of master/slave. We can also reduce the number of attributes in the cluster cookbook: the array of cluster nodes clearly says which node is the master, and the cluster name as well.
We need to keep updating the cluster_nodes list. For example, when we want to add rabbit4, rabbit[1-3] should update their cluster_nodes value. We can use a data_bag for this, but we cannot force people to use one.
It may cause unexpected behavior in the cluster. I have attached a diagram below, and I've tested it. Please take a look at the 3rd scenario (3-3). The same issue could happen even if we use cluster_nodes[0] as master, but that is predictable; using an arbitrary node in the list is not.
I would like to go with the [:rabbitmq][:clustering][:cluster_nodes]=['rabbit@rabbit1', 'rabbit@rabbit2', 'rabbit@rabbit3'] attribute, where the first node is mandatory and the other nodes are optional.
@jjasghar can we redefine [:rabbitmq][:cluster] = true as [:rabbitmq][:cluster][:enable] = true ?
@sunggun-yu : First of all, thank you for this specs work.
After reading these posts, I thought that it might be better to store independently the cluster name (static) and cluster nodes array (which moves with time).
So the cluster name would be initialized with the very first cluster node and then will never change. Maybe it's a too "stupid simple" approach but I like it :-)
edit : I'm testing it
OPTION 1 : Joining cluster with rabbitmqctl join_cluster
We start with a 2-node cluster (rabbit@rabbit1 and rabbit@rabbit2) :
root@rabbit1:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
{running_nodes,[rabbit@rabbit2,rabbit@rabbit1]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
root@rabbit2:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit2 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
{running_nodes,[rabbit@rabbit1,rabbit@rabbit2]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
Let's make rabbit4 join the party :
root@rabbit4:~# dpkg -i rabbitmq-server_3.4.4-1_all.deb
[...]
root@rabbit4:~# rabbitmqctl stop_app
Stopping node rabbit@rabbit4 ...
root@rabbit4:~# echo "<my_cookie>" > /var/lib/rabbitmq/.erlang.cookie
root@rabbit4:~# rabbitmqctl join_cluster rabbit@rabbit1
Clustering node rabbit@rabbit4 with rabbit@rabbit1 ...
root@rabbit4:~# rabbitmqctl start_app
Starting node rabbit@rabbit4 ...
root@rabbit1:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit4]}]},
{running_nodes,[rabbit@rabbit4,rabbit@rabbit2,rabbit@rabbit1]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
root@rabbit2:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit2 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit4]}]},
{running_nodes,[rabbit@rabbit4,rabbit@rabbit1,rabbit@rabbit2]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
root@rabbit4:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit4 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit4]}]},
{running_nodes,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit4]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
Conclusion : We don't care about cluster name. We only need to point to single running node of the cluster.
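The manual sequence above (stop_app, join_cluster, start_app) could be wrapped as in this Ruby sketch. It only builds the command string seen in the failure log earlier in this thread and parses the running_nodes term out of cluster_status text, so that a join can be skipped when the target is already listed. The function names are hypothetical, and the regex assumes the Erlang-term output format shown in this thread (RabbitMQ 3.4.x); other versions may print cluster status differently.

```ruby
# Build the shell command used to join a running cluster member.
# ram: true adds the --ram flag, as in the failing command logged above.
def join_command(target_node, ram: false)
  join = ['rabbitmqctl', 'join_cluster']
  join << '--ram' if ram
  join << target_node
  ['rabbitmqctl stop_app', join.join(' '), 'rabbitmqctl start_app'].join(' && ')
end

# Extract node names from the {running_nodes,[...]} term of
# `rabbitmqctl cluster_status` output. Returns [] when the term is absent.
def running_nodes(cluster_status_output)
  match = cluster_status_output.match(/\{running_nodes,\[([^\]]*)\]\}/)
  return [] unless match

  match[1].split(',').map { |name| name.strip.delete("'") }
end

# True when the target already appears among the running nodes, in which
# case a recipe could skip the join entirely.
def already_clustered_with?(cluster_status_output, target_node)
  running_nodes(cluster_status_output).include?(target_node)
end
```

A recipe would typically run the status check as a guard and execute join_command(...) only when it returns false, which keeps repeated Chef runs from stopping the app needlessly.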
OPTION 2 : Joining cluster with RabbitMQ autocluster feature
We start (again) with a 2-node cluster (rabbit@rabbit1 and rabbit@rabbit2) :
root@rabbit1:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
{running_nodes,[rabbit@rabbit2,rabbit@rabbit1]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
root@rabbit2:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit2 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
{running_nodes,[rabbit@rabbit1,rabbit@rabbit2]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
Let's make rabbit4 join the party :
root@rabbit4:~# dpkg -i rabbitmq-server_3.4.4-1_all.deb
[...]
root@rabbit4:~# rabbitmqctl stop_app
Stopping node rabbit@rabbit4 ...
root@rabbit4:~# echo "<my_cookie>" > /var/lib/rabbitmq/.erlang.cookie
root@rabbit4:~# rabbitmqctl reset
Resetting node rabbit@rabbit4 ...
Edit the cluster_nodes array in the /etc/rabbitmq/rabbitmq.config file :
[...]
{rabbit, [
{cluster_nodes, {['rabbit@rabbit3','rabbit@rabbit1','rabbit@rabbit2','rabbit@rabbit4'], disc}},
{cluster_partition_handling,ignore},
{tcp_listen_options, [binary, {packet,raw},
{reuseaddr,true},
{backlog,128},
[...]
Restart rabbit4 :
root@rabbit4:~# rabbitmqctl start_app
Starting node rabbit@rabbit4 ...
root@rabbit4:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit4 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit4]}]},
{running_nodes,[rabbit@rabbit2,rabbit@rabbit1,rabbit@rabbit4]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
After that :
root@rabbit1:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit4]}]},
{running_nodes,[rabbit@rabbit4,rabbit@rabbit2,rabbit@rabbit1]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
root@rabbit2:~# rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit2 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit4]}]},
{running_nodes,[rabbit@rabbit4,rabbit@rabbit1,rabbit@rabbit2]},
{cluster_name,<<"rabbit@rabbit2.labs.acme.com">>},
{partitions,[]}]
Conclusion : We can totally get rid of all these considerations and let RMQ do all the work for us at the very first start of RMQ service in a new node. And (again) cluster name is absolutely -not- important.
Edit : RabbitMQ4 starting log :
=INFO REPORT==== 5-Mar-2015::11:06:29 ===
Node 'rabbit@rabbit1' selected for auto-clustering
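For option 2, the cluster_nodes entry in rabbitmq.config has to be generated from a list of node names. A small Ruby sketch of that rendering step follows; the function name cluster_nodes_term is hypothetical, and the output simply mirrors the config line shown in the walkthrough above rather than any template shipped with the cookbook.

```ruby
# Render the cluster_nodes term for the classic Erlang-term rabbitmq.config.
# node_type is the type the local node will take when it joins: disc or ram.
def cluster_nodes_term(node_names, node_type = 'disc')
  unless %w[disc ram].include?(node_type)
    raise ArgumentError, "unknown node type: #{node_type}"
  end

  # Node names contain '@', so they are written as quoted Erlang atoms.
  quoted = node_names.map { |name| "'#{name}'" }.join(',')
  "{cluster_nodes, {[#{quoted}], #{node_type}}}"
end
```

Listing nodes that do not exist yet (rabbit3 in the walkthrough) did not prevent rabbit4 from joining there, which fits the remark that one reachable listed node is enough.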
There are no master nodes in RabbitMQ clusters. Individual queues have masters (originally the node to which the client declaring it was connected). When a cluster restarts the last node to shut down is "special" in that it should be started first. Otherwise, all nodes are equal. You can cluster with any existing cluster node as long as your node has the same Erlang cookie.
Until the new declarative clustering plugin is released to the public (no promises or dates), it makes sense for this cookbook to list cluster nodes in the RabbitMQ config (option 2 in the comment above) and delay RabbitMQ service [re-]start until the cookie is in place. If there's anything the RabbitMQ team can do to make this easier in our packages, let us know on rabbitmq-users.
@michaelklishin : thanks for this comment.
The question for us is: "When and how do we trigger a reset / auto-join for a node?"
I vote for a reset right after the dpkg -i operation, and only then, to prevent nodes from erasing themselves at a very bad time.
You don't want to reset nodes that are already cluster members in a cookbook. Newly added nodes don't need a reset (there is nothing to reset on them) until their first start (in which a new database, which assumes no clustering, is initialised).
So, resetting after the package was installed and the service was started sounds reasonable to me.
By the way, docs on auto-clustering.
The more I think about this, the more I am convinced that the right thing to do is to make sure rabbitmq-server is not started before we can be sure rabbitmq.config and the Erlang cookie are in place. In that case newly added nodes should join the cluster fine as long as one of the nodes listed in cluster_nodes is reachable.
FTR, in the alternative solution we have at Pivotal, nodes wait for the seed node to become available, which makes things a lot easier to automate.
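The "do not start before the config and cookie are in place" rule can be expressed as a simple guard. This is a Ruby sketch with a hypothetical helper name (ready_to_start?); in a recipe it would be wired into an only_if guard or similar on the service resource, which is not shown here. The paths are passed in rather than hard-coded, since they are an assumption about the install layout.

```ruby
# True only when both files exist as regular files and are non-empty.
# Intended as a pre-start check: starting rabbitmq-server earlier initialises
# a standalone database, and auto-clustering is then skipped.
def ready_to_start?(config_path, cookie_path)
  [config_path, cookie_path].all? do |path|
    File.file?(path) && !File.zero?(path)
  end
end
```

On the Debian package used in this thread the arguments would presumably be /etc/rabbitmq/rabbitmq.config and /var/lib/rabbitmq/.erlang.cookie, the two paths that appear in the walkthrough above.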
@BarthV @michaelklishin
Thank you for the replies guys. I think, this is best time to reconsider not only manual clustering but also auto clustering. @jjasghar we can use same data structure for both manual and auto clustering.
default[:rabbitmq][:clustering][:use_auto_clustering] = false
default[:rabbitmq][:clustering][:cluster_nodes] =
[
{
:name => 'rabbit@rabbit1',
:type => 'disc'
},
{
:name => 'rabbit@rabbit2',
:type => 'disc'
},
{
:name => 'rabbit@rabbit3',
:type => 'ram'
}
]
Also, @BarthV, I'm against using cluster_name. People may get confused by it. Actually, I was confused myself :-) http://www.rabbitmq.com/man/rabbitmqctl.1.man.html
Let me know if you need some help to write recipe / helpers / LWRP or even to test your work.
@sunggun-yu the provided attributes seem to have everything you need to use auto-clustering.
Cluster name is orthogonal. It is certainly helpful for those who use federation and shovel plugins, or manage multiple clusters.
@BarthV Thank you!!! :+1:
@jjasghar btw, I made some progress with the attribute data that I provided.
I tried some different approaches and have almost decided to go with the one below.
All the data in the [:rabbitmq][:clustering][:cluster_nodes] array will be passed to the provider as a resource attribute (String), and the String will be parsed as a JSON object in the provider: JSON.parse(new_resource.cluster_nodes.gsub('=>',':'))
The provider will find the first node and the cluster node type of the current node from the JSON object to perform the :join and :change_cluster_node_type actions.
if node[:rabbitmq][:cluster]
if node[:rabbitmq][:clustering][:use_auto_clustering]
# Do auto clustering
else # Do Manual clustering
# Join in cluster
rabbitmq_cluster "#{node[:rabbitmq][:clustering][:cluster_nodes]}" do
action :join
end
# Change the cluster node type
rabbitmq_cluster "#{node[:rabbitmq][:clustering][:cluster_nodes]}" do
action :change_cluster_node_type
end
end
end
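The JSON.parse(... gsub('=>', ':')) step described above can be tried outside Chef with plain Ruby. The sketch below uses hypothetical function names and assumes string keys, which is how the stringified attribute array needs to look for the trick to work. With symbol keys (as in the attribute example earlier in the thread), the inspected text would not be valid JSON after the substitution, and a node name or value containing "=>" would be corrupted. JSON.generate is shown as a more robust way to get the same data into a String.

```ruby
require 'json'

# The approach from the comment above: turn Ruby's inspect-style text for an
# array of string-keyed hashes into JSON by swapping the hash rocket.
def parse_stringified_nodes(text)
  JSON.parse(text.gsub('=>', ':'))
end

# Alternative (assumption, not what the cookbook does): serialize explicitly
# so the provider can call JSON.parse without any text substitution.
def serialize_nodes(nodes)
  JSON.generate(nodes)
end
```

Both paths return an array of hashes with string keys, so the provider would look up entries with 'name' and 'type' rather than :name and :type.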
I've tried this way as well. I believe it is a lighter and more efficient approach from an LWRP perspective.
However, it was hard to get the current node name. I tried shell_out('rabbitmqctl eval "node()."').stdout, but it raised an error because it was executed before rabbitmq was installed.
I could use current_node_name = "rabbit@#{node[:hostname]}" to build the node name, but I thought using a static value was not the preferred way.
if node[:rabbitmq][:cluster]
if node[:rabbitmq][:clustering][:use_auto_clustering]
# Do auto clustering
else # Do Manual clustering
# Join in cluster
rabbitmq_cluster 'join_cluster' do
node_name_to_join node[:rabbitmq][:clustering][:cluster_nodes].first[:name]
action :join
end
# Change the cluster node type
# shell_out('rabbitmqctl eval "node()."').stdout.chomp was not work - script has executed before rabbitmq installed
current_node_name = "rabbit@#{node[:hostname]}"
node_type = node[:rabbitmq][:clustering][:cluster_nodes].select { |node| node[:name] == current_node_name }
node_type = node_type.first[:type]
rabbitmq_cluster 'change_cluster_node_type' do
cluster_node_type node_type
action :change_cluster_node_type
end
end
end
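The lookup in the recipe above calls node_type.first[:type], which would fail with a NoMethodError if the current node is not listed in cluster_nodes. A hedged plain-Ruby sketch of the same lookup with that case handled is below. The helper names are hypothetical, and falling back to 'disc' is an assumption (it matches RabbitMQ's default node type, but the cookbook may prefer to fail instead).

```ruby
# Name of the node every other node joins: the first entry of the list.
def first_node_name(cluster_nodes)
  first = cluster_nodes.first
  raise ArgumentError, 'cluster_nodes must list at least one node' unless first

  first[:name]
end

# Type ('disc' or 'ram') configured for the given node name. Falls back to
# the default when the node is not listed, instead of raising on nil.
def node_type_for(cluster_nodes, current_node_name, default = 'disc')
  entry = cluster_nodes.find { |candidate| candidate[:name] == current_node_name }
  entry ? entry[:type] : default
end
```

Using a block variable other than node also avoids shadowing Chef's node object inside the select block, which is legal Ruby but easy to misread.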
also, I just added the feature for set_cluster_name in manual clustering.
requirements :
it was easy to implement since we have cluster_nodes attribute and select first node for join action. http://www.rabbitmq.com/man/rabbitmqctl.1.man.html
latest changes are committed. : https://github.com/sunggun-yu/rabbitmq/tree/feature/cluster
also, you can test with test Vagrant cluster https://github.com/sunggun-yu/vagrant-chef-rabbitmq-cluster
FYI : to test chef-client behavior, I use cd /tmp/vagrant-chef ; chef-solo -c solo.rb -j dna.json on each vagrant vm.
moving on to auto clustering.
I added a set_cluster_name action. It is merged into the join action and will be executed when the node is the first node. However, it would be clearer to have a separate action. Actually, any node in the cluster can set the cluster name in RMQ.
@jjasghar I finished up. and sent pull request https://github.com/jjasghar/rabbitmq/pull/238
I hope it would be helpful.
Vagrant test box for auto clustering : https://github.com/sunggun-yu/vagrant-chef-rabbitmq-cluster/tree/test/auto_clustering
Vagrant test box for manual clustering : https://github.com/sunggun-yu/vagrant-chef-rabbitmq-cluster/tree/test/manual_clustering
This should be resolved via #238, if not please reopen.
Hey there,
I'm using this Chef recipe with Amazon OpsWorks to create a cluster there. Everything seems to work perfectly, except that once the nodes are provisioned and all the configuration is in place, none of the nodes join each other as a cluster.
I have to go to each node and join them manually using rabbitmqctl. The configuration generated by the recipe seems correct, but I still need to do the manual job of joining them together, e.g.
What am I doing wrong? Is this a known issue?