Closed — vuntz closed this 6 years ago
I forgot to mention: I did the branch against master, but I don't anticipate issues in backporting that.
I think the code is good now; we however need to ship the new OCF resource agent, and fix some issues there:
Submitted to master with https://github.com/crowbar/crowbar-openstack/pull/887
All resource agents issues mentioned earlier are addressed now.
backporting it to our branch...
Backport here: sap-oc/crowbar-openstack/pull/41
Needs to be tested both manually and in the lab.
@SebastianBiedler - we would need to discuss this
When re-loading the proposal with `ssh crowbar crowbar batch build < scenario.yaml`, I get this:
```
[08:58:24] rabbitmq barclamp, 'default' proposal:
    Created
    Failed to edit: default : Errors in data {"error":"Failed to validate proposal:
    No device specified for shared storage.\nNo filesystem type specified for
    shared storage.\n"}
Full output of error is in /tmp/crowbar_autobuild-err-20170615-14373-15fd0fg.html
```
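For reference, the validation seems to want the shared-storage attributes filled in in the scenario YAML. A sketch of what that could look like — the attribute names here are a guess and must be checked against the barclamp's actual schema:

```yaml
# Hypothetical rabbitmq barclamp attributes -- names are illustrative,
# check the barclamp schema for the real keys.
attributes:
  rabbitmq:
    ha:
      storage:
        mode: shared
        shared:
          device: /dev/disk/by-id/example-device  # block device for shared storage
          fstype: xfs                             # filesystem type on that device
```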
Spoke with @SebastianBiedler on this topic. He suggests making sure `maxconn` is high enough on HAProxy.

According to the OpenStack HA Guide, HAProxy should not be used in front of RabbitMQ because of TCP timeout failures. However, the blog post by John Eckersberg that the HA Guide references (http://john.eckersberg.com/improving-ha-failures-with-tcp-timeouts.html) says that AMQP heartbeats should avoid this problem, and also mentions a patch for HAProxy. The blog post is from March 2015, when AMQP heartbeats were not yet implemented; they have been a stable feature of oslo.messaging since the Liberty release. So it should be possible to use HAProxy to build a load-balanced HA solution. Of course, tuning the TCP settings in HAProxy and at the operating system level is important for a stable solution, but that should be determined in a real test scenario that mirrors the production workload and system.

If we instead use the solution with the rabbitmq-hosts parameter, we get an active/active hot-standby HA solution, which is better than the current active/passive solution, but the benefit is much lower compared to the load-balanced solution. If we have to go with rabbitmq-hosts, it would be good to increase its benefit by spreading the load across the different OpenStack components: give each service's config a different first host in the rabbitmq-hosts list. In particular nova and neutron, as the two heaviest users, should talk to different RabbitMQ hosts. That would be another way to increase the benefit and spread the load in the system.
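For context, the oslo.messaging settings involved look roughly like this; the per-service host ordering is the knob being proposed, and the values are illustrative only:

```ini
# e.g. nova.conf on the controllers -- neutron.conf would list a
# different cluster member first, so the load is spread.
[oslo_messaging_rabbit]
rabbit_hosts = rabbit1:5672,rabbit2:5672,rabbit3:5672
# AMQP heartbeats (stable in oslo.messaging since Liberty);
# 0 disables them, so keep a non-zero threshold.
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
```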
On the topic of not replicating all queues to all nodes:
Setting `exactly` instead of `all` for `ha-mode` can be used to limit the number of mirrors:
> **To How Many Nodes to Mirror?**
>
> Note that mirroring to all nodes is the most conservative option and is unnecessary in many cases. For clusters of 3 and more nodes it is recommended to mirror to a quorum (the majority) of nodes, e.g. 2 nodes in a 3 node cluster or 3 nodes in a 5 node cluster. Since some data can be inherently transient or very time sensitive, it can be perfectly reasonable to use a lower number of mirrors for some queues (or even not use any mirroring).
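As a sketch, such a policy could be set with `rabbitmqctl set_policy`; the policy name and the catch-all pattern below are illustrative:

```shell
# Mirror each matching queue to exactly 2 nodes (a quorum in a
# 3-node cluster) instead of to all nodes.
rabbitmqctl set_policy ha-two "^" \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}'
```

In a crowbar deployment this would presumably be driven from the barclamp rather than run by hand, so the queue pattern and mirror count would become proposal attributes.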
Just for completeness: this was done.
I started a branch for this, as it turned out to be easier than I expected: https://github.com/vuntz/crowbar-openstack/tree/rabbitmq-cluster
There are still a couple of issues:
Basic testing seems to be positive, but I'd definitely welcome some help there.
Now, the concern I have is that switching to RabbitMQ's clustering mode will require a restart of nearly all OpenStack services (APIs, agents, and so on), so this will create downtime... I'm unclear about the concrete impact it would have on a setup with 50+ nodes.