rabbitmq / rabbitmq-autocluster

RabbitMQ peer discovery and cluster formation plugin, supports RabbitMQ 3.6.x
BSD 3-Clause "New" or "Revised" License

Autocluster attempts to create a session in Consul before session endpoint ready #42

Closed. madAndroid closed this issue 7 years ago

madAndroid commented 7 years ago

We're seeing this when bringing up a RabbitMQ cluster on the same nodes that act as Consul servers. To overcome the race condition on startup, the autocluster plugin connects to Consul's session endpoint to create a lock. If the session endpoint is not yet available, Consul returns a 500, and as a result the RabbitMQ server does not start at all.

We've been able to work around this with a script that polls the session endpoint in Consul until it's available. Our Puppet manifests ensure this script runs before the RabbitMQ server is started.
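A polling workaround of this kind might look roughly like the sketch below. It is not the script used in this report. The URL (a local Consul agent on the default port 8500), the choice of `/v1/session/list` as a read-only probe, and the function names, attempt count, and interval are all assumptions made for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request

# Assumption: a local Consul agent on its default HTTP port. /v1/session/list
# is used only as a cheap, read-only probe of the session endpoint.
SESSION_URL = "http://127.0.0.1:8500/v1/session/list"


def session_endpoint_ready(url=SESSION_URL, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers connection refused, timeouts, and HTTP errors such as 500.
        return False


def wait_until_ready(probe=session_endpoint_ready, attempts=30,
                     interval=2.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping between failed tries.

    Returns True as soon as probe() succeeds, or False once all attempts
    are used up. `probe` and `sleep` are injectable to ease testing.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(interval)
    return False
```

A wrapper would call `wait_until_ready()` and exit non-zero when it returns `False`, so that the configuration management step ordering the RabbitMQ service start can fail fast instead of starting the node against an unready Consul.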

Ideally, we'd expect the autocluster plugin to poll for the availability of the session endpoint before creating a session/lock, and to retry on failure. Without this, the RabbitMQ daemon doesn't start properly, so it feels like the plugin itself should make sure Consul is completely ready before attempting to create the lock/session.

michaelklishin commented 7 years ago

Managing Consul is out of scope for this plugin. Consul is assumed to be available by the time nodes start (this is generally the case with all other backends). If that's not the case, some would argue that deployment should fail immediately instead of retrying.

That said, this plugin does retry on Consul's 500 responses in some cases, since in our experience they can happen during normal Consul operations and there is no other workaround that we are aware of.

michaelklishin commented 7 years ago

@dcorbacho @Gsantomaggio if you feel this is something that makes sense for us to do, feel free to re-open.

Gsantomaggio commented 7 years ago

@madAndroid you could try increasing AUTOCLUSTER_DELAY. Even if it is not a definitive solution, it is worth trying.

By default it is set to a low value (5 seconds).

madAndroid commented 7 years ago

@Gsantomaggio - thank you, that last commit might help with unavailability of the session endpoint :)

Gsantomaggio commented 7 years ago

@madAndroid if you have a chance please let us know. This autocluster-3.6.11.zip contains the fix.