Closed madAndroid closed 7 years ago
Managing Consul is out of scope for this plugin. Consul is assumed to be available by the time nodes start (as is generally the case with all other backends). If that's not the case, some would argue that deployment should fail immediately instead of retrying.
That said, this plugin does retry on Consul's 500 responses in some cases, but such responses can occur during normal Consul operation (in our experience), and we are not aware of any other workaround.
@dcorbacho @Gsantomaggio if you feel this is something that makes sense for us to do, feel free to re-open.
@madAndroid you could try increasing the AUTOCLUSTER_DELAY.
Even if it is not a definitive solution, it is worth trying: by default it is a low value (5 seconds).
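For reference, AUTOCLUSTER_DELAY can be set in the node's environment before RabbitMQ starts; a minimal sketch (the 30-second value here is purely illustrative, not a recommendation from this thread):

```shell
# In rabbitmq-env.conf, or exported in the environment of rabbitmq-server.
# Per the comment above, the default is a low value (5 seconds);
# raising it gives Consul more time to become available.
AUTOCLUSTER_DELAY=30
```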
@Gsantomaggio - thank you, that last commit might help with unavailability of the session endpoint :)
@madAndroid if you have a chance please let us know. This autocluster-3.6.11.zip contains the fix.
We're seeing this when bringing up a RabbitMQ cluster on the same nodes that act as Consul servers. When the autocluster plugin tries to create a lock via the session endpoint (to avoid the startup race condition), Consul returns a 500 if the session endpoint is not yet available, and as a result the RabbitMQ server does not start at all.
We've been able to work around this with a script that polls Consul's session endpoint until it is available; our Puppet manifests ensure this script runs before the RabbitMQ server is started.
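That kind of pre-start check can be sketched as a small retry helper. The Consul address, endpoint path, and retry limits below are assumptions for illustration, not details taken from the original script:

```shell
#!/bin/sh
# retry N CMD...: run CMD until it succeeds, at most N times, pausing between attempts.
retry() {
  n=$1; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example (address and attempt count are assumptions): block until Consul's
# session endpoint answers, then let the service manager start RabbitMQ.
# retry 30 curl -fsS "http://127.0.0.1:8500/v1/session/list"
```

Consul's `/v1/session/list` returns 200 (with an empty list if no sessions exist) once the session endpoint is serving, which makes it a reasonable readiness probe here.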
Ideally, we'd expect the autocluster plugin to poll for the availability of the session endpoint, and retry, before creating a session/lock. Without this, the RabbitMQ daemon doesn't start properly, so it feels like the plugin should make sure Consul is completely ready before attempting to create the lock/session.