Open akissa opened 7 years ago
Clearing the timer and logging a warning if an ETS row already exists would help. This plugin doesn't really assume it can be disabled but I guess various provisioning tools do it for one reason or another, possibly unintentionally.
@michaelklishin Its actually on initial activation that this issue occurs, my current work flow is as follows.
I have not tested offline enabling of the plugin.
The plugin per say does not fail to activate
but the rabbitmq-plugins enable
returns the error code 70
instead of 0
which trips up saltstack
and causes the state to fail.
Still, it's not really a workflow this plugin assumes. It makes more sense to enable it before starting the node with rabbitmq-plugins --offline enable …
. Let's see how often this issue comes up in the future.
Making autocluster_periodic:start_delayed/3
idempotent is trivial and probably a good idea in general but this plugin is on life support, so we'll see when we get to it.
When enabling the plugin on the first node in a cluster the following error is generated.
Investigating this seems to point to a race condition in the logic, The plugin runs through the steps initially and acquires a lock and possibly inserts into
ets
, then on determining that it is the only node goes through the steps again while still holding the initial lock.Retaining the initial lock itself could be a problem, after the initial lock is released the process then proceeds to register with
consul
, after registration the issue then arises when setting up the delayed task.It seems the initial run may have already created this, thus a
false
is returned fromets:insert_new
whose documentation[1] says afalse
is returned if keys are already exist.In this state, if you then disable the plugin, it keeps running and trying to recreate the node in
consul
. I suspect the delayed task is not removed.A spiral loop is then entered because
consul
returns500
result code when the a request is made to check the state of a service that does not exist.https://github.com/rabbitmq/rabbitmq-autocluster/blob/stable/src/autocluster_consul.erl#L194 https://github.com/rabbitmq/rabbitmq-autocluster/blob/stable/src/autocluster_consul.erl#L212
The only way to get out of this is to restart the
rabbitmq-server
process.I do not have much time at the moment to dig through this so i have created a workaround[2] to stop the enable error until i have time later. If someone else is able to look it to this it would be great.
[1] http://erlang.org/doc/man/ets.html#insert_new-2 [2] https://github.com/akissa/rabbitmq-autocluster/commit/39e0cb4444819fcf5c4d9bc740acc3fa144caddc