rabbitmq / rabbitmq-autocluster

RabbitMQ peer discovery and cluster formation plugin, supports RabbitMQ 3.6.x
BSD 3-Clause "New" or "Revised" License
241 stars 54 forks source link

Plugin re-activation can fail #57

Open akissa opened 7 years ago

akissa commented 7 years ago

When enabling the plugin on the first node in a cluster the following error is generated.

[root@db3 ~]# rabbitmq-plugins enable autocluster
The following plugins have been enabled:
  rabbitmq_aws
  autocluster

Applying plugin configuration to rabbit@db3.home.topdog-software.com... failed.
Error: {{badmatch,false},
        [{autocluster_periodic,start_delayed,3,
                               [{file,"src/autocluster_periodic.erl"},
                                {line,47}]},
         {autocluster_consul,register,0,
                             [{file,"src/autocluster_consul.erl"},{line,135}]},
         {autocluster,register_with_backend,1,
                      [{file,"src/autocluster.erl"},{line,307}]},
         {autocluster,run_steps,1,[{file,"src/autocluster.erl"},{line,131}]},
         {rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
                            [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
         {rabbit_boot_steps,run_step,2,
                            [{file,"src/rabbit_boot_steps.erl"},{line,49}]},
         {rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,
                            [{file,"src/rabbit_boot_steps.erl"},{line,26}]},
         {rabbit_boot_steps,run_boot_steps,1,
                            [{file,"src/rabbit_boot_steps.erl"},{line,26}]}]}

Investigating this seems to point to a race condition in the logic, The plugin runs through the steps initially and acquires a lock and possibly inserts into ets, then on determining that it is the only node goes through the steps again while still holding the initial lock.

Retaining the initial lock itself could be a problem, after the initial lock is released the process then proceeds to register with consul, after registration the issue then arises when setting up the delayed task.

It seems the initial run may have already created this, thus a false is returned from ets:insert_new whose documentation[1] says a false is returned if keys are already exist.

In this state, if you then disable the plugin, it keeps running and trying to recreate the node in consul. I suspect the delayed task is not removed.

A spiral loop is then entered because consul returns 500 result code when the a request is made to check the state of a service that does not exist.

https://github.com/rabbitmq/rabbitmq-autocluster/blob/stable/src/autocluster_consul.erl#L194 https://github.com/rabbitmq/rabbitmq-autocluster/blob/stable/src/autocluster_consul.erl#L212

The only way to get out of this is to restart the rabbitmq-server process.

I do not have much time at the moment to dig through this so i have created a workaround[2] to stop the enable error until i have time later. If someone else is able to look it to this it would be great.

[1] http://erlang.org/doc/man/ets.html#insert_new-2 [2] https://github.com/akissa/rabbitmq-autocluster/commit/39e0cb4444819fcf5c4d9bc740acc3fa144caddc

michaelklishin commented 7 years ago

Clearing the timer and logging a warning if an ETS row already exists would help. This plugin doesn't really assume it can be disabled but I guess various provisioning tools do it for one reason or another, possibly unintentionally.

akissa commented 7 years ago

@michaelklishin Its actually on initial activation that this issue occurs, my current work flow is as follows.

I have not tested offline enabling of the plugin.

The plugin per say does not fail to activate but the rabbitmq-plugins enable returns the error code 70 instead of 0 which trips up saltstack and causes the state to fail.

michaelklishin commented 6 years ago

Still, it's not really a workflow this plugin assumes. It makes more sense to enable it before starting the node with rabbitmq-plugins --offline enable …. Let's see how often this issue comes up in the future.

Making autocluster_periodic:start_delayed/3 idempotent is trivial and probably a good idea in general but this plugin is on life support, so we'll see when we get to it.