pythian / opsviz

25 stars 24 forks source link

Restarting rabbitmq1 node results in failed setup on 'set_policy ha-all' #7

Open dtest opened 9 years ago

dtest commented 9 years ago

I had stopped all instances in my opswork stack and upon restarting the rabbitmq1 instance, received the following error:

[2014-12-12T14:36:27+00:00] INFO: Enabling RabbitMQ plugin 'rabbitmq_management'.
[2014-12-12T14:36:27+00:00] INFO: rabbitmq_plugin[rabbitmq_management] not queuing delayed action restart on service[rabbitmq-server] (delayed), as it's already been queued
[2014-12-12T14:36:27+00:00] INFO: Processing execute[rabbitmq-plugins enable rabbitmq_management] action run (/var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/plugin.rb line 39)
[2014-12-12T14:36:28+00:00] INFO: execute[rabbitmq-plugins enable rabbitmq_management] ran successfully
[2014-12-12T14:36:28+00:00] INFO: Processing rabbitmq_plugin[rabbitmq_management_visualiser] action enable (rabbitmq::mgmt_console line 27)
[2014-12-12T14:36:28+00:00] INFO: Enabling RabbitMQ plugin 'rabbitmq_management_visualiser'.
[2014-12-12T14:36:28+00:00] INFO: rabbitmq_plugin[rabbitmq_management_visualiser] not queuing delayed action restart on service[rabbitmq-server] (delayed), as it's already been queued
[2014-12-12T14:36:28+00:00] INFO: Processing execute[rabbitmq-plugins enable rabbitmq_management_visualiser] action run (/var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/plugin.rb line 39)
[2014-12-12T14:36:29+00:00] INFO: execute[rabbitmq-plugins enable rabbitmq_management_visualiser] ran successfully
[2014-12-12T14:36:29+00:00] INFO: Processing execute[chown -R rabbitmq:rabbitmq /var/lib/rabbitmq] action run (rabbitmq_cluster::default line 10)
[2014-12-12T14:36:29+00:00] INFO: execute[chown -R rabbitmq:rabbitmq /var/lib/rabbitmq] ran successfully
[2014-12-12T14:36:29+00:00] INFO: Processing rabbitmq_user[guest] action delete (rabbitmq_cluster::default line 12)
[2014-12-12T14:36:29+00:00] INFO: Processing rabbitmq_policy[ha-all] action set (rabbitmq_cluster::default line 16)
[2014-12-12T14:36:29+00:00] INFO: Done setting RabbitMQ policy 'ha-all'.
[2014-12-12T14:36:29+00:00] INFO: Processing execute[set_policy ha-all] action run (/var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/policy.rb line 66)

================================================================================
Error executing action `run` on resource 'execute[set_policy ha-all]'
================================================================================

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '2'
---- Begin output of rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 ----
STDOUT: Setting policy "ha-all" for pattern "^(?!amq\\.).*" to "{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\"}" with priority "1" ...
STDERR: Error: unable to connect to node rabbit@rabbitmq1: nodedown

DIAGNOSTICS
===========

attempted to contact: [rabbit@rabbitmq1]

rabbit@rabbitmq1:
* connected to epmd (port 4369) on rabbitmq1
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq1
* suggestion: start the node

current node details:
- node name: rabbitmqctl11602@rabbitmq1
- home dir: /var/lib/rabbitmq
- cookie hash: FUWzw5ayMo2aD4GJFavYFA==
---- End output of rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 ----
Ran rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 returned 2

Resource Declaration:
---------------------
# In /var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/policy.rb

66:     execute "set_policy #{new_resource.policy}" do
67:       command cmd
68:     end
69: 

Compiled Resource:
------------------
# Declared in /var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/policy.rb:66:in `block in class_from_file'

execute("set_policy ha-all") do
action "run"
retries 0
retry_delay 2
command "rabbitmqctl set_policy ha-all \"^(?!amq\\.).*\" '{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\"}' --priority 1"
backup 5
returns 0
cookbook_name "rabbitmq_cluster"
end

[2014-12-12T14:36:29+00:00] INFO: Running queued delayed notifications before re-raising exception
[2014-12-12T14:36:29+00:00] INFO: template[/etc/rabbitmq/rabbitmq-env.conf] sending restart action to service[rabbitmq-server] (delayed)
[2014-12-12T14:36:29+00:00] INFO: Processing service[rabbitmq-server] action restart (rabbitmq::default line 79)
[2014-12-12T14:36:33+00:00] INFO: service[rabbitmq-server] restarted
[2014-12-12T14:36:33+00:00] ERROR: Running exception handlers
[2014-12-12T14:36:33+00:00] ERROR: Exception handlers complete
[2014-12-12T14:36:33+00:00] FATAL: Stacktrace dumped to /var/lib/aws/opsworks/cache.stage2/chef-stacktrace.out
[2014-12-12T14:36:33+00:00] ERROR: execute[set_policy ha-all] (/var/lib/aws/opsworks/cache.stage2/cookbooks/rabbitmq/providers/policy.rb line 66) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '2'
---- Begin output of rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 ----
STDOUT: Setting policy "ha-all" for pattern "^(?!amq\\.).*" to "{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\"}" with priority "1" ...
STDERR: Error: unable to connect to node rabbit@rabbitmq1: nodedown

DIAGNOSTICS
===========

attempted to contact: [rabbit@rabbitmq1]

rabbit@rabbitmq1:
* connected to epmd (port 4369) on rabbitmq1
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq1
* suggestion: start the node

current node details:
- node name: rabbitmqctl11602@rabbitmq1
- home dir: /var/lib/rabbitmq
- cookie hash: FUWzw5ayMo2aD4GJFavYFA==
---- End output of rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 ----
Ran rabbitmqctl set_policy ha-all "^(?!amq\.).*" '{"ha-mode":"all","ha-sync-mode":"automatic"}' --priority 1 returned 2
[2014-12-12T14:36:33+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)

I am assuming it's a timing issue, but not 100% sure.

dtest commented 9 years ago

Simply executing 'setup' on the instance again worked fine.

jonathandietz commented 9 years ago

I think its safe to assume that the stack should be started in the appropriate order and if dependencies are not available, services won't start.

Boot order should be handled by the Cloudformation script at the start, but if you turn off all of the servers, you should know which order to bring them back online. Labeling as "Won't Fix". Let me know if you think it still should be.