openvstorage / openvstorage-health-check

The health check is classified as a monitoring and detection tool for Open vStorage.

Check if a rabbitmq is not running #485

Open jeroenmaelbrancke opened 6 years ago

jeroenmaelbrancke commented 6 years ago

The RabbitMQ tests in the health check only verify whether there is a partition problem. If a node is not running, the health check does not report it.
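A rough sketch of what such a check could look like, assuming the Erlang-term output that `rabbitmqctl cluster_status` prints on older RabbitMQ 3.x releases. The function name `find_stopped_nodes` and the sample text are hypothetical and not part of the health check; the node names are borrowed from the log in this issue.

```python
import re

# Matches a node name such as 'rabbit@ds1-stor-02' (quoted) or rabbit@host1 (bare).
_NODE_RE = re.compile(r"'([^']+@[^']+)'|([A-Za-z0-9_]+@[A-Za-z0-9_.\-]+)")


def _nodes_in(section):
    """Return the set of node names found in a fragment of the output."""
    return {quoted or bare for quoted, bare in _NODE_RE.findall(section)}


def find_stopped_nodes(cluster_status_output):
    """Return cluster members that are not listed under running_nodes.

    Assumption: the text has the shape
    [{nodes,[{disc,[...]},{ram,[...]}]},{running_nodes,[...]},...]
    """
    members = set()
    # Collect both disc and ram members of the cluster.
    for section in re.findall(r"\{(?:disc|ram),\[(.*?)\]\}", cluster_status_output, re.S):
        members |= _nodes_in(section)

    running_match = re.search(r"\{running_nodes,\[(.*?)\]\}", cluster_status_output, re.S)
    running = _nodes_in(running_match.group(1)) if running_match else set()

    return sorted(members - running)


# Illustrative sample only, not captured from a real cluster.
SAMPLE = """Cluster status of node 'rabbit@ds1-stor-05' ...
[{nodes,[{disc,['rabbit@ds1-stor-02','rabbit@ds1-stor-05']}]},
 {running_nodes,['rabbit@ds1-stor-05']},
 {partitions,[]}]
"""

if __name__ == "__main__":
    for node in find_stopped_nodes(SAMPLE):
        print("RabbitMQ node not running: {0}".format(node))
```

In a real check the text would come from running `rabbitmqctl cluster_status` on the node. Newer RabbitMQ releases print a different, human-readable format, so the parsing above would need to be adapted.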

If the queue of a particular worker is unavailable because its home node is not running, the workers cannot proceed and end up with the following error:

Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]:     nowait=nowait)
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]:   File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 1256, in queue_declare
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]:     (50, 11),  # Channel.queue_declare_ok
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]:   File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 69, in wait
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]:     return self.dispatch_method(method_sig, args, content)
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]:   File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 87, in dispatch_method
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]:     return amqp_method(self, args)
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]:   File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 243, in _close
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]:     reply_code, reply_text, (class_id, method_id), ChannelError,
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: NotFound: Queue.declare: (404) NOT_FOUND - home node 'rabbit@ds1-stor-02' of durable queue 'ovs_1cu3dqsKJReUqgJK' in vhost '/' is down or inaccessible
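Until the health check covers this case, the affected node and queue can be pulled out of the log line itself. A minimal sketch, assuming the message wording shown in the last line above; `parse_home_node_down` is a hypothetical helper name, not an existing function.

```python
import re

# Wording taken from the NotFound line in the traceback above.
HOME_NODE_DOWN_RE = re.compile(
    r"home node '(?P<node>[^']+)' of durable queue '(?P<queue>[^']+)' "
    r"in vhost '(?P<vhost>[^']+)' is down or inaccessible"
)


def parse_home_node_down(line):
    """Return a dict with node, queue and vhost, or None if the line does not match."""
    match = HOME_NODE_DOWN_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = (
        "NotFound: Queue.declare: (404) NOT_FOUND - home node 'rabbit@ds1-stor-02' "
        "of durable queue 'ovs_1cu3dqsKJReUqgJK' in vhost '/' is down or inaccessible"
    )
    print(parse_home_node_down(line))
```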

There is also a bug in RabbitMQ that causes the node to crash when autoheal kicks in: https://github.com/rabbitmq/rabbitmq-server/issues/928