The RabbitMQ tests in the healthcheck only verify whether the cluster is partitioned.
If a node is not running, the healthcheck does not report it.
If the queue of a particular worker is unavailable (for example because its home node is down), the worker cannot proceed and ends up with the following error:
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: nowait=nowait)
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 1256, in queue_declare
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: (50, 11), # Channel.queue_declare_ok
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 69, in wait
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: return self.dispatch_method(method_sig, args, content)
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 87, in dispatch_method
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: return amqp_method(self, args)
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 243, in _close
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: reply_code, reply_text, (class_id, method_id), ChannelError,
Aug 22 11:35:41 ds1-stor-05.ds1 celery[38956]: NotFound: Queue.declare: (404) NOT_FOUND - home node 'rabbit@ds1-stor-02' of durable queue 'ovs_1cu3dqsKJReUqgJK' in vhost '/' is down or inaccessible
There is also a bug in RabbitMQ that causes a node to crash when autoheal kicks in: https://github.com/rabbitmq/rabbitmq-server/issues/928
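
A possible way to close the gap in the healthcheck is to check, for every node that is expected in the cluster, that it is present and running, in addition to the partition test. The sketch below is only an illustration and is not the healthcheck's actual code. It assumes the node list is obtained from the RabbitMQ management plugin's /api/nodes endpoint, whose entries carry "name", "running" and "partitions" fields. The function name find_node_problems, the expected-node list and the sample data are hypothetical.

```python
import json


def find_node_problems(nodes, expected_names):
    """Return a list of problem descriptions for the expected cluster nodes.

    `nodes` is the parsed JSON list from the management API's /api/nodes
    endpoint (assumption: the management plugin is enabled).
    `expected_names` is the list of node names the cluster should contain.
    """
    problems = []
    by_name = {node.get("name"): node for node in nodes}
    for name in expected_names:
        node = by_name.get(name)
        if node is None:
            problems.append("node %s is missing from the cluster" % name)
        elif not node.get("running", False):
            # This is the case the current healthcheck does not report.
            problems.append("node %s is not running" % name)
        elif node.get("partitions"):
            # Existing behaviour: report partitions seen by a running node.
            problems.append("node %s reports partitions: %s"
                            % (name, ", ".join(node["partitions"])))
    return problems


# Illustrative sample only; the node names are borrowed from the log above
# and the cluster layout is assumed.
sample_nodes = json.loads("""
[
  {"name": "rabbit@ds1-stor-02", "running": false},
  {"name": "rabbit@ds1-stor-05", "running": true, "partitions": []}
]
""")
expected = ["rabbit@ds1-stor-02", "rabbit@ds1-stor-05", "rabbit@ds1-stor-06"]

for problem in find_node_problems(sample_nodes, expected):
    print(problem)
```

Keeping the check a pure function over already-parsed data means it can be tested without a live broker. Fetching the node list (URL, port, credentials) is left out on purpose, since those details depend on the deployment.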