Open killermoehre opened 1 month ago
Are quorum queues in use on the cluster? If not, the recommendation is to switch RabbitMQ to quorum queues and adjust the configuration of the OpenStack services accordingly. This should prevent problems when a single control node is started.
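One way to answer the first question is to list the type of every queue, for example with `rabbitmqctl list_queues name type` run inside the rabbitmq container. The following is a minimal Python sketch, not part of OSISM, that filters such a listing for queues that are not of type `quorum`. It assumes tab-separated `name` and `type` columns; the helper name `non_quorum_queues` and the queue names in the sample are made up for illustration.

```python
from __future__ import annotations


def non_quorum_queues(listing: str) -> list[str]:
    """Return the names of queues whose type is not 'quorum'.

    `listing` is assumed to be the text printed by
    `rabbitmqctl list_queues name type` (tab-separated columns).
    """
    result = []
    for line in listing.splitlines():
        fields = line.strip().split("\t")
        if len(fields) != 2:
            continue  # skip informational lines such as "Listing queues ..."
        name, queue_type = fields
        if (name, queue_type) == ("name", "type"):
            continue  # skip the column header row
        if queue_type != "quorum":
            result.append(name)
    return result


# Illustrative sample only; these queue names are hypothetical.
SAMPLE = (
    "Listing queues for vhost / ...\n"
    "name\ttype\n"
    "example_durable_queue\tquorum\n"
    "example_other_queue\tclassic\n"
)

if __name__ == "__main__":
    for queue in non_quorum_queues(SAMPLE):
        print(queue)
```

A non-empty result does not by itself indicate a misconfiguration, since transient queues may not be declared as quorum queues, depending on the messaging settings of the services.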
openstack-health-middleware is intended to provide more meaningful health checks for the Kolla containers in the future.
Related to https://github.com/osism/defaults/pull/235
Add the following parameters in environments/kolla/configuration.yml
(can be removed again after release of OSISM 8.1.0):
om_enable_rabbitmq_high_availability: false
om_enable_rabbitmq_quorum_queues: true
Migration (documentation will be added soon):
Kolla start action: https://review.opendev.org/c/openstack/kolla-ansible/+/932311
OSISM release version
7.1.2
What's the problem?
Today we had an outage at a customer site, caused by services being unable to communicate via rabbitmq.
We had to reboot a control node to clear some mount issues. As expected, this caused some services to lose their connection to one of the rabbitmq containers.
But even after the connection timeout, the services were unable to send or receive messages via the remaining two nodes. Communication only became possible again after we restarted all rabbitmq containers and purged all queues.
I saw https://github.com/osism/openstack-health-middleware/, which would probably provide some monitoring, but the question is whether it is already available.
References to existing reports
References to existing bug reports, mailing lists, ...
Severity
high
Urgency
high