[bug] RabbitMQ communication issues

killermoehre commented 1 month ago

OSISM release version

7.1.2

What's the problem?

Today we had an outage at a customer site, caused by services being unable to communicate via RabbitMQ.

We had to reboot a control node to clear some mount issues. As expected, this caused some services to lose their connection to one of the RabbitMQ containers.

However, even after the connection timeout, the services were unable to send or receive messages via the remaining two nodes. Communication only became possible again after we restarted all RabbitMQ containers and purged all queues.

I saw https://github.com/osism/openstack-health-middleware/, which would probably provide some monitoring. Is it already available?

References to existing reports

References to existing bug reports, mailing lists, ...

Severity

high

Urgency

high

berendt commented 1 month ago

Are quorum queues in use on the cluster? If not, the recommendation is to switch RabbitMQ to quorum queues and adjust the configuration of the OpenStack services accordingly. This should prevent problems when a single control node is restarted.
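
One way to check whether quorum queues are in use is to list the queues together with their type and count them per type. The following is a minimal sketch, not an official OSISM procedure: the helper name count_queue_types is hypothetical, and the container name rabbitmq in the usage comment is an assumption about the Kolla deployment.

```shell
#!/bin/sh
# Hypothetical helper: reads tab-separated "name<TAB>type" lines on stdin
# (as printed by rabbitmqctl list_queues name type) and prints one line
# per queue type with the number of queues of that type.
count_queue_types() {
    awk -F'\t' 'NF >= 2 { count[$NF]++ }
                END { for (t in count) print t, count[t] }' | sort
}

# Usage on a control node (container name "rabbitmq" is an assumption):
#   docker exec rabbitmq rabbitmqctl list_queues --quiet --no-table-headers name type \
#       | count_queue_types
```

If the output only shows a classic line, the cluster has not been switched to quorum queues yet.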

openstack-health-middleware is intended to provide more meaningful health checks for the Kolla containers in the future.

berendt commented 1 month ago

Add the following parameters to environments/kolla/configuration.yml (they can be removed again after the release of OSISM 8.1.0):

om_enable_rabbitmq_high_availability: false
om_enable_rabbitmq_quorum_queues: true

Migration (documentation will be added soon):

osism apply -a stop X (for all deployed OpenStack services)
osism apply -a config X (for all deployed OpenStack services)
osism apply -a reconfigure rabbitmq
osism apply rabbitmq-reset-state
osism apply -a deploy X (for all deployed OpenStack services) (there will be a start action for OSISM >= 8.1.0)
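
The migration sequence above could be wrapped in a small script so that the per-service steps are not typed by hand. This is only a sketch under stated assumptions: the service list is a hypothetical example and must be replaced with the services actually deployed, the reset step is written as osism apply rabbitmq-reset-state (an assumption about the intended spelling), and the script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# SERVICES is a hypothetical example list; replace it with the OpenStack
# services that are actually deployed in the environment.
SERVICES="${SERVICES:-keystone glance nova neutron}"
# DRY_RUN=1 (the default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# Print or execute a single command, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Runs the steps in the order given in the thread; stops at the first failure.
migrate_to_quorum_queues() {
    for service in $SERVICES; do
        run osism apply -a stop "$service" || return 1
    done
    for service in $SERVICES; do
        run osism apply -a config "$service" || return 1
    done
    run osism apply -a reconfigure rabbitmq || return 1
    # Assumed spelling of the reset step.
    run osism apply rabbitmq-reset-state || return 1
    for service in $SERVICES; do
        run osism apply -a deploy "$service" || return 1
    done
}

migrate_to_quorum_queues
```

Run it once with the default DRY_RUN=1 to review the printed commands, then set DRY_RUN=0 to execute them.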

berendt commented 1 month ago

Kolla start action: https://review.opendev.org/c/openstack/kolla-ansible/+/932311
