osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

Document how to configure & check HA/durable queues in RabbitMQ #694

Open. berendt opened this issue 11 months ago

chrisschwa commented 11 months ago

RabbitMQ Connections failing critically after Control Nodes Reboot

OSISM version: 6.0.0 (original installation was 5.2.0)


After we rebooted all 3 control nodes (one by one) we had massive issues with the complete OpenStack environment. Nearly every service (Nova, Cinder and Designate in particular, but some others as well) got timeouts and connection issues against all 3 RabbitMQ nodes.

These are the errors we got:

Unexpected exception in API method: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 850aef05e4ce413e8dc3215cd3e8bff5
2023-10-05 08:56:51.983 735 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2023-10-05 08:56:51.983 735 ERROR nova.api.openstack.wsgi   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 441, in get
2023-10-05 08:56:51.983 735 ERROR nova.api.openstack.wsgi     return self._queues[msg_id].get(block=True, timeout=timeout)
2023-10-05 08:56:51.983 735 ERROR nova.api.openstack.wsgi   File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/queue.py", line 322, in get
2023-10-05 08:56:51.983 735 ERROR nova.api.openstack.wsgi     return waiter.wait()
2023-10-05 08:56:51.983 735 ERROR nova.api.openstack.wsgi   File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/queue.py", line 141, in wait

and messages like this:

[d8f8a0b8-3eaf-43d0-875e-1d0eb7d2965e] AMQP server on 10.115.0.21:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer

What did NOT fix the issue

What did we Check

- cluster_status on RabbitMQ was fine
- No general network issues
- Manually trying the connection worked perfectly
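
For context, those checks roughly correspond to something like the following (a sketch only; we assume the kolla container is simply named rabbitmq and run this on each control node):

# Show cluster membership and any detected partitions
docker exec rabbitmq rabbitmqctl cluster_status

# List queues with message and consumer counts to spot queues nobody consumes from
docker exec rabbitmq rabbitmqctl list_queues name messages consumers

# List client connections to confirm services can actually reach this node
docker exec rabbitmq rabbitmqctl list_connections user peer_host state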

What DID fix the Issue

We deleted 2 of the 3 RabbitMQ containers from the control nodes, and after about 2 minutes of waiting all OpenStack services started working again! Then we ran osism apply rabbitmq on the manager to redeploy the RabbitMQ instances we had just deleted.
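
Roughly, the commands looked like this (a sketch; rabbitmq is the kolla default container name and may differ in your setup):

# On two of the three control nodes: force-remove the RabbitMQ container
docker rm -f rabbitmq

# On the manager node: redeploy the RabbitMQ instances that were removed
osism apply rabbitmq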

After that all services worked fine again, and cluster_status was fine as well.

Since then we have not seen any further issues on that topic. (But we have not rebooted again yet.)

maliblatt commented 11 months ago

JFTR: We also had two regions with exactly the problem mentioned above. In our case we "fixed" it by purging RabbitMQ completely from all three controllers and redeploying it afterwards. Then everything seemed to be running fine again. In the meantime some hanging Cinder images were produced, which we had to fix manually. Even after rebooting the whole environment, RabbitMQ now seems to be running stably.

@chrisschwa: just out of interest, what is your setting for 'om_enable_rabbitmq_high_availability'? In our case we set it to "no" in the kolla configuration.
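
As a quick way to see what the deployed cluster actually does with that setting (assuming the kolla container name rabbitmq): with om_enable_rabbitmq_high_availability set to false there should be no ha-* mirroring policy listed, with true kolla should create one.

# On a control node: list the policies applied to the default vhost
docker exec rabbitmq rabbitmqctl list_policies -p /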

chrisschwa commented 11 months ago

Hey @maliblatt, thanks for reaching out. Yes, we have the value set to "false" as well.

Nils98Ar commented 11 months ago

We had the same problems. In my understanding it was because RabbitMQ in kolla was configured to mirror transient queues. See https://www.stackhpc.com/rabbitmq-reliability.html.

This was fixed at least in Zed (maybe earlier), but the upgrade alone was apparently not enough to get rid of the faulty config. We also fixed it by deleting the RabbitMQ containers + volumes, re-deploying RabbitMQ completely and restarting all "RabbitMQ client containers".
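
A sketch of that procedure (container and volume names are the kolla defaults, and the list of client containers is just an example that depends on what is deployed):

# On each control node: remove the RabbitMQ container and its volume
docker rm -f rabbitmq
docker volume rm rabbitmq

# On the manager node: redeploy RabbitMQ from scratch
osism apply rabbitmq

# On each control node: restart the containers that consume from RabbitMQ,
# e.g. the Nova, Cinder, Neutron and Designate services (adjust to your deployment)
docker restart nova_api nova_conductor nova_scheduler cinder_scheduler cinder_volume neutron_server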

Another option would probably be to follow the steps described here, but with om_enable_rabbitmq_high_availability set to false: https://docs.openstack.org/kolla-ansible/latest/reference/message-queues/rabbitmq.html#high-availability.

We are considering enabling RabbitMQ HA (om_enable_rabbitmq_high_availability set to true) once kolla supports quorum queues. Not sure whether this has already been implemented in the meantime.
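
Once quorum queue support is there, whether queues were actually created as quorum queues could be checked per node like this (assuming RabbitMQ >= 3.8 and the kolla container name rabbitmq):

# List each queue with its type (classic or quorum) and durability
docker exec rabbitmq rabbitmqctl list_queues name type durable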