Open berendt opened 11 months ago
JFTR: We also had two Regions with exactly the problem mentioned above. In our case we "fixed" it by purging the rabbitmq completely from all three controllers and redeployed it afterwards. Then everything seems to be running fine again. In the meantime there were produces some hanging cinder images which we had to fix manually. Even after rebooting the whole environment the rabbit seems to be running stable now.
@chrisschwa: just for interest, what is your setting for 'om_enable_rabbitmq_high_availability'? In our case we are setting to "no" in the kolla configuration.
Hey @maliblatt, Thanks for hitting me up. Yes we have the value on "false" aswell.
We had the same problems. In my understanding it was due to rabbitmq in kolla was configured to mirror transient queues. See https://www.stackhpc.com/rabbitmq-reliability.html.
This was fixed in at least in zed (maybe before) but the upgrade alone was not enough to get rid of the faulty config apparently. We also fixed it by deleting rabbitmq containers + volumes, re-deploying rabbitmq completely and restarting all „rabbitmq client containers“.
Another option would be probably doing the steps described here but setting om_enable_rabbitmq_high_availability to false: https://docs.openstack.org/kolla-ansible/latest/reference/message-queues/rabbitmq.html#high-availability.
We consider enabling enabling rabbitmq ha (om_enable_rabbitmq_high_availability to true) when kolla supports quorum queues. Not sure if this has been already implemented in the meantime.
RabbitMQ Connections failing critically after Control Nodes Reboot
OISM Version: 6.0.0 Original Installation was 5.2.0
After we rebooted all 3 Control Nodes (One by One) we had massive Issues with the complete Openstack Environment. Nearly every Subservice (Nova, Cinder, Designate in particular, but some others aswell) got Timeouts and Connection Issues on all 3 RabbitMQs.
Following Errors we got
and some things like this
What did NOT fix the issue
What did we Check
Cluster_Status on RabbitMQ was Fine No general Network issues Manually trying the connection worked perfectly
What DID fix the Issue
We deleted 2 of the 3 RabbitMQ Containers from the Control Nodes, and after about 2 Minutes of wait time, all Openstack Services started working again! Then we did a new
osism apply rabbitmq
on the Manager to redeploy the RabbitMQs we just deleted.After that all Services worked fine again, and cluster_status was fine again aswell.
Since then we did not see any further issues about that topic. (But we did not reboot yet again)