Closed by matelakat 7 years ago
We looked for clues as to why pacemaker thought that rabbit was not running, but found no evidence. As a follow-up action, we now log the output of the rabbitmq status check. See this issue: https://github.com/sap-oc/crowbar-openstack/issues/35
The rabbit outage can be seen in the pacemaker logs:
Apr 22 10:55:47 [4629] d00-25-b5-a0-00-b9 crmd: info: process_lrm_event: Operation rabbitmq_monitor_10000: not running (node=d00-25-b5-a0-00-b9
Apr 22 10:56:19 [4629] d00-25-b5-a0-00-b9 crmd: info: process_lrm_event: Operation rabbitmq_monitor_10000: ok (node=d00-25-b5-a0-00-b9, call=648, rc=0, cib-update=830, confirmed=false)
That is an outage of 32 seconds.
Closing this issue, as spin-off cards have been created.
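The 32-second figure can be double-checked from the two log timestamps. The snippet below is only a minimal sketch: the helper name `outage_seconds` is made up for illustration, and because syslog-style timestamps carry no year, the `year` argument is an arbitrary placeholder that does not affect a same-day difference.

```python
from datetime import datetime


def outage_seconds(not_running_line, ok_line, year=2000):
    """Return the seconds between two syslog-style pacemaker log lines.

    `year` is a placeholder assumption: the log lines do not contain one.
    """
    def stamp(line):
        # The first three whitespace-separated fields are "Apr 22 10:55:47".
        month, day, clock = line.split()[:3]
        return datetime.strptime(f"{year} {month} {day} {clock}",
                                 "%Y %b %d %H:%M:%S")

    return (stamp(ok_line) - stamp(not_running_line)).total_seconds()


down = "Apr 22 10:55:47 [4629] d00-25-b5-a0-00-b9 crmd: info: process_lrm_event: Operation rabbitmq_monitor_10000: not running"
up = "Apr 22 10:56:19 [4629] d00-25-b5-a0-00-b9 crmd: info: process_lrm_event: Operation rabbitmq_monitor_10000: ok"
print(outage_seconds(down, up))  # 32.0
```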
On 22 Apr (Saturday) a network outage happened. We suspect that neutron-ha-tool moved some of the high-traffic routers to network node 1, which became unstable under the high load. On 24 Apr some routers were moved off network node 1, which seemed to make the landscape stable.
Investigation
Takeaways
- Only some routers are responsible for the traffic. (This is the topic of #9)
- Using the number of routers per agent is not a good balancing strategy, as that would not prevent all the chatty routers from being hosted by a single agent. (Topic of #9)
- Make it clear within the help text of neutron-ha-tool that the HOST parameter of --l3-agent-evacuate refers to a host name, not a host UUID. (Separate issue created for this, #13)
- If we are using a router list by saying --router-list-file and the routers are not found, make sure that the user is notified accordingly. (#18 covers this)
- When migrating routers, ports might not become active within the time available. In this case a stacktrace is printed; that output does not make clear that this is a timeout, and that it might not be an error after all. (#19 covers this item)
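To illustrate the balancing takeaway, here is a minimal sketch of a traffic-aware placement. This is not what neutron-ha-tool does. The function `assign_by_traffic`, the router and agent names, and the traffic numbers are all hypothetical. The sketch uses a simple greedy heuristic: the heaviest routers are placed first, each on the agent that currently carries the least traffic.

```python
import heapq


def assign_by_traffic(router_traffic, agents):
    """Greedily spread routers over agents by traffic, not by router count.

    router_traffic: dict of router name -> traffic figure (hypothetical units)
    agents: list of agent names
    """
    # Min-heap of (accumulated traffic, agent), so the least-loaded agent pops first.
    heap = [(0.0, agent) for agent in agents]
    heapq.heapify(heap)
    placement = {agent: [] for agent in agents}

    # Place the chattiest routers first.
    by_load = sorted(router_traffic.items(), key=lambda kv: kv[1], reverse=True)
    for router, load in by_load:
        total, agent = heapq.heappop(heap)
        placement[agent].append(router)
        heapq.heappush(heap, (total + load, agent))
    return placement


# Two chatty routers and four quiet ones (made-up numbers).
traffic = {"r1": 900, "r2": 800, "r3": 10, "r4": 10, "r5": 10, "r6": 10}
print(assign_by_traffic(traffic, ["net1", "net2"]))
# {'net1': ['r1'], 'net2': ['r2', 'r3', 'r4', 'r5', 'r6']}
```

The router counts end up uneven (1 versus 5), but the two chatty routers land on different agents. A count-based split of 3 and 3 could just as well have put both on the same agent.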