matelakat commented 7 years ago

On 22 Apr (Saturday) a network outage happened. We suspect neutron-ha-tool moved some of the high-traffic routers to network node 1, which became unstable under the high load. On 24 Apr some routers were moved off network node 1, and that seemed to make the landscape stable.

Investigation

[x] Find out what triggered the move of the router on 22 Apr - was it neutron-ha-tool?
The control cluster restarted rabbitmq on 22 Apr at 10:55 UTC. At that point all services lost their connection to the messaging service, the neutron l3-agents among them. After rabbitmq restarted successfully, the l3-agents reconnected. At that moment neutron-ha-tool checked the status of the l3-agents, but not all of them were fully reconnected and online again, so neutron-ha-tool triggered the migration of 121 routers.
[x] Find out what caused rabbit downtime
No root cause found. See the comment below on the actions taken.
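
One way to avoid reacting to a short messaging blip like this would be to require that an agent is reported dead in several consecutive checks before any router is moved. The following is only a minimal sketch of that idea, not what neutron-ha-tool does today: `agents_confirmed_dead`, `poll_dead_agents`, `checks` and `interval` are hypothetical names introduced for illustration.

```python
import time


def agents_confirmed_dead(poll_dead_agents, checks=3, interval=10.0,
                          sleep=time.sleep):
    """Return the agents reported dead in every one of `checks` polls.

    `poll_dead_agents` is a hypothetical callable returning an iterable of
    agent ids that currently look dead. An agent that shows up as alive in
    any poll is dropped, so agents that are merely reconnecting after a
    short messaging outage are not treated as dead.
    """
    confirmed = None
    for attempt in range(checks):
        dead = set(poll_dead_agents())
        # Keep only the agents that were dead in all polls so far.
        confirmed = dead if confirmed is None else confirmed & dead
        if not confirmed:
            # Nothing left to confirm, no need to keep polling.
            return set()
        if attempt < checks - 1:
            sleep(interval)
    return confirmed
```

With something like this, routers would only be migrated away from the agents in the returned set, at the cost of a slower reaction to a real agent failure.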

Takeaways

~~Only some routers are responsible for the traffic~~ (This is the topic of #9)
~~Using the number of routers per agent is not a good balancing strategy as that would not prevent all the chatty routers to be hosted by a single agent~~ (topic of #9)
~~Make it clear within the help text of neutron-ha-tool that the HOST parameter of --l3-agent-evacuate refers to a host name, not a host UUID.~~ (separate issue created for this, #13 )
~~If we are using a router list by saying --router-list-file and the routers are not found, make sure that the user is notified accordingly.~~ (#18 Covers this)

~~When migrating routers, ports might not become active within the time available. In this case the following stacktrace will be printed:~~ (#19 Covers this item)

2017-04-24 09:28:05,093 neutron-ha-tool ERROR    Failed to migrate router=61aea97d-4711-4fb3-8fa8-43b9fa4503d9 from agent=6443bcf9-18ae-456a-b340-fcc565c0cd67 to agent=b3b9971c-24c2-4b80-91b5-4d103c261a81
Traceback (most recent call last):
File "/usr/bin/neutron-ha-tool", line 627, in migrate_router_safely
wait_for_router, delete_namespace)
File "/usr/bin/neutron-ha-tool", line 671, in migrate_router
wait_router_migrated(qclient, router_id, target['host'])
File "/usr/bin/neutron-ha-tool", line 732, in wait_router_migrated
(router_id, ", ".join(remaining_ports)))
RuntimeError: Some ports are not ACTIVE on router_id=61aea97d-4711-4fb3-8fa8-43b9fa4503d9: [8820e14e-b362-4883-8e36-5b09b2e6f112]
2017-04-24 09:28:05,094 neutron-ha-tool INFO     0 routers were evacuated from L3 agent d00-25-b5-a0-03-63
2017-04-24 09:28:05,094 neutron-ha-tool ERROR    1 errors encountered during evacuation

This output does not make it clear that this is a timeout, and that it might not be an error after all.
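
As a rough illustration of how the timeout case could be told apart from a real failure, the wait loop could raise a dedicated exception with a message that says it is a timeout. This is a sketch under assumptions, not the actual code of `wait_router_migrated`: `RouterMigrationTimeout`, `wait_ports_active` and `get_inactive_ports` are hypothetical names, and the default `timeout` and `interval` values are made up.

```python
import time


class RouterMigrationTimeout(RuntimeError):
    """Ports were still not ACTIVE when the wait budget ran out."""


def wait_ports_active(get_inactive_ports, router_id, timeout=60.0,
                      interval=2.0, clock=time.monotonic, sleep=time.sleep):
    """Wait until the router has no inactive ports, or raise on timeout.

    `get_inactive_ports` is a hypothetical callable that takes a router id
    and returns the ids of its ports that are not ACTIVE yet.
    """
    deadline = clock() + timeout
    while True:
        remaining = list(get_inactive_ports(router_id))
        if not remaining:
            return
        if clock() >= deadline:
            # Say explicitly that this is a timeout, not a hard failure.
            raise RouterMigrationTimeout(
                "Timed out after %ss waiting for ports to become ACTIVE "
                "on router_id=%s: [%s]"
                % (timeout, router_id, ", ".join(remaining)))
        sleep(interval)
```

A caller could then catch `RouterMigrationTimeout` separately and, for example, log it as a warning rather than counting it as an evacuation error.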

matelakat commented 7 years ago

After looking for clues as to what made pacemaker think that rabbit was not running, we found no evidence. The action we took is to log the output of the rabbitmq status check. See this issue: https://github.com/sap-oc/crowbar-openstack/issues/35

matelakat commented 7 years ago

The rabbit outage can be seen in the pacemaker logs:

Apr 22 10:55:47 [4629] d00-25-b5-a0-00-b9       crmd:     info: process_lrm_event:      Operation rabbitmq_monitor_10000: not running (node=d00-25-b5-a0-00-b9
Apr 22 10:56:19 [4629] d00-25-b5-a0-00-b9       crmd:     info: process_lrm_event:      Operation rabbitmq_monitor_10000: ok (node=d00-25-b5-a0-00-b9, call=648, rc=0, cib-update=830, confirmed=false)

That is an outage of 32 seconds.
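
For reference, the duration can be recomputed from the two timestamps above. This is just a small sketch: `seconds_between` is a hypothetical helper, and the year 2017 is an assumption added because the pacemaker log lines do not carry one.

```python
from datetime import datetime


def seconds_between(start, end, year=2017, fmt="%Y %b %d %H:%M:%S"):
    """Seconds between two syslog-style timestamps such as 'Apr 22 10:55:47'.

    The year is not part of the log lines, so it is passed in separately
    (assumed to be 2017 here).
    """
    t0 = datetime.strptime("%d %s" % (year, start), fmt)
    t1 = datetime.strptime("%d %s" % (year, end), fmt)
    return (t1 - t0).total_seconds()


outage = seconds_between("Apr 22 10:55:47", "Apr 22 10:56:19")
print(outage)  # 32.0
```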

matelakat commented 7 years ago

Closing this as spin-off cards have been created.

sap-oc / cookbook-openstack-network

[placeholder] network issues on 22 Apr #8
