Problems

Problems that exist within the system, that we need to live with because changing them would require major redesign.

neutron-ha-tool at the moment is not really monitoring the connection between routers but rather the agent's reported status. For short outages in the management network, or rabbit outages causing agents to fail to send the heartbeat - the tool might falsely decide to move routers.
not all routers are equal - a small number of routers are responsible for the majority of the load

Goals

make sure that the tool is not causing more trouble than help, it is not amplifying small outages
make sure busy routers are treated with care - they do not land on the same agent
neutron ha tool has to be made aware of the actual environment
- some routers are responsible for the most of the traffic
  - special treatment for busy routers
    - check load on target agent before migrating busy router - on the first iteration it can be as simple as counting the busy routers per agent, and making sure busy routers are evenly distributed.
    - can we know in advance which are the busy routers - can we tag them?
    - maybe having placement preferences for routers (Router 1 should live in agent1 or agent2) explicitly hinting the scheduling of routers

Metrics

The tool has to know the layout of the routers and agents somehow.

How do we know how busy an agent is?
- maybe we should treat an agent busy if it already has one busy router.
How do we know if a router is busy router?
- is it a fixed list?
- what if landscape is re-built?
- can we do some real-time measurements, like if a router has more than X ports, we treat it as a busy one?
- shall we assign some metric to the busyness? Each landscape has one busy router, but the prod landscape is the most important.

Scheduling

When we found that an agent needs to be evacuated, how do we use the above metrics to schedule router migrations?

If we move most busy routers first, we might amplify a small outage, and cause more trouble.

Actions

[x] find out how frequently does the agent report heartbeat
- See report_interval and agent_down_time. report_interval defaults to 30s and agent_down_time defaults to 75s.
[ ] find out a way to associate attributes to routers - we need to associate data to routers that help hinting the scheduling of routers when outage happened. Previously we wanted to use the description field for marking the AZ of the router, but we might want to give some structure to the description.
[x] periodically check for agent liveness while migrating routers - this would prevent short rabbit outages to cause moving a big number of routers
- A separate issue has been created: #12
[ ] make sure busy routers are distributed evenly across agents as possible

matelakat commented 7 years ago

Omitting the --now switch for neutron-ha-tool would mean that a random time period between 30-60 seconds would be spent waiting for the agent to become alive again. This might be enough for small outages.

matelakat commented 7 years ago

If report_interval is 30s and report_interval is 75s, in a worst case scenario, with a 31s outage, we miss two reports, thus the agent will be seen as dead for a short time (90 - 75 = 15) seconds

matelakat commented 7 years ago

Meeting 2017-05-16

How do we identify the high traffic routers? By tenant id. How do we know the tenant ids? A plain text file seems to be the easiest. How to distribute them? When to move, should it have a priority? When agent is down, we should not prioritize these high traffic routers. The only thing we need to take care is that they don't end up on the same network node. What happens if we have more high traffic routers than the number of network nodes? At the moment we have more agents than routers, so this should not be a problem. If we have more routers than agents, we shall send a WARNING that an agent will have more routers. How do we get the list of tenants? Maybe use crowbar for storing this information?

aspiers commented 7 years ago

I guess I need a voice discussion to catch up on this, but a few comments:

I'm a bit wary of us investing too much in trying to fix an HA approach which is fundamentally limited. It might not be significantly harder to backport the upstream l3_ha approach which monitors router health at the correct point, rather than via the API which depends on the health of the MQ.
That said, I talked to @armando-migliaccio in Boston about L3 HA, and during that conversation it occurred to us that even the upstream approach is just as dumb as ours currently, in that it will not reschedule failed routers intelligently onto agents based on how heavy the traffic or other load is on that machine, but instead either randomly, or naively based on the number of routers on the machine.
If we try to solve the "don't migrate routers to busy hosts" problem before we switch to l3_ha, we're at high risk of scope creep to the extent that it could end up stepping on the toes of projects like ceilometer / monasca / aodh.
If we instead switch to l3_ha before trying to solve that problem, I wonder how hard it would be to add a new scheduler which took into account traffic on the node. Of course this would require obtaining useful metrics from somewhere, which would probably mean a driver-based approach for metric collection which could support not only services like ceilometer but also metrics collected locally by non-OpenStack mechanisms like sysstat (sar) or even Mark Seger's collectl.

/cc @rossella in case she wants to comment on any of this :)

armando-migliaccio commented 7 years ago

@aspiers: I was referring to the option allow_automatic_l3agent_failover, introduced nearly three years ago now rather than L3 HA, where no rescheduling takes place when a host goes down.

aspiers commented 7 years ago

@armando-migliaccio OK I guess I misunderstood you then. allow_automatic_l3agent_failover is covered by column B of the spreadsheet I compiled in Atlanta, so I was already aware of the many issues with that approach - they're very similar to the list of issues with our approach in SOC, which is roughly the same except with monitoring and recovery handled via Pacemaker).

But it looks to me that there is still the weakness with the more modern l3_ha approach (columns C--E) that it could reschedule routers from a failed node onto another node which is already running at or near capacity with respect to L3 traffic. That could effectively DoS the second node, which could cause yet another failure. In that sense isn't there the potential for a domino effect causing an avalanche of cascading failures?

armando-migliaccio commented 7 years ago

@aspiers: how is rescheduling taking place in any of the approach C-E? I am not aware of any. That said, I agree that any rescheduling attempt that takes no consideration of load can be problematic.

matelakat commented 7 years ago

@aspiers Given that it seems that allow_automatic_l3agent_failover comes with the same set of issues as this tool, do you still think we shall consider switching to that? I guess of we did that, we would need to patch the code to accomodate the requirement of not placing two busy routers on the same agent.

aspiers commented 7 years ago

@armando-migliaccio This is probably just my lack of experience with Neutron internals showing. I was previously thinking that when an HA router fails, a rescheduling would occur, but now I think more carefully about it, the agent(s) acting as standby for the active router are presumably already scheduled in advance of the failure, so that they form part of the keepalived cluster before anything goes wrong. So yes, I was talking crap to some extent :-) ...

But incorrect terminology aside, doesn't my concern still apply? If agents A and B are scheduled to handle router X, with the router starting off as being active on A, and B as the hot standby, isn't it possible that by the time A fails, B has had other routers scheduled and become active on it, potentially to the point that B's data plane traffic is already near its maximum capacity? In which case the failover of X from A to B could tip B over the edge? Or am I still talking crap?

@matelakat Not sure why we would want to invest effort in replacing our solution with something which is at least if not more broken?

matelakat commented 7 years ago

@aspiers yes, that's what I was thinking as well - so we'll go down the path of adding some intelligence for SAP to support busy router placement.

armando-migliaccio commented 7 years ago

@aspiers: yes it does. If the cloud is hovering over the edge, it's very likely that a failure is gonna tip over the remaining nodes over the edge, but hey...if that happens and the admin has not been diligent in watching his/her monitoring tools, then that might as well be self inflicted ;)

bmaeck commented 7 years ago

Hi, we will stop working on this issue for different reason.

Original problem was not the router placement, but the thrashing and garbage collection of the ARP table
with our new approach to go for DVT situation will change anyway
It's too SAP specific and complicated to implement

sap-oc / cookbook-openstack-network

Neutron-ha-tool improvements #9

Problems

Goals

Metrics

Scheduling

Actions

Meeting 2017-05-16