sap-oc / cookbook-openstack-network

Chef Cookbook - OpenStack Network
http://openstack.org

Neutron-ha-tool improvements #9

Closed. matelakat closed this issue 7 years ago

matelakat commented 7 years ago

Discussion on neutron-ha-tool

Problems

Problems that exist within the system that we have to live with, because changing them would require a major redesign.

Goals

Metrics

The tool has to know the layout of the routers and agents somehow.

Scheduling

When we find that an agent needs to be evacuated, how do we use the above metrics to schedule router migrations?

Actions

matelakat commented 7 years ago

Omitting the --now switch for neutron-ha-tool means that a random period of between 30 and 60 seconds is spent waiting for the agent to become alive again. This might be enough to ride out small outages.
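
A minimal sketch of that waiting behaviour, purely for illustration; the function and parameter names are made up here and this is not the actual neutron-ha-tool code:

```python
import random
import time

def wait_for_agent_recovery(is_agent_alive, min_wait=30, max_wait=60, poll=5):
    """Illustration of the behaviour described above: without --now, a
    random 30-60s window is spent waiting for the agent to come back.

    `is_agent_alive` is any callable returning True once the agent is
    reporting again. Whether the real tool polls or just sleeps once is
    an assumption of this sketch.
    """
    deadline = time.time() + random.uniform(min_wait, max_wait)
    while time.time() < deadline:
        if is_agent_alive():
            return True   # small outage, no migration needed
        time.sleep(poll)
    return False          # still down after the window, go ahead and migrate
```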

matelakat commented 7 years ago

If report_interval is 30s and agent_down_time is 75s, then in the worst case a 31s outage causes two consecutive reports to be missed, so the agent will be seen as dead for a short time (90 - 75 = 15 seconds).
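
Spelling out the arithmetic (report_interval and agent_down_time are the standard neutron.conf settings; the helper below is just a worked example, not part of the tool):

```python
def worst_case_dead_window(report_interval=30, agent_down_time=75, outage=31):
    """Seconds for which the agent is considered dead after a short outage.

    A 31s outage can straddle two report slots, so up to three report
    intervals (90s) pass between successful reports. The agent counts as
    dead once that gap exceeds agent_down_time.
    """
    missed_reports = outage // report_interval + 1     # 31s -> 2 missed reports
    gap = (missed_reports + 1) * report_interval       # 90s between good reports
    return max(0, gap - agent_down_time)               # 90 - 75 = 15s

print(worst_case_dead_window())  # -> 15
```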

matelakat commented 7 years ago

Meeting 2017-05-16

- How do we identify the high-traffic routers? By tenant id.
- How do we know the tenant ids? A plain-text file seems to be the easiest.
- How do we distribute them? When should a router move, and should it have a priority? When an agent is down, we should not prioritize these high-traffic routers; the only thing we need to take care of is that they don't end up on the same network node.
- What happens if we have more high-traffic routers than network nodes? At the moment we have more agents than routers, so this should not be a problem. If we ever have more routers than agents, we shall log a WARNING that an agent will host more than one of them.
- How do we get the list of tenants? Maybe use crowbar for storing this information?
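
A rough sketch of the plain-text approach mentioned above; the file path, format and function names are assumptions for illustration, not an agreed design:

```python
def load_high_traffic_tenants(path="/etc/neutron/high-traffic-tenants.txt"):
    """Read one tenant id per line; blank lines and '#' comments are skipped."""
    with open(path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")}

def warn_if_anti_affinity_impossible(high_traffic_routers, agents, log):
    """The only hard requirement from the meeting is anti-affinity: the
    high-traffic routers must not end up on the same network node. If
    there are more such routers than agents, that cannot be satisfied,
    so emit a WARNING as discussed."""
    if len(high_traffic_routers) > len(agents):
        log.warning("more high-traffic routers (%d) than agents (%d); "
                    "some agent will host more than one",
                    len(high_traffic_routers), len(agents))
```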

aspiers commented 7 years ago

I guess I need a voice discussion to catch up on this, but a few comments:

  1. I'm a bit wary of us investing too much in trying to fix an HA approach which is fundamentally limited. It might not be significantly harder to backport the upstream l3_ha approach which monitors router health at the correct point, rather than via the API which depends on the health of the MQ.

  2. That said, I talked to @armando-migliaccio in Boston about L3 HA, and during that conversation it occurred to us that even the upstream approach is currently just as dumb as ours, in that it will not reschedule failed routers intelligently onto agents based on how heavy the traffic or other load is on each machine, but instead does so either randomly or naively based on the number of routers on the machine.

  3. If we try to solve the "don't migrate routers to busy hosts" problem before we switch to l3_ha, we're at high risk of scope creep to the extent that it could end up stepping on the toes of projects like ceilometer / monasca / aodh.

  4. If we instead switch to l3_ha before trying to solve that problem, I wonder how hard it would be to add a new scheduler which took into account traffic on the node. Of course this would require obtaining useful metrics from somewhere, which would probably mean a driver-based approach for metric collection which could support not only services like ceilometer but also metrics collected locally by non-OpenStack mechanisms like sysstat (sar) or even Mark Seger's collectl.
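
To make point 4 a bit more concrete, here is a minimal sketch of what a driver-based metric interface might look like; the class and function names are hypothetical and this is not an existing Neutron API:

```python
import abc

class MetricDriver(abc.ABC):
    """Pluggable source of per-node load figures for a load-aware scheduler."""

    @abc.abstractmethod
    def node_load(self, hostname):
        """Return a comparable load value, e.g. L3 bytes/s on the node."""

class StaticDriver(MetricDriver):
    """Trivial backend fed from a dict; a real driver might instead wrap
    ceilometer/monasca queries or locally collected sysstat/collectl data."""

    def __init__(self, loads):
        self.loads = loads

    def node_load(self, hostname):
        return self.loads.get(hostname, 0.0)

def pick_least_loaded(agent_hosts, driver):
    """A load-aware scheduler would place the router on the least busy agent."""
    return min(agent_hosts, key=driver.node_load)

# Usage sketch:
driver = StaticDriver({"network1": 9.2e8, "network2": 1.1e8})
print(pick_least_loaded(["network1", "network2"], driver))  # -> network2
```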

/cc @rossella in case she wants to comment on any of this :)

armando-migliaccio commented 7 years ago

@aspiers: I was referring to the option allow_automatic_l3agent_failover (introduced nearly three years ago now), rather than L3 HA, where no rescheduling takes place when a host goes down.

aspiers commented 7 years ago

@armando-migliaccio OK, I guess I misunderstood you then. allow_automatic_l3agent_failover is covered by column B of the spreadsheet I compiled in Atlanta, so I was already aware of the many issues with that approach; they're very similar to the list of issues with our approach in SOC, which is roughly the same except with monitoring and recovery handled via Pacemaker.

But it looks to me like the more modern l3_ha approach (columns C-E) still has the weakness that it could reschedule routers from a failed node onto another node which is already running at or near capacity with respect to L3 traffic. That could effectively DoS the second node, which could cause yet another failure. In that sense, isn't there the potential for a domino effect causing an avalanche of cascading failures?

armando-migliaccio commented 7 years ago

@aspiers: how is rescheduling taking place in any of the approaches C-E? I am not aware of any. That said, I agree that any rescheduling attempt that takes no account of load can be problematic.

matelakat commented 7 years ago

@aspiers Given that allow_automatic_l3agent_failover seems to come with the same set of issues as this tool, do you still think we should consider switching to it? I guess if we did that, we would need to patch the code to accommodate the requirement of not placing two busy routers on the same agent.

aspiers commented 7 years ago

@armando-migliaccio This is probably just my lack of experience with Neutron internals showing. I was previously thinking that when an HA router fails, a rescheduling would occur, but now that I think more carefully about it, the agent(s) acting as standby for the active router are presumably already scheduled in advance of the failure, so that they form part of the keepalived cluster before anything goes wrong. So yes, I was talking crap to some extent :-) ...

But incorrect terminology aside, doesn't my concern still apply? If agents A and B are scheduled to handle router X, with the router starting off as active on A and B as the hot standby, isn't it possible that by the time A fails, B has had other routers scheduled onto it and become active, potentially to the point that B's data plane traffic is already near its maximum capacity? In which case the failover of X from A to B could tip B over the edge. Or am I still talking crap?

@matelakat Not sure why we would want to invest effort in replacing our solution with something which is at least as broken, if not more so?

matelakat commented 7 years ago

@aspiers yes, that's what I was thinking as well - so we'll go down the path of adding some intelligence for SAP to support busy router placement.

armando-migliaccio commented 7 years ago

@aspiers: yes it does. If the cloud is hovering near the edge, it's very likely that a failure is gonna tip the remaining nodes over the edge, but hey... if that happens and the admin has not been diligent in watching his/her monitoring tools, then that might as well be self-inflicted ;)

bmaeck commented 7 years ago

Hi, we will stop working on this issue for different reasons.