
nova-network: 4.1.x, x>2: gateway_external_network_id absent in template, breaks L3 with multiple external nets #933

Open cbuben opened 10 years ago

cbuben commented 10 years ago

/cc @Apsu

https://github.com/rcbops-cookbooks/nova-network/commit/f37dc6e037450a1c6dd0d20e05f99185efa6124f

Is there a reason the router_id and gateway_external_network_id parameters were removed? We're still on Grizzly; are these no longer relevant in Havana (I'm stuck in the past)?

Anyhow, here's what I really care about:

https://github.com/rcbops-cookbooks/nova-network/commit/4f54e1b19371e8d1659a92714df78c53ecdda2f6

This looks like the first commit above backported to the Grizzly series, post-4.1.2.

We have multiple provider networks and use the provider external network approach (quantum_external_bridge empty, gateway_external_network_id set to the net ID of the specific external network).

We did a 4.1.2 to 4.1.5 upgrade and lost L3, as gateway_external_network_id is no longer rendered in l3_agent.conf. With no way to set it from the cookbooks, the L3 agent fails (TooManyExternalNetworks) in any RPC deployment where multiple external networks are available, methinks.
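For illustration, a minimal sketch of the configuration we rely on; the file path follows the Grizzly/quantum layout and the UUID is a placeholder:

```sh
# Sketch only: the provider external network approach means no dedicated
# external bridge, plus an explicit external network ID for this agent.
cat >> /etc/quantum/l3_agent.ini <<'EOF'
external_network_bridge =
gateway_external_network_id = <uuid-of-our-external-net>
EOF
```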

claco commented 10 years ago

@cbuben /cc @Apsu I cancelled the Jenkins job until we understand why the change was made, whether it also needs to be reverted in master/havana, or whether the original change should not have been backported to Grizzly (in which case we'll re-run this PR job).

cbuben commented 10 years ago

Thanks, Chris! I commented much the same in the PR comments - I'm uncertain about the origin of the main change (when master was 4.2 series / havana), the reason for the backport, and whether it should be reverted on the 4.1 series, 4.2 series, or both.

claco commented 10 years ago

We shall find out. Thanks for the issue and the PR though. Much appreciated.

Apsu commented 10 years ago

I'm not sure I believe you that simply creating more than one network with --router:external=true will result in an error, which is what I understand you to be saying here. I'm building a fresh 4.1.5 cluster to see what the case is and I'll update this when I have an answer.

Apsu commented 10 years ago

I can address most of this right now, actually. Reading your comments more carefully, I see you're relying on the gateway_external_network_id directive in the config file (and on Chef to put it there) to associate external networks with your routers. This is not only a bad idea (due to inflexibility) and unnecessary (Neutron has a simple, dynamic mechanism for this), but it also breaks one of the primary designs and expectations of how we deploy and manage Neutron routers and networks.

Namely: every host with the single-network-node role will have an l3-agent that is interchangeably capable of hosting every router in the cluster, with equal access to every internal/external network any router may need/use.

It sounds like what you're trying to do is set the conf directive differently across hosts so that they use different external networks. I can tell you for sure that's not going to work, for lots of reasons, not the least of which is the primary design assumption stated above.

Given that, I suspect when my cluster is up shortly, that having multiple external networks will be perfectly fine (as I recall it always being), and in order to utilize them, you will simply need to create multiple routers and attach each external network to its own router. If you have more complicated internal/external routing needs, then we'll have to analyze the use-case and see what solutions present themselves and how that relates (or doesn't) to the narrow, opinionated design our cookbooks implement.
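For concreteness, a sketch of that pattern with the Havana-era neutron CLI; names and CIDRs are examples:

```sh
# Two external networks, each attached to its own router.
neutron net-create ext-net-1 --router:external=True
neutron net-create ext-net-2 --router:external=True
neutron subnet-create ext-net-1 203.0.113.0/24 --disable-dhcp
neutron subnet-create ext-net-2 198.51.100.0/24 --disable-dhcp

neutron router-create router-1
neutron router-create router-2
neutron router-gateway-set router-1 ext-net-1
neutron router-gateway-set router-2 ext-net-2
```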

rcbjenkins commented 10 years ago

SHOW ME YOUR HISTORY!!


cbuben commented 10 years ago

Thanks for looking at this, Evan.

AFAIK: each L3 agent instance can deal with one and only one external network. This is a documented design limitation.

http://docs.openstack.org/havana/config-reference/content/adv_cfg_l3_agent_multi_extnet.html

"Since each L3 agent can be associated with at most one external network".

I understand and agree with everything you're saying about how things should be; this is a counterintuitive restriction. Every L3 agent instance should be able to host routers for any <internal,external> pair of networks. Associating a single external network with an L3 agent instance is a broken concept; <internal,external> should only matter per router instance, not per L3 agent instance. However, AFAIK that's not the current state.

RPC chef deployment topology is going to constrain you to a single L3 agent per host, but hypothetically you could override gateway_external_network_id on a per-node basis; then you'd need one network node per external network, each with its own L3 agent, ick.
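If someone really wanted to try that per-node override, a sketch follows; the attribute path is hypothetical (check the cookbook's attributes for the real key), and normal attributes are used so the value persists on the node object:

```sh
# Hypothetical per-node pinning of an external network ID via knife exec.
# The "quantum"/"l3"/"gateway_external_network_id" path is illustrative,
# not necessarily what the cookbook actually reads.
knife exec -E '
  nodes.find("name:network-node-1") do |n|
    n.normal["quantum"]["l3"]["gateway_external_network_id"] = "<uuid-of-ext-net-1>"
    n.save
  end
'
```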

I'm not understanding the intent of removing this configuration. I dislike the design limitations that make gateway_external_network_id necessary in the first place, but why kill the ability to configure it? Are you saying: "with RPC, since we don't think you can override the gateway_external_network_id on a per-L3-agent basis, there will never be a way for you to handle more than one external network, and thus this configuration is unnecessary?"

Apsu commented 10 years ago

Ah, damn, you are indeed correct regarding one external network per l3-agent instance. I haven't had a need to try multiples in a while, nor has a customer escalation brought one to me lately, but I seemed to recall it being worked on/fixed some time ago. I found https://review.openstack.org/#/c/34192/, which is probably what I was thinking of, yet it got abandoned and I don't see much, if any, further work on the idea :/

So, to address the main point you raise and question you ask:

We currently design, build and attempt to manage/monitor network nodes/agents as fully interchangeable resources in the available pool -- in particular via rpcdaemon's behavior in assisting with failover and network/router scheduling amongst available dhcp/l3 agents, respectively.

I don't disagree that it's possible to, for example, disable rpcdaemon (or perhaps just its l3-agent plugin), create N network nodes to match N external networks, and hand-map their templated ID directives accordingly. Then, after Chef does the needful, manually schedule the routers you create onto the appropriate nodes' agents, still keeping the Neutron router auto-scheduler disabled (as our cookbooks have done for a few point releases). But as you said, that's definitely an icky "solution".
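A sketch of the hand-scheduling half of that workaround, with placeholder IDs; this assumes the router auto-scheduler stays disabled:

```sh
# Find each node's l3-agent ID, then pin routers to the agent whose config
# names the matching external network. IDs below are placeholders.
neutron agent-list
neutron l3-agent-router-add <l3-agent-id-on-node-1> <router-1-id>
neutron l3-agent-router-add <l3-agent-id-on-node-2> <router-2-id>
neutron router-list-on-l3-agent <l3-agent-id-on-node-1>   # verify placement
```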

Granted, there's always the possibility of configuring multiples of your own agents on a smaller handful of nodes, but you'd still be in the crappy position of relying on Neutron's builtin failure modes and scheduling -- which are pretty abysmal -- not to mention that it'd all exist entirely outside of Chef.

I think in the end there's an opportunity to solve this in Neutron itself, which is what should have happened a year ago, but there's also the possibility of solving it with a little Chef and perhaps a somewhat smarter rpcdaemon plugin. If we allowed for deploying 2+ l3-agents on any given node(s), bundled with appropriate process/service management so each points at its own conf file, and if rpcdaemon could identify multiple agents for a given external-network-connected router, it could still handle failover/scheduling intelligently for you, without trying to move routers between the "wrong" agents.
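A rough sketch of what a second agent on one node might look like; paths and service wiring are illustrative (quantum-* names under Grizzly, neutron-* under Havana), and handle_internal_only_routers comes from the multi-external-net doc linked above:

```sh
# Second l3-agent with its own config, bound to a second external network.
cat > /etc/quantum/l3_agent-ext2.ini <<'EOF'
[DEFAULT]
external_network_bridge =
gateway_external_network_id = <uuid-of-ext-net-2>
# Only one agent per deployment should handle internal-only routers.
handle_internal_only_routers = False
EOF

quantum-l3-agent --config-file /etc/quantum/quantum.conf \
                 --config-file /etc/quantum/l3_agent-ext2.ini
```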

I'll have to think about this and lab up some ideas to see if a relatively quick win presents itself. Short of that, Neutron miracles, or the node-per-network-minus-failover approach, I'll have to say yes: there's not a great way to do it and still meet the HA/configuration design of RPC in its current form, thus this configuration is unnecessary. (Sorry, and you're welcome.)

pellaeon commented 10 years ago

I'd like to add some information: since Change Id260a239, developed in the Icehouse cycle, a single l3-agent should be able to handle multiple external networks, though I have not tested this and am not sure it is exactly what we want. I also need multiple external networks in my environment (Havana); this post outlines the way I achieve it (it is indeed hacky).
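For reference, a sketch of the Icehouse-era configuration that change enables (untested here, per the comment above): leaving both directives empty lets one l3-agent serve routers on multiple external networks.

```sh
# Sketch only: with neither an external bridge nor a pinned external network
# ID, the Icehouse l3-agent looks up each router's external net dynamically.
cat > /etc/neutron/l3_agent.ini <<'EOF'
[DEFAULT]
external_network_bridge =
gateway_external_network_id =
EOF
```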

cbuben commented 10 years ago

/cc @Apsu

Hey Evan: after our offline discussion, where did we leave this? We were going to re-enable the ability to configure gateway_external_network_id, right?

Apsu commented 10 years ago

@cbuben Yeah, after our offline discussion, I am okay with re-enabling the ability, so at least the option is present, if you're willing to do the extra work of running multiple agents outside of our supported, opinionated "expected featureset". Since you already submitted a PR, we'll have to take a look at where it should be applied, or just make a new one with the right rebase, etc.

cbuben commented 10 years ago

@claco @Apsu

Thanks, Evan. So can we go ahead and merge this on master? How about a merge to v4.2.3rc?

claco commented 10 years ago

@cbuben At this very moment, we're not actively working on these cookbooks. I shall inquire.

cbuben commented 10 years ago

@claco @Apsu I'm confused - "not working on these cookbooks?" I mean, this is really the only path for us to evolve RPC, right? What other alternatives do I have? Should I be getting worried / twitchy? FWIW, this issue is a pain point and operational risk for us.

claco commented 10 years ago

@cbuben It simply means we're on other tasks at the moment, so I don't have a specific timeframe for the master merge of this PR or a PR down into v4.2.3rc.