kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Support for private subnet instance groups with NAT Gateway #428

Closed: tazjin closed this issue 7 years ago

tazjin commented 8 years ago

After some discussions with @chrislovecnm I'm using this issue to summarise what we need to do to support instances in private subnets with NAT gateways.

Problem

Currently all instance groups created by kops are placed in public subnets. This may not be desirable in all use-cases. There are related open issues about this (#232, #266 which should maybe be closed, #220, #196).

As the simplest use case, kops should support launching instance groups into private subnets with Amazon's managed NAT gateways as the default route.

In addition, a feature to specify a default route may be desirable for use cases where NAT is handled differently, as suggested by @ProTip.

AWS resources

In order to set this up, several resources are required (a rough sketch follows the list). We need:

  1. At least one public subnet (can be a subnet in which we have public instance groups with nodes / masters)
  2. An Elastic IP to associate with each NAT gateway
  3. At least one NAT gateway resource in a public subnet, associated with an elastic IP.
  4. A route table and IGW entry for the public subnet (kops currently creates this).
  5. A route table and entry for sending traffic to the NAT gateway from the private subnet.
  6. Correct route table associations for each subnet.
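For concreteness, here is a minimal Terraform sketch of items 2-6 for a single AZ. The VPC/subnet IDs, resource names and layout are illustrative placeholders only, not anything kops generates today:

```hcl
# Illustrative sketch only: one AZ, all literal IDs are placeholders.
resource "aws_eip" "nat" {
  vpc = true
}

# NAT gateway must live in a public subnet (item 1)
resource "aws_nat_gateway" "nat" {
  allocation_id = "${aws_eip.nat.id}"
  subnet_id     = "subnet-0public0000000000" # placeholder public subnet
}

resource "aws_route_table" "private" {
  vpc_id = "vpc-0example0000000000" # placeholder VPC
}

# Default route for the private subnet via the NAT gateway (item 5)
resource "aws_route" "private_default" {
  route_table_id         = "${aws_route_table.private.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.nat.id}"
}

# Route table association for the private subnet (item 6)
resource "aws_route_table_association" "private" {
  subnet_id      = "subnet-0private000000000" # placeholder private subnet
  route_table_id = "${aws_route_table.private.id}"
}
```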

Open questions

  • [ ] Right now kops creates a single route table and internet gateway for all subnets. We may need to split this up into at least a public and a private route table. Which routes must exist in both tables?
  • [ ] AWS NAT gateways are redundant by default, but only within their specified availability zone. Is it possible to specify multiple default routes as a sort of "load-balanced" NAT setup? Or should we have a route table per AZ with private subnets, each with a corresponding NAT gateway?

Implementation

After the open questions are answered (and unless somebody else comes up with any new ones!) I think the implementation steps are roughly these:

Dump your thoughts in the comments and I'll update this as we go along. I'm willing to spend time on this but have limited Go experience, so if someone familiar with the code base has time to answer questions that may come up I'd be grateful.

chrislovecnm commented 8 years ago

cc: @kris-nova

oviis commented 8 years ago

Hi guys, thanks a lot for managing this. First of all, I love kops, Kubernetes and AWS. :-)

I created a sample kops cluster named "k8s-test-evironment-com" in 3 AWS AZs (eu-west) and output it to Terraform, then started managing the routing in an extra file, "isolated_cluster_sample.tf". I had to answer the same questions as you, and in the end decided to create a NAT gateway and EIP per AZ. The tricky part was to tag the route table WITHOUT the "KubernetesCluster" tag!!! With that tag and 2 routes, the k8s services network doesn't work!!! That cost me 2 days ;-)

My working sample Terraform code for isolated nodes adds private subnets alongside the subnets generated by kops. The NAT gateways are placed in the generated public subnets. This sample can be used as a template for changing the Go generation code afterwards.

For testing, you need the following steps:

Start with the file "isolated_cluster_sample.tf":

```hcl
#----------------------------------------
# begin private subnets
#
# generated VPC CIDR in this case "10.10.0.0/16"
# AWS-AZS="eu-west-1a,eu-west-1b,eu-west-1c"
#----------------------------------------

resource "aws_subnet" "eu-west-1a-k8s-test-evironment-com_private" {
  vpc_id            = "${aws_vpc.k8s-test-evironment-com.id}"
  cidr_block        = "10.10.128.0/19"
  availability_zone = "eu-west-1a"
  tags = {
    KubernetesCluster = "k8s.test.evironment.com"
    Name              = "eu-west-1a.k8s.test.evironment.com"
  }
}

resource "aws_subnet" "eu-west-1b-k8s-test-evironment-com_private" {
  vpc_id            = "${aws_vpc.k8s-test-evironment-com.id}"
  cidr_block        = "10.10.160.0/19"
  availability_zone = "eu-west-1b"
  tags = {
    KubernetesCluster = "k8s.test.evironment.com"
    Name              = "eu-west-1b.k8s.test.evironment.com"
  }
}

resource "aws_subnet" "eu-west-1c-k8s-test-evironment-com_private" {
  vpc_id            = "${aws_vpc.k8s-test-evironment-com.id}"
  cidr_block        = "10.10.192.0/19"
  availability_zone = "eu-west-1c"
  tags = {
    KubernetesCluster = "k8s.test.evironment.com"
    Name              = "eu-west-1c.k8s.test.evironment.com"
  }
}

#----------------------------------------
# end private subnets
#----------------------------------------

#-------------------------------------------------------
# private nating begin
#-------------------------------------------------------

resource "aws_eip" "nat-1a" {
  vpc = true
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_eip" "nat-1b" {
  vpc = true
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_eip" "nat-1c" {
  vpc = true
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_nat_gateway" "gw-1a" {
  allocation_id = "${aws_eip.nat-1a.id}"
  subnet_id     = "${aws_subnet.eu-west-1a-k8s-test-evironment-com.id}"
}

resource "aws_nat_gateway" "gw-1b" {
  allocation_id = "${aws_eip.nat-1b.id}"
  subnet_id     = "${aws_subnet.eu-west-1b-k8s-test-evironment-com.id}"
}

resource "aws_nat_gateway" "gw-1c" {
  allocation_id = "${aws_eip.nat-1c.id}"
  subnet_id     = "${aws_subnet.eu-west-1c-k8s-test-evironment-com.id}"
}

#-------------------------------------------------------
# private nating end
#-------------------------------------------------------

#-------------------------------------------------------
# private routing begin
#-------------------------------------------------------

resource "aws_route" "0-0-0-0--nat-1a" {
  route_table_id         = "${aws_route_table.k8s-test-evironment-com_private_1a.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.gw-1a.id}"
}

resource "aws_route" "0-0-0-0--nat-1b" {
  route_table_id         = "${aws_route_table.k8s-test-evironment-com_private_1b.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.gw-1b.id}"
}

resource "aws_route" "0-0-0-0--nat-1c" {
  route_table_id         = "${aws_route_table.k8s-test-evironment-com_private_1c.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.gw-1c.id}"
}

resource "aws_route_table" "k8s-test-evironment-com_private_1a" {
  vpc_id = "${aws_vpc.k8s-test-evironment-com.id}"
  tags = {
    Name = "k8s.test.evironment.com_private"
  }
}

resource "aws_route_table" "k8s-test-evironment-com_private_1b" {
  vpc_id = "${aws_vpc.k8s-test-evironment-com.id}"
  tags = {
    Name = "k8s.test.evironment.com_private"
  }
}

resource "aws_route_table" "k8s-test-evironment-com_private_1c" {
  vpc_id = "${aws_vpc.k8s-test-evironment-com.id}"
  tags = {
    Name = "k8s.test.evironment.com_private"
  }
}

resource "aws_route_table_association" "eu-west-1a-k8s-test-evironment-com_private" {
  subnet_id      = "${aws_subnet.eu-west-1a-k8s-test-evironment-com_private.id}"
  route_table_id = "${aws_route_table.k8s-test-evironment-com_private_1a.id}"
}

resource "aws_route_table_association" "eu-west-1b-k8s-test-evironment-com_private" {
  subnet_id      = "${aws_subnet.eu-west-1b-k8s-test-evironment-com_private.id}"
  route_table_id = "${aws_route_table.k8s-test-evironment-com_private_1b.id}"
}

resource "aws_route_table_association" "eu-west-1c-k8s-test-evironment-com_private" {
  subnet_id      = "${aws_subnet.eu-west-1c-k8s-test-evironment-com_private.id}"
  route_table_id = "${aws_route_table.k8s-test-evironment-com_private_1c.id}"
}

#-------------------------------------------------------
# private routing end
#-------------------------------------------------------
```

MrTrustor commented 8 years ago

Hello!

Thank you for the work done so far. This is far better than anything else I've seen for spinning up K8s clusters on AWS.

I would very much like to see this implemented. My company uses a predefined, well-designed network topology that roughly matches this. AWS also recommends this kind of setup.

AWS published a CloudFormation stack that implements a neat network topology:

To sum it up here:

I would also like to contribute to this project, so I'd be happy to take on a part of the work linked to this issue.

chrislovecnm commented 8 years ago

@tazjin - few questions for you

Regarding the route table: because we are not using an overlay network, we rely on that routing to communicate between AZs, which limits us to 50 servers total (route tables have a default limit of 50 routes).

We need to think full HA, with 3+ masters and multiple AZs. Only way to roll ;)

tazjin commented 8 years ago

@chrislovecnm Hi!

> Can we get a PR in with a design doc?

Yes, I'm hoping to find some time for that this week.

> N00b question - do we have to have a nat gw?

We need something in a public subnet that can NAT the private traffic to the internet (this is assuming people want their clusters to be able to access the internet!)

> How do we design ingress and loadbalancer with private IP space?

Not sure about ingress (depends on the ingress type, I suppose?), but a normal `type: LoadBalancer` service should be able to handle target instances in private subnets.

> How do we route the API server?

There are several options, and this needs discussing. For example:

  1. Public API server (like right now). Potentially configured to enforce valid client certificates (I believe it isn't right now!)
  2. kops is hands-off - it's up to the user to forward their traffic into the private subnet.
  3. Some sort of tunnel (e.g. SSH-based) into where the master is running.

> We are putting in a validate option into kops, how would someone validate the cluster?

I'm not familiar with what that option will do, so no clue!

MrTrustor commented 8 years ago

Just to answer this question:

> Do we need a bastion server as well with the same ssh pem?

If the machine you want to SSH into is in a private subnet, you can set up an ELB to forward traffic on port 22 to this server.
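For illustration, a rough Terraform sketch of such an SSH-forwarding classic ELB; the subnet, instance and security group IDs below are placeholders:

```hcl
# Sketch: classic ELB in a public subnet forwarding TCP 22 to an instance
# that lives in a private subnet. All IDs are placeholders.
resource "aws_elb" "ssh" {
  name    = "k8s-ssh-bastion"
  subnets = ["subnet-0public0000000000"] # public (utility) subnet

  listener {
    lb_port           = 22
    lb_protocol       = "tcp"
    instance_port     = 22
    instance_protocol = "tcp"
  }

  instances       = ["i-0private0000000000"] # instance in the private subnet
  security_groups = ["sg-0allow22000000000"] # SG allowing inbound 22
}
```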

chrislovecnm commented 8 years ago

cc: @ajayamohan

jkemp101 commented 8 years ago

I propose that as the first step kops is made to work in an existing environment (shared VPC) using existing NGWs. As a minimum first set of requirements I would think we need to:

  1. Create a route table for each zone so we are HA.
  2. Specify an existing NAT Gateway(s) to be used in each route table.
  3. Change existing route management to manage multiple route tables instead of the single table used today when nodes are added/removed.

I think having kops spin up an entire environment including public and private subnets with IGWs, NAT Gateways, bastions, proper Security Groups, etc. is asking a lot of kops. Making this work with existing infrastructure as the first step and then adding onto it if necessary is potentially a cleaner path than doing it the other way around. For instance, I already have 3 NGWs and wouldn't want to have to pay for 3 more if kops created them automatically.
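As a very rough Terraform sketch of that "use existing NGWs" first step, one route table per zone could be pointed at pre-existing NAT gateways supplied as input; the zone names and gateway IDs below are placeholders:

```hcl
# Sketch: one private route table per zone, each with a default route to a
# NAT gateway that already exists (IDs in the defaults are placeholders).
variable "vpc_id" {}

variable "zones" {
  default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

variable "existing_nat_gateway_ids" {
  default = ["nat-aaaaaaaa", "nat-bbbbbbbb", "nat-cccccccc"]
}

resource "aws_route_table" "private" {
  count  = 3 # one per zone
  vpc_id = "${var.vpc_id}"
  tags = {
    Name = "private-${element(var.zones, count.index)}"
  }
}

resource "aws_route" "private_default" {
  count                  = 3
  route_table_id         = "${element(aws_route_table.private.*.id, count.index)}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${element(var.existing_nat_gateway_ids, count.index)}"
}
```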

chulkilee commented 8 years ago

I'm new to kops and don't know how kops handles infrastructure (as opposed to k8s), but I think it would be nice if I could use kops to deploy a k8s cluster onto infrastructure set up with other tools, by providing all the necessary information.

For example, if kops had separate steps for infrastructure setup and k8s cluster creation, it would be easier to test this.

Also, I think it would be better to start with a single-AZ setup (no cross-AZ cluster / no HA on top of multiple AZs).

chrislovecnm commented 8 years ago

@tazjin one of the interesting things that @justinsb just brought up was that this may need to use overlay networking for HA to function. We would need a bastion connection in each AZ otherwise, which seems a tad complicated. Thoughts?

@chulkilee I understand that it would be simpler, but non-HA is not usable for us in production. We also probably need to address overlay networking.

chulkilee commented 8 years ago

@chrislovecnm kops should support HA eventually - what I'm saying is that for this new feature, kops could support the simple use case at first. I don't know what the most common deployment scenario for HA is (e.g. HA over multiple AZs, HA in a single AZ, or leveraging federation) or which options kops supports.

MrTrustor commented 8 years ago

@chulkilee The advertised goal of kops is to set up a production cluster. IMHO, a cluster cannot be production-ready if it is not HA. On AWS, if every service is not hosted concurrently in at least 2 AZs, you don't have HA.

@chrislovecnm @justinsb I'm not sure I understand why overlay networking would be mandatory for HA to function: routing between AZs and subnets within a given VPC is pretty transparent in AWS.

jkemp101 commented 8 years ago

I drew a picture of what I am currently testing. It may help with the discussion.

Everything seems to be working so far, but there is one HA deficiency. Internally sourced outbound traffic is routed through a single NGW, NGW A in the diagram. To fix this we would need to:

I tried creating a second route table with the right tag. Unfortunately it just causes all route table updates to stop. I was hoping it would magically get updated.

(diagram: k8s-priv-pub-network)

chrislovecnm commented 8 years ago

@jkemp101 you mention that public-facing nodes are required for a public ELB. Can you not force an ELB to use a public IP and connect to a node that is in private IP space?

jkemp101 commented 8 years ago

@chrislovecnm That is correct. AWS will stop you, because it detects that the routing table for a private subnet does not have an Internet Gateway set as the default route. This is the error message in the AWS console: "This is an Internet-facing ELB, but there is no Internet Gateway attached to the subnet you have just selected: subnet-0653921d." When you create an ELB, you first connect it to a subnet and then assign it to instances.

And k8s is also smart enough to know it needs a public subnet to create the ELB, so it will refuse if it can't find one. But @justinsb suggested labeling manually created subnets after the kops run. That worked fine. I can create services in k8s and it attaches Internet-facing ELBs to the 3 public subnets (which I created and labelled manually) and internal ELBs to the 3 private subnets (which kops created automatically).
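For reference, if those public "utility" subnets were themselves managed in Terraform, a sketch of the cluster-name labeling discussed here might look like this; the VPC ID, CIDR, AZ and cluster name are illustrative placeholders:

```hcl
# Sketch: manually managed public subnet, tagged with the cluster name so
# Kubernetes can discover it when creating Internet-facing ELBs.
# VPC ID, CIDR, AZ and cluster name below are placeholders.
resource "aws_subnet" "utility_a" {
  vpc_id            = "vpc-0example0000000000"
  cidr_block        = "10.10.0.0/22"
  availability_zone = "eu-west-1a"

  tags = {
    KubernetesCluster = "k8s.example.com"
    Name              = "utility.eu-west-1a.k8s.example.com"
  }
}
```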

MrTrustor commented 8 years ago

@jkemp101 Nice work. 2 questions:

jkemp101 commented 8 years ago

@MrTrustor Hope this clarifies. Keep the questions coming.

(diagram: k8s-priv-pub-network-fixed)

chrislovecnm commented 8 years ago

@jkemp101 Many thanks for the help btw; I think if you are at KubeCon, I owe you a libation. Anyways...

Have you been able to do this with kops in its current state or how are you doing this? You mentioned a problem with HA. Please elaborate.

jkemp101 commented 8 years ago

@chrislovecnm I am currently running/testing the configuration depicted in my first diagram. Everything is working well so far. The only HA issue at the moment is that the clusters rely on a single NGW for outbound Internet connections. So if the NGW's zone goes down, the clusters can no longer make outbound connections to the Internet. Inbound through the public ELB should still work fine.

I've automated the cluster build and delete process so a single command brings up the cluster and applies all modifications. All settings (public subnet IDs, NGW IDs, IGW ID, VPC ID, zones, etc.) are in a custom YAML file. Here are the 11 steps for a cluster create.

  1. Load my yaml configuration file
  2. Confirm state folder does not exist in S3 for this cluster (Paranoid step 1)
  3. Check if any instances already are running with this cluster tag on them already (Paranoid step 2)
  4. Run kops create cluster with appropriate flags.
  5. Fixup state files in S3 (e.g. add my tags, change IP subnets, etc.).
  6. Run kops update cluster
  7. Fixup infrastructure by finding the route table created by kops and replacing the route to the IGW with a default route pointing to my NGW (IDs for both are set in my config file). I have plenty of time while the ASGs start bringing up instances. This is technically the step that turns the cluster into a private configuration.
  8. Using pykube, I poll the cluster to wait for it to return the right number of nodes. I know how many make a complete cluster based on settings in my configuration file.
  9. Deploy the Dashboard addon
  10. I apply a more restrictive master and node IAM policy.
  11. I label the manually created public subnets with this cluster's name. The subnet IDs are configured in my YAML configuration file. I never delete the public subnets; I just untag them before deleting the cluster and retag them after creating it.

This script brings up a complete cluster as depicted in the diagram in about 8 minutes with a cluster of 3 masters/3 nodes.

chrislovecnm commented 8 years ago

I am working on testing Weave with kops, and once that is done I would like to see how to incorporate this using an external networking provider. With an external networking provider, I don't think K8s will have to manage the three routing tables. I'll probably set up a hangout with you to determine specifically where the product gaps are.

chulkilee commented 8 years ago

@jkemp101 Glad to hear about the progress, but shouldn't each cluster have its own NAT, so that clusters are more isolated from each other?

chrislovecnm commented 8 years ago

@jkemp101 do you want to set up a hangout to review this? I have work items scheduled to knock this out, and would like to get the requirements clear. clove at datapipe.com is a good email for me.

rbtcollins commented 8 years ago

Hi, pointed here from https://github.com/kubernetes/kubernetes/issues/34430

Firstly, I want to second https://github.com/kubernetes/kops/issues/428#issuecomment-246624210 - nat gw per AZ + subnets in that AZ have their routes (default or otherwise) pointed at it.

Secondly, I have a couple of variations on the stock use case.

The first one is that I really want to be able to reuse my NAT GWs' EIPs - though I don't particularly care about the NAT GW or subnet, reusing those could be a minor cost saving. The reuse of EIPs is to avoid 10-working-day lead times on some APIs I need, which use source-IP ACLs :).

The second one is that I don't particularly care about private vs public subnets - as long as I can direct traffic for those APIs out via a NAT GW with a long-lived IP address, I'm happy :) - which may mean that my use case should be a separate bug, but I was pointed here :P.

@justinsb asked about DHCP and routing - I don't think that's feasible in AWS, since their DHCP servers don't support the options needed - https://ercpe.de/blog/advanced-dhcp-options-pushing-static-routes-to-clients - covers the two options, but neither is supported by DHCP option set objects in the AWS VPC API.

That said, since a NAT GW is only as resilient as its AZ, treat the combination of NAT GW + private subnet as a scaling unit - to run in three AZs, run three NAT GWs and three private subnets, and each subnet will have one and only one NAT GW route.

ajohnstone commented 8 years ago

@jkemp101 possible to share what you've done so far in a gist/git repo? Sounds like a custom bit of Python wrapped around kops.

jkemp101 commented 8 years ago

@ajohnstone Yup. I'll share a git repo shortly with the Python script I'm using. This weekend at the latest.

jkemp101 commented 8 years ago

@ajohnstone Here it is https://github.com/closeio/devops/tree/master/scripts/k8s. Let me know if you have any questions.

chrislovecnm commented 8 years ago

@kris-nova this is the issue that I would like you to work on when you are wanting to OSS. What is the next step? Break apart @jkemp101's python into steps that kops needs? I think these are a set of PRs.

  1. experimental support for all private nodes w/o HA
  2. support for all private nodes w/ HA
  3. experimental support for private and public nodes, with private masters, w/o HA
  4. support for private and public nodes, with private masters, w/ HA

p.s. @kris-nova can you reply on this issue, as for some reason I cannot assign it to you.

krisnova commented 8 years ago

Assign to me

erutherford commented 8 years ago

Allowing for additional routes or existing network infrastructure would be great. We're using VPC Peering for environment interconnectivity. This is also how we're currently accessing our kubernetes API via our VPN client. I'm also using a single TCP Load Balancer in front of our HA Kubernetes Backplane to alleviate any DNS stickiness.

krisnova commented 8 years ago

Code changes for the PR coming soon https://github.com/kubernetes/kops/pull/694

starkers commented 8 years ago

Thanks Kris, I'll test soon also

krisnova commented 8 years ago

@starkers - It's still a WIP - let me hammer on it this weekend a bit more before testing. Was just adding the pointer last night, some people were asking about it.

chrislovecnm commented 7 years ago

A little bird is telling me that we may have our first demo on Friday... no promises, but @kris-nova is kicking some butt!!

druidsbane commented 7 years ago

@kris-nova @chrislovecnm How is that demo going? :) This would be super-useful for creating a simple kubernetes cluster without having to modify the kops terraform output to create the private subnets. Also, hoping for user-supplied security-groups on instance groups soon as well!

hsyed commented 7 years ago

I have put together almost exactly the same architecture as listed above. We are trying to get this to a state for production usage. I generate the VPC in Terraform and then graft the output of kops onto the VPC output.

Weave Net doesn't stay stable for very long once I begin populating services into the cluster. It ends with all sorts of networking weirdness I can't diagnose (kube-dns becoming blind, certain nodes having 3-second network latency, etc.). Flannel / Calico don't work either (out of the box).

I'm happy to battle-test the changes. Is there anything I could do to get the egress route tables populating before Friday?

chrislovecnm commented 7 years ago

@hsyed need more details on Weave. Can you provide details in an open CNI issue?

Work is still in progress on private networking.

jschneiderhan commented 7 years ago

I'm very excited to see all of the progress on this issue! Thanks for all the hard work!

I have been running a cluster in AWS using a setup similar to the "private" topology mentioned in #694, and pretty much a spot-on match for the diagram @jkemp101 created above, where each private subnet has a public "utility" subnet with its own NAT gateway and corresponding route tables which send 0.0.0.0/0 through the AZ's NAT. It all works fine except for one thing: Kubernetes stops updating routes because multiple route tables are found (@jkemp101 also mentioned seeing this behavior). I've had to manually add the routes to all the routing tables every time my set of nodes changes.

It looks as though Kubernetes itself does not currently support multiple route tables (https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws_routes.go#L45). I could definitely be missing something (I'm new to Go, so my investigation speed is slow), but it seems to me that having Kubernetes support multiple routing tables would be a prerequisite to supporting multiple private subnets with dedicated NAT gateways, right? I tried searching Kubernetes for an existing issue about supporting multiple routing tables, but can't find one (perhaps I'm not using the correct keywords).

hsyed commented 7 years ago

@JSchneiderhan I do not know what approach is being taken by the work being done by @kris-nova. I assumed it was updating multiple route tables.

I had a realisation that there is an alternative architecture that could work. We would need a second network interface on each node. This would be 9 subnets per cluster: 3 subnets for kubenet, connected to the route table it manages; 3 additional subnets (NAT routing subnets) for the nodes, where each subnet is connected to a route table pointing at a shared NAT gateway in its AZ; and finally 3 public subnets for ELBs.

The NAT routing subnets would mean dynamically attaching elastic network interfaces, as auto scaling groups do not support these.
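As a sketch of what that second-interface idea might look like in Terraform (the subnet, security group and instance IDs are placeholders, and in practice the instance ID would have to come from something reacting to autoscaling events, since ASGs cannot declare extra interfaces natively):

```hcl
# Sketch: attach a second ENI, living in the NAT-routed subnet, to a node.
# All literal IDs are placeholders; the instance ID would have to be supplied
# by an autoscaling lifecycle hook or similar out-of-band mechanism.
resource "aws_network_interface" "nat_routed" {
  subnet_id       = "subnet-0natrouting0000000" # assumed NAT routing subnet
  security_groups = ["sg-0nodes0000000000000"]  # assumed node security group
}

resource "aws_network_interface_attachment" "nat_routed" {
  instance_id          = "i-0examplenode0000000" # placeholder node instance
  network_interface_id = "${aws_network_interface.nat_routed.id}"
  device_index         = 1
}
```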

jschneiderhan commented 7 years ago

@kris-nova I'd be happy to add an issue over in the kubernetes project, but before I do I could use another pair of eyes to make sure I'm not just being stupid:

I think that for all of this (really awesome) work to function in AWS without an overlay network, an improvement needs to be made to Kubernetes itself. If a subnet-per-AZ is being created, we will end up with multiple VPC routing tables that need to be updated with the CIDR ranges assigned to each node. When Kubernetes goes to create the Route, it's going to find multiple tables with the cluster name tag and return an error on this line https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws_routes.go#L45. At least that's what I'm seeing with multiple manually created route tables. It just logs that line multiple times and never updates the route.

So I think this PR does everything right, but in order for Kubernetes to do its thing properly, it needs to be improved to iterate over all the route tables and create a route for each one. Again, if that makes sense to you I'm happy to create a Kubernetes issue and take a shot at an implementation, but my confidence is pretty low since I'm new to just about every technology involved here :).

chrislovecnm commented 7 years ago

@justinsb thoughts about @jschneiderhan's comment? cc: @kubernetes/sig-network ~ can someone give @jschneiderhan any guidance?

@jschneiderhan we are initially only going to be running with CNI in private mode, btw.

krisnova commented 7 years ago

If you are interested in testing the current branch you are welcome to run it. More information can be found in the PR https://github.com/kubernetes/kops/pull/694

krisnova commented 7 years ago

Closed with #694