kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Created subnets are unnecessarily small #7193

Open mohag opened 5 years ago

mohag commented 5 years ago

1. What kops version are you running? The command kops version will display this information. 1.12.1

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag. 1.12.9

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops create cluster --associate-public-ip=False --bastion=True --cloud=aws --dns-zone=k8s.domain.com --image 595879546273/CoreOS-stable-2135.4.0-hvm --master-size=t3.large --master-zones=eu-central-1a,eu-central-1b,eu-central-1c --master-count=3 --master-volume-size=100 --network-cidr=172.31.239.0/24 --networking=calico --node-count=1 --node-size=m5a.2xlarge --node-volume-size=300 --target=direct --topology=private --zones=eu-central-1a,eu-central-1b,eu-central-1c --ssh-public-key=~/.ssh/k8s-dev-id_rsa.pub --name=dev-qa.k8s.domain.com --state s3://my-state-store --authorization=rbac --encrypt-etcd-storage --kubernetes-version 1.12.9

kops update cluster --name=dev-qa.k8s.domain.com --state s3://my-state-store --yes

5. What happened after the commands executed? kops fails to create resources due to "error creating subnet: InvalidSubnet.Range: The CIDR '172.31.239.8/30' is invalid."

Three normal /27 subnets were created (in the cluster config), and attempts were made to create (too small) /30 subnets for the utility subnets. This would use around half (a /25) of the given IP range for the subnets, leaving the rest unused.

6. What did you expect to happen?

Three normal /26 subnets and three /28 utility subnets should be created (one per AZ per type), using most of the given address space in the VPC.

If the VPC is shared with other things, smaller subnets can be used.
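For illustration (my sketch, not kops output), this is how the reported /24 could be split into those subnets, using Python's ipaddress module; the zone names are the ones from the command above, and the result matches the manually substituted subnets further down:

    # Sketch of the expected split of 172.31.239.0/24: three /26 node subnets
    # plus three /28 utility subnets, one per AZ.
    import ipaddress

    vpc = ipaddress.ip_network("172.31.239.0/24")

    quarters = list(vpc.subnets(new_prefix=26))                      # four /26 blocks
    node_subnets = quarters[1:]                                      # .64/26, .128/26, .192/26
    utility_subnets = list(quarters[0].subnets(new_prefix=28))[:3]   # .0/28, .16/28, .32/28

    zones = ["eu-central-1a", "eu-central-1b", "eu-central-1c"]
    for zone, cidr in zip(zones, node_subnets):
        print("Private", zone, cidr)
    for zone, cidr in zip(zones, utility_subnets):
        print("Utility", zone, cidr)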

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information. https://gist.github.com/mohag/2244eb229b15e776fd65b84a35a8cfb0

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here. Issue is visible in the generated manifest. https://gist.github.com/mohag/2244eb229b15e776fd65b84a35a8cfb0#file-kops-create-cluster-log

9. Anything else we need to know? This seems to match the behaviour described in the comments here. Ideally, the number of subnets to split into should be based on max(number of availability zones specified, 3) + 1. (That allows for at least three AZs to be added. I'm not aware of providers with more than three AZs in a region; if they do exist and are specified when the cluster is created, the subnet sizes can be reduced proportionally.)

(Domain name and s3 bucket name substituted)

Manually substituting the subnets part with

  subnets:
  - cidr: 172.31.239.64/26
    name: eu-central-1a
    type: Private
    zone: eu-central-1a
  - cidr: 172.31.239.128/26
    name: eu-central-1b
    type: Private
    zone: eu-central-1b
  - cidr: 172.31.239.192/26
    name: eu-central-1c
    type: Private
    zone: eu-central-1c
  - cidr: 172.31.239.0/28
    name: utility-eu-central-1a
    type: Utility
    zone: eu-central-1a
  - cidr: 172.31.239.16/28
    name: utility-eu-central-1b
    type: Utility
    zone: eu-central-1b
  - cidr: 172.31.239.32/28
    name: utility-eu-central-1c
    type: Utility
    zone: eu-central-1c

gets them created. The kops update then also succeeds.

The created cluster validates as well.

mohag commented 5 years ago

(My goal is to use the smallest range practical, to reduce the chances of overlap with peered VPCs.) Splitting the range into 8 parts instead of four each time means that the network CIDR needs to be two bits larger than necessary.

(With manual splitting, I'm not sure if the minimum size is a /24 or a /25. With the current default splitting, the minimum without manually editing the subnets is likely a /22 (if utility subnets of /29 work) or a /21 (if utility subnets need to be at least /28).)

Edit: It seems like the ELBs for the cluster are created in the utility subnets. With that, it might make sense to divide the range equally... Clusters might often have a similar number of, if not more, load balancers than nodes (especially if Ingresses are not used or non-HTTP(S) services are used). That might mean dividing the initial range into 2^(ceil(log2(number of AZs * 2))) subnets instead; i.e. for 3 AZs, just dividing it into 8 equal parts and using 3 for utility subnets and 3 for normal/node subnets might make sense (six /27 subnets for a /24, instead of three /26 subnets and three /28 subnets). (I'm not exactly sure what ends up in each of the subnet types.)
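As a quick worked example of that formula (illustrative only; the zone counts are hypothetical):

    # Worked example of the 2^(ceil(log2(zones * 2))) split suggested above.
    import math

    for zones in (1, 2, 3, 4):
        parts = 2 ** math.ceil(math.log2(zones * 2))
        print(zones, "AZ(s) ->", parts, "equal parts")
    # 1 -> 2, 2 -> 4, 3 -> 8, 4 -> 8
    # For 3 AZs and a /24 this gives eight /27s: three for node subnets,
    # three for utility subnets, and two spare.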

mohag commented 5 years ago

(The CNI plugin also seems to have an effect - the Amazon VPC CNI plugin needs huge ranges for the node subnets, since the pods directly use IPs from there. CNI plugins using their own ranges can get away with much smaller node subnets) (But this is about the size of the subnets relative to the CIDR range given)

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

mohag commented 4 years ago

/remove-lifecycle stale

khos2ow commented 4 years ago

@mohag did you by any chance find a workaround for this? I just faced the exact same issue: a /24 gets split into sets of /27 and /30 subnets, and the latter fails.

mohag commented 4 years ago

@khos2ow I'm currently using EKS instead.

You can manually edit the subnets before creating, but that is no fun...

olemarkus commented 4 years ago

The logic right now is to take the network CIDR, divide it into 8, and then take the last of those 8 and divide further into 8 subnets.
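For the /24 from this report, that works out roughly as follows (a sketch using Python's ipaddress module, not the actual kops code; which eighth gets subdivided is illustrative here):

    # Divide the network CIDR into 8, then divide one of those eighths into 8 again.
    import ipaddress

    vpc = ipaddress.ip_network("172.31.239.0/24")

    eighths = list(vpc.subnets(prefixlen_diff=3))           # eight /27 blocks (private subnets)
    utility = list(eighths[-1].subnets(prefixlen_diff=3))   # eight /30 blocks (utility subnets)

    print(eighths[0].prefixlen, utility[0].prefixlen)       # 27 30
    # A /30 is below the AWS minimum subnet size of /28, hence the
    # InvalidSubnet.Range error in the report above.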

That logic makes sense in most cases, but there are edge cases such as this one.

For the CNIs that use native ENI networking, I think the solution is more in the direction of creating dedicated subnets for that purpose, outside of the standard networks kops creates. I'd handle that as a separate issue.

olemarkus commented 4 years ago

BTW, based on looking at some real clusters, the private subnets consume way more IPs than the utility subnets.

johngmyers commented 4 years ago

That logic allows the later addition of availability zones, up to a total of seven.

We don't expect to add zones after cluster creation and usually use three zones, so our wrapper around kops divides the CIDR based on the number of initial zones. For four to seven zones it uses the same algorithm as kops. For two or three zones it divides by 4, with the last further divided into another 4. For a single zone it divides in half, with one of those further divided in half.
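A rough sketch of that zone-based division (a hypothetical helper, not the actual wrapper code):

    # Hypothetical helper: pick the division factor from the initial zone count,
    # following the scheme described above.
    import ipaddress

    def split_for_zones(network_cidr: str, zones: int):
        net = ipaddress.ip_network(network_cidr)
        if zones >= 4:       # four to seven zones: same as kops (8-way, last eighth split again)
            diff = 3
        elif zones >= 2:     # two or three zones: divide by 4, last quarter split again
            diff = 2
        else:                # single zone: divide in half, one half split again
            diff = 1
        blocks = list(net.subnets(prefixlen_diff=diff))
        utility = list(blocks[-1].subnets(prefixlen_diff=diff))
        return blocks[:-1], utility

    private, utility = split_for_zones("172.31.239.0/24", zones=3)
    print(private[0], utility[0])   # 172.31.239.0/26 172.31.239.192/28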

olemarkus commented 4 years ago

I think we can do things a bit more simply for kops create. The purpose of running kops create cluster is just to get a cluster going. If one has special needs (always the case with a production setup, imho), one should use kops create -f <spec>. But kops should validate the CIDR a bit more, and I think being able to start smaller clusters does make sense.

So a suggestion for a fix is that we don't allow clusters with a networkCIDR smaller than /24. If the netmask is /23 or /24, we divide into 4 subnets instead of 8. Smaller subnets don't work very well.

If that sounds good, I should be able to create a PR fairly quickly.
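A rough sketch of that rule (illustrative only, not the proposed PR):

    # Reject network CIDRs smaller than /24; use a 4-way split for /23 and /24,
    # otherwise keep the existing 8-way split.
    import ipaddress

    def plan_split(network_cidr: str):
        net = ipaddress.ip_network(network_cidr)
        if net.prefixlen > 24:
            raise ValueError(f"{net} is too small for automatic subnet assignment")
        diff = 2 if net.prefixlen >= 23 else 3
        blocks = list(net.subnets(prefixlen_diff=diff))
        return blocks[:-1], list(blocks[-1].subnets(prefixlen_diff=diff))

    private, utility = plan_split("172.31.239.0/24")
    print(private[0].prefixlen, utility[0].prefixlen)   # 26 28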

johngmyers commented 4 years ago

That /24 boundary seems arbitrary. It seems to be incorporating the AWS-specific restriction on subnets not being smaller than /28.

Our automation, including for production clusters, runs kops create into a temporary bucket, extracts the generated spec, makes changes to it, then uses kops create -f or kops replace -f into the real bucket.

I think keying off of the maximum expected number of zones makes more sense than keying off of the network size. If you're only ever going to run a big cluster in three zones then you don't want to waste half of your address space. And you don't want to manually create those subnets with all their attendant subresources.

johngmyers commented 4 years ago

kops create cluster and kops create -f both call cloudup.PerformAssignments(), which splits up the network CIDR. It also happens for such commands as edit cluster, set cluster, upgrade cluster, and so on.

mohag commented 4 years ago

Why always support up to 7 availability zones? Anything more than 3 in a single region seems to be extremely rare. (I'm assuming that clusters across multiple regions are not common.) (The number of subnets needed is known as well, which means that adapting the split to the relevant power of two should be possible too. The split function could gain a parameter for the number of parts needed, for example...)

Our current (currently EKS) clusters have much larger private subnets, but they use the AWS-VPC-CNI provider, which uses these IPs for pods. Other CNI options should only require private IPs for the nodes and internal load balancers?

olemarkus commented 4 years ago

My suggestion was only for allowing smaller clusters without redoing the entire logic for how we split subnets. Changing the logic for any CIDR would be a larger discussion, I think.

But yeah, more than 3 masters is rare. Up to seven seems overkill for any scenario.

The CNIs that use ENI are a special case. Here we should create dedicated subnets that are used for pods, so the 8-way split actually makes sense. That would also make Lyft VPC work out of the box. The minimum CIDR netmask needs to be much smaller in this case too.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

johngmyers commented 4 years ago

/remove-lifecycle stale /lifecycle frozen