mohag opened this issue 5 years ago
(My goal is to use the smallest range practical, to reduce the chance of overlap with peered VPCs.) Splitting the subnet into 8 parts instead of 4 each time means that the CIDR needs to be two bits larger than necessary.
(With manual splitting, I'm not sure whether the minimum size is a /24 or a /25. With the current default splitting, the minimum without manually editing the subnets is likely a /22 (if utility subnets of /29 work) or a /21 (if utility subnets need to be at least /28).)
Edit: It seems that the ELBs for the cluster are created in the utility subnets. Given that, it might make sense to divide the range equally: clusters might often have a similar number of, if not more, load balancers than nodes (especially if Ingresses are not used, or non-HTTP(S) services are used). That might mean dividing the initial subnet into 2^(ceil(log2((number of AZs)*2))) subnets instead; i.e. for 3 AZs, dividing it into 8 equal parts and using 3 for utility subnets and 3 for normal/node subnets might make sense (six /27 subnets for a /24, instead of three /26 subnets and three /27 subnets). (I'm not exactly sure what ends up in each of the subnet types.)
(The CNI plugin also has an effect: the Amazon VPC CNI plugin needs huge ranges for the node subnets, since the pods use IPs directly from them. CNI plugins that use their own ranges can get away with much smaller node subnets.) (But this issue is about the size of the subnets relative to the CIDR range given.)
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
@mohag did you by any chance find a workaround for this? I just faced the exact same issue: with a /24, kops creates sets of /27 and /30 subnets, and the latter fails.
@khos2ow I'm currently using EKS instead.
You can manually edit the subnets before creating, but that is no fun...
The logic right now is to take the network CIDR, divide it into 8, and then take the last of those 8 and divide further into 8 subnets.
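A rough sketch of that split using Python's stdlib ipaddress module (illustrative only, not kops' actual Go implementation; the function name is made up):

```python
import ipaddress

def kops_style_split(network_cidr):
    """Mimic the split described above: divide the network CIDR into 8
    equal parts; the last part is further divided into 8 utility subnets."""
    net = ipaddress.ip_network(network_cidr)
    eighths = list(net.subnets(prefixlen_diff=3))   # /n -> 8 x /(n+3)
    node_subnets = eighths[:-1]                     # room for up to 7 zones
    utility_subnets = list(eighths[-1].subnets(prefixlen_diff=3))
    return node_subnets, utility_subnets

nodes, utils = kops_style_split("172.31.239.0/24")
# For a /24 this yields /27 node subnets and /30 utility subnets,
# and /30 is below AWS's /28 minimum subnet size.
```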
That logic makes sense in most cases, but there are edge cases such as these.
For the CNIs that use native ENI networking, I think the solution is more in the direction of creating dedicated subnets for that purpose, outside of the standard networks kops creates. I'd handle that as a separate issue.
BTW, based on looking at some real clusters, the private subnets consume way more IPs than the utility subnets.
That logic allows the later addition of availability zones, up to a total of seven.
We don't expect to add zones after cluster creation and usually use three zones, so our wrapper around kops divides the CIDR based on the number of initial zones. For four to seven zones it uses the same algorithm as kops. For two or three zones it divides by 4, with the last further divided into another 4. For a single zone it divides in half, with one of those further divided in half.
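The tiering described above could be sketched like this (a hypothetical reconstruction of the wrapper's logic, not its actual code):

```python
import ipaddress

def zone_based_split(network_cidr, zones):
    """Choose the top-level split factor from the number of initial zones:
    4-7 zones -> split into 8 (same as kops today), 2-3 zones -> split
    into 4, 1 zone -> split in half. The last part is subdivided by the
    same factor to produce the utility subnets."""
    if zones >= 4:
        diff = 3        # 8 parts
    elif zones >= 2:
        diff = 2        # 4 parts
    else:
        diff = 1        # 2 parts
    net = ipaddress.ip_network(network_cidr)
    parts = list(net.subnets(prefixlen_diff=diff))
    node_subnets = parts[:-1][:zones]
    utility_subnets = list(parts[-1].subnets(prefixlen_diff=diff))[:zones]
    return node_subnets, utility_subnets
```

With three zones on a /24 this produces three /26 node subnets and three /28 utility subnets, all of which clear AWS's /28 minimum.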
I think we can do things a bit simpler for kops create. The purpose of running kops create cluster is just to get a cluster going. If one has special needs (always the case with a production setup, imho), one should use kops create -f <spec>. But kops should validate the CIDR a bit more, and I think being able to start smaller clusters does make sense.
So a suggestion for a fix is that we don't allow clusters with a networkCIDR netmask longer than /24 (i.e. a network smaller than a /24). If the netmask is /23 or /24, we divide into 4 subnets instead of 8. Smaller subnets don't work very well.
If that sounds good, I should be able to create a PR fairly quickly.
That /24 boundary seems arbitrary. It seems to be incorporating the AWS-specific restriction on subnets not being smaller than /28.
Our automation, including for production clusters, runs kops create into a temporary bucket, extracts the generated spec, makes changes to it, then uses kops create -f or kops replace -f into the real bucket.
I think keying off of the maximum expected number of zones makes more sense than keying off of the network size. If you're only ever going to run a big cluster in three zones then you don't want to waste half of your address space. And you don't want to manually create those subnets with all their attendant subresources.
kops create cluster and kops create -f both call cloudup.PerformAssignments(), which splits up the network CIDR. The split also happens for such commands as edit cluster, set cluster, upgrade cluster, and so on.
Why always support up to 7 availability zones? Anything more than 3 in a single region seems extremely rare. (I'm assuming that clusters spanning multiple regions are not common.) (The number of subnets needed is known as well, which means that adapting the split to the relevant power of two should be possible; the split function could gain a parameter for the number of parts needed, for example.)
Our current (currently EKS) clusters have much larger private subnets, but they use the AWS-VPC-CNI provider, which uses these IPs for pods. Other CNI options should only require private IPs for the nodes and internal load balancers.
My suggestion was only for allowing smaller clusters without redoing the entire logic for how we split subnets. Changing the logic for any CIDR would be a larger discussion, I think.
But yeah, more than 3 masters is rare. Up to seven seems overkill for any scenario.
The CNIs that use ENIs are a special case. There we should create dedicated subnets that are used for pods, so the 8-way split actually makes sense. That would also make Lyft VPC work out of the box. The minimum CIDR netmask needs to be much smaller in this case too.
/remove-lifecycle stale /lifecycle frozen
1. What kops version are you running? The command kops version will display this information.
1.12.1
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
1.12.9
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops update cluster --name=dev-qa.k8s.domain.com --state s3://my-state-store --yes
5. What happened after the commands executed?
kops fails to create resources due to "error creating subnet: InvalidSubnet.Range: The CIDR '172.31.239.8/30' is invalid."
Three normal /27 subnets were created (in the cluster config), and attempts were made to create (too small) /30 subnets for the utility subnets. This would use around half (a /25) of the given IP range for the subnets, leaving the rest unused.
6. What did you expect to happen?
Three normal /26 subnets should be created and three /28 utility subnets should be created, using most of the given address space in the VPC. (One per AZ per type)
If the VPC is shared with other things, smaller subnets can be used.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
https://gist.github.com/mohag/2244eb229b15e776fd65b84a35a8cfb0
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
The issue is visible in the generated manifest. https://gist.github.com/mohag/2244eb229b15e776fd65b84a35a8cfb0#file-kops-create-cluster-log
9. Anything else do we need to know?
This seems to match the behaviour described in the comments here. Ideally the number of subnets split into should be based on max(number of availability zones specified, 3) + 1. (That allows for at least three AZs to be added. I'm not aware of providers with more than that in a region; if they do exist and are specified when the cluster is created, the subnet sizes can be reduced proportionally.)
(Domain name and s3 bucket name substituted)
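One reading of the max(zones, 3) + 1 suggestion from question 9, as a hedged sketch (the function name and the choice to subdivide the last part by the same factor are my own assumptions):

```python
import ipaddress
import math

def suggested_split(network_cidr, zones):
    """Split into max(zones, 3) + 1 parts, rounded up to a power of two;
    reuse the last part for utility subnets, subdivided the same way."""
    parts_needed = max(zones, 3) + 1
    diff = math.ceil(math.log2(parts_needed))
    net = ipaddress.ip_network(network_cidr)
    parts = list(net.subnets(prefixlen_diff=diff))
    node_subnets = parts[:zones]
    utility_subnets = list(parts[-1].subnets(prefixlen_diff=diff))[:zones]
    return node_subnets, utility_subnets
```

For the /24 in this issue with three AZs, this yields the three /26 node subnets and three /28 utility subnets expected in question 6.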
Manually substituting the subnets part gets them created. The kops update then also succeeds, and the created cluster validates as well.