cni-ipvlan-vpc-k8s contains a set of CNI and IPAM plugins to provide a simple, host-local, low latency, high throughput, and compliant networking stack for Kubernetes within Amazon Virtual Private Cloud (VPC) environments by making use of Amazon Elastic Network Interfaces (ENI) and binding AWS-managed IPs into Pods using the Linux kernel's IPvlan driver in L2 mode.
The plugins are designed to be straightforward to configure and deploy within a VPC. Kubelets boot and then self-configure and scale their IP usage as needed, without requiring the often-recommended complexities of administering overlay networks, BGP, disabling source/destination checks, or adjusting VPC route tables to provide per-instance subnets to each host (which is limited to 50-100 entries per VPC). In short, cni-ipvlan-vpc-k8s significantly reduces the network complexity required to deploy Kubernetes at scale within AWS.
The maximum number of Pods per AWS instance is determined by ENI limits. Instance types offering 8 ENIs can scale up to and beyond the default Kubernetes limit of 110 pods per instance.
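As a rough illustration of how ENI limits translate into Pod capacity, the sketch below estimates the number of Pod IPs an instance can hold. It assumes the boot ENI is reserved for the node and that each additional ENI keeps its own primary IP, leaving only its secondary IPs for Pods. The function name and the sample limits are hypothetical; real per-instance-type limits come from AWS.

```python
def pod_ip_capacity(eni_limit: int, ips_per_eni: int) -> int:
    """Estimate how many Pod IPs an instance can hold.

    Assumptions (illustrative, not taken from the plugin source):
      * the boot ENI is reserved for the node itself;
      * each additional ENI keeps its primary IP, so only
        ips_per_eni - 1 secondary IPs per ENI go to Pods.
    """
    if eni_limit < 1 or ips_per_eni < 1:
        raise ValueError("limits must be positive")
    return (eni_limit - 1) * (ips_per_eni - 1)


if __name__ == "__main__":
    # Hypothetical limits: 8 ENIs with 30 private IPs each.
    capacity = pod_ip_capacity(8, 30)
    print(capacity, "Pod IPs; exceeds the default 110-Pod limit:", capacity > 110)
```

Under these assumptions, an instance with 8 ENIs needs at least 17 private IPs per ENI before the default Kubernetes limit of 110 Pods, rather than the ENI limits, becomes the binding constraint.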
The primary EC2 boot ENI with its primary private IP is used as the IP address for the node. Our CNI plugins manage additional ENIs and private IPs on those ENIs to assign IP addresses to Pods.
Each Pod contains two network interfaces, a primary IPvlan interface and an unnumbered point-to-point virtual ethernet interface. These interfaces are created via a chained CNI execution.
For applications where Pods need to communicate directly with the Internet, setting the default route to the unnumbered point-to-point interface lets our stack source NAT traffic from the Pod over the primary private IP of the boot ENI, which makes it possible to use Amazon’s Public IPv4 addressing attribute feature. When enabled, Pods can egress to the Internet without needing to manage Elastic IPs or NAT Gateways.
Kubelets and Daemon Sets have high bandwidth, host-local access to all Pods running on the instance; traffic doesn’t transit ENI devices. Source and destination IPs are the well-known Kubernetes addresses on either side of the connection.
Our design is heavily optimized for intra-VPC traffic where IPvlan is the only overhead between the instance’s ethernet interface and the Pod network namespace. We bias toward traffic remaining within the VPC and not transiting the IPv4 Internet where veth and NAT overhead is incurred. Unfortunately, many AWS services require transiting the Internet; however, both DynamoDB and S3 offer VPC gateway endpoints.
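The split between the two Pod interfaces can be pictured as a simple routing decision: destinations inside the VPC leave over the IPvlan interface, and everything else follows the default route over the point-to-point link and is source NATed on the host. The sketch below models only that decision. The function name, the return labels, and the example CIDR are hypothetical, and the real behaviour is implemented with kernel routes rather than Python.

```python
import ipaddress
from typing import Sequence


def egress_path(destination: str, vpc_cidrs: Sequence[str]) -> str:
    """Return which Pod interface a destination would use.

    "ipvlan"   -> direct intra-VPC path over the IPvlan interface
    "veth+nat" -> default route over the unnumbered point-to-point link,
                  source NATed to the node's primary private IP
    """
    addr = ipaddress.ip_address(destination)
    for cidr in vpc_cidrs:
        if addr in ipaddress.ip_network(cidr):
            return "ipvlan"
    return "veth+nat"


if __name__ == "__main__":
    vpc = ["10.0.0.0/16"]  # hypothetical VPC CIDR
    for dst in ("10.0.5.9", "8.8.8.8"):
        print(dst, "->", egress_path(dst, vpc))
```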
While we have not yet implemented IPv6 support in our CNI stack, we have plans to do so in the near future. IPv6 can make use of the IPvlan interface for both VPC traffic as well as Internet traffic, due to AWS’s use of public IPv6 addressing within VPCs and support for egress-only Internet Gateways. NAT and veth overhead will not be required for this traffic.
We’re planning to migrate to a VPC endpoint for DynamoDB and use native IPv6 support for communication to S3. Biasing toward extremely low overhead IPv6 traffic with higher overhead for IPv4 Internet traffic is the right future direction.
cni-ipvlan-vpc-k8s is used in production at Lyft with cri-o for non-GPU workloads and Docker with nvidia-docker for GPU workloads.
Note that for cri-o, manage_network_ns_lifecycle must be set to true.
The kubelet process must be started with the --node-ip option if you also use --cloud-provider=aws. Use the primary IP on the boot ENI adapter (eth0).
AWS permissions allowing at least these actions on the Kubelet role:
"ec2:DescribeSubnets"
"ec2:AttachNetworkInterface"
"ec2:AssignPrivateIpAddresses"
"ec2:UnassignPrivateIpAddresses"
"ec2:CreateNetworkInterface"
"ec2:DescribeNetworkInterfaces"
"ec2:DetachNetworkInterface"
"ec2:DeleteNetworkInterface"
"ec2:ModifyNetworkInterfaceAttribute"
"ec2:DescribeInstanceTypes"
"ec2:DescribeVpcs"
"ec2:DescribeVpcPeeringConnections"
ec2:DescribeVpcs is required for m5 and c5 instances because the AWS metadata server does not return the secondary CIDR block on these instance types. This requirement will be removed when the issue is fixed.
ec2:DescribeVpcPeeringConnections is only required if routeToVpcPeers is enabled on the plugin.
See Security Considerations below for more on the implications of these permissions.
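The action list above can be turned into an IAM policy document along the following lines. This is a minimal sketch: the function name is hypothetical, and the wildcard Resource is an assumption made for brevity rather than a recommendation from this project. The sketch drops ec2:DescribeVpcPeeringConnections unless routeToVpcPeers is in use, as described above.

```python
import json

# Actions listed in this README for the Kubelet role.
KUBELET_ACTIONS = [
    "ec2:DescribeSubnets",
    "ec2:AttachNetworkInterface",
    "ec2:AssignPrivateIpAddresses",
    "ec2:UnassignPrivateIpAddresses",
    "ec2:CreateNetworkInterface",
    "ec2:DescribeNetworkInterfaces",
    "ec2:DetachNetworkInterface",
    "ec2:DeleteNetworkInterface",
    "ec2:ModifyNetworkInterfaceAttribute",
    "ec2:DescribeInstanceTypes",
    "ec2:DescribeVpcs",
    "ec2:DescribeVpcPeeringConnections",
]


def kubelet_policy(route_to_vpc_peers: bool = False) -> dict:
    """Build an IAM policy document for the Kubelet role."""
    actions = [
        action
        for action in KUBELET_ACTIONS
        if route_to_vpc_peers or action != "ec2:DescribeVpcPeeringConnections"
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Assumption: Resource "*" is used only to keep the sketch short.
            {"Effect": "Allow", "Action": actions, "Resource": "*"}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(kubelet_policy(route_to_vpc_peers=True), indent=2))
```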
cni-ipvlan-vpc-k8s requires dep for dependency management. Please see https://github.com/golang/dep#setup for build instructions. In a pinch, you may go get -u github.com/golang/dep/cmd/dep.
go get github.com/lyft/cni-ipvlan-vpc-k8s
cd $GOPATH/src/github.com/lyft/cni-ipvlan-vpc-k8s
make build
This example CNI conflist creates Pod IPs on the secondary and above ENI adapters and chains with the upstream ipvlan plugin (0.7.0 or later required) and the cni-ipvlan-vpc-k8s-unnumbered-ptp plugin to create unnumbered point-to-point links back to the default namespace from each Pod. New interfaces will be attached to subnets tagged with kubernetes_kubelet = true, and created with the defined security groups.
Routes are automatically formed for the VPC on the ipvlan adapter.
ipMasq is enabled to use the host IP for egress to the Internet as well as providing access to services such as kube2iam. kube2iam is not a dependency of this software.
{
    "cniVersion": "0.3.1",
    "name": "cni-ipvlan-vpc-k8s",
    "plugins": [
        {
            "cniVersion": "0.3.1",
            "type": "cni-ipvlan-vpc-k8s-ipam",
            "interfaceIndex": 1,
            "subnetTags": {
                "kubernetes_kubelet": "true"
            },
            "secGroupIds": [
                "sg-1234",
                "sg-5678"
            ]
        },
        {
            "cniVersion": "0.3.1",
            "type": "cni-ipvlan-vpc-k8s-ipvlan",
            "mode": "l2"
        },
        {
            "cniVersion": "0.3.1",
            "type": "cni-ipvlan-vpc-k8s-unnumbered-ptp",
            "hostInterface": "eth0",
            "containerInterface": "eth1",
            "ipMasq": true
        }
    ]
}
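Before deploying a conflist like the one above, it can help to sanity-check the plugin chain. The sketch below is a hypothetical pre-flight check, not something the plugins do themselves. It verifies that the three plugin types appear in the order used in the example and that interfaceIndex keeps Pod IPs off the boot ENI. Treating a missing interfaceIndex as 0 is an assumption.

```python
import json
from typing import List

# Plugin order as it appears in the example conflist.
EXPECTED_CHAIN = [
    "cni-ipvlan-vpc-k8s-ipam",
    "cni-ipvlan-vpc-k8s-ipvlan",
    "cni-ipvlan-vpc-k8s-unnumbered-ptp",
]


def check_conflist(text: str) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    conf = json.loads(text)
    plugins = conf.get("plugins", [])
    problems = []

    types = [plugin.get("type") for plugin in plugins]
    if types != EXPECTED_CHAIN:
        problems.append("unexpected plugin chain: %r" % (types,))

    for plugin in plugins:
        if plugin.get("type") != EXPECTED_CHAIN[0]:
            continue
        # Assumption: an absent interfaceIndex behaves like 0 (the boot ENI).
        if plugin.get("interfaceIndex", 0) < 1:
            problems.append("interfaceIndex < 1 would place Pod IPs on the boot ENI")

    return problems


if __name__ == "__main__":
    sample = json.dumps({"plugins": [{"type": t} for t in EXPECTED_CHAIN]})
    for problem in check_conflist(sample):
        print("warning:", problem)
```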
In the above cni-ipvlan-vpc-k8s-ipam config, several options are available:
interfaceIndex: We recommend never using the boot ENI adapter with this plugin (though it is possible). By setting interfaceIndex to 1, the plugin will only allocate IPs (and add new adapters) starting at eth1.
subnetTags: When allocating new adapters, by default the plugin will use all available subnets within the availability zone. You can restrict which subnets the plugin will use by specifying key/value tags that a subnet must match in order to be considered. These tags are set via the AWS API or in the AWS Console on the subnet object.
secGroupIds: When allocating a new ENI adapter, these security groups will be assigned to the adapter. Specify the sg-xxxx security group ID.
skipDeallocation: true or false. When set to true, this plugin will never remove a secondary IP address from an adapter. Useful in workloads that churn many pods, to reduce pressure on the AWS rate limits for configuring the VPC (which are low and cannot be raised above a certain threshold).
routeToVpcPeers: true or false. When set to true, the plugin will make a (cached) call to DescribeVpcPeeringConnections to enumerate all peered VPCs. Routes will be added so connections to these VPCs will be sourced from the IPvlan adapter in the pod and not through the host masquerade.
routeToCidrs: List of CIDRs. Routes will be added so connections to these CIDRs will be sourced from the IPvlan adapter in the pod and not through the host masquerade.
reuseIPWait: Seconds to wait before free IP addresses are made available for reuse by Pods. Defaults to 60 seconds. reuseIPWait functions both as a lock to prevent addresses from being grabbed by Pods spinning up in between the stages of chained CNI plugin execution and as a method of delaying when a new Pod can grab the IP address of a terminating Pod.
As new Pods are created, if needed, secondary IP addresses are added to secondary ENI adapters until they reach capacity. A lightweight file-based registry stores hints containing free IP addresses available to the instance to prevent unnecessary churn from adding and removing IPs to and from ENI adapters, which is a fairly heavyweight AWS process. By default, free IP addresses are made available for reuse by Pods after being unused for at least 60 seconds. To handle cases where IPs are not frequently reused by Pods, and an excess of free IP addresses becomes available on an instance, a systemd timer is recommended to garbage collect these old IPs.
Sample cni-gc.service:
[Unit]
Description=Garbage collect IPs unused for 15 minutes
[Service]
Type=oneshot
ExecStart=/usr/local/bin/cni-ipvlan-vpc-k8s-tool registry-gc --free-after=15m
Sample cni-gc.timer:
[Unit]
Description=Run cni-gc every 5 minutes
[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
[Install]
WantedBy=timers.target
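The interplay between the reuse wait and the periodic garbage collection can be sketched with a small in-memory model. This is only an illustration of the timing rules described above. The class and method names are hypothetical, the real registry is file-based, and the real tool also has to unassign the collected IPs from their ENIs.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


class FreeIPRegistry:
    """In-memory stand-in for the registry of free IP hints."""

    def __init__(self, reuse_wait: timedelta = timedelta(seconds=60)) -> None:
        self.reuse_wait = reuse_wait
        self._freed_at: Dict[str, datetime] = {}

    def release(self, ip: str, now: datetime) -> None:
        """Record that a Pod stopped using this IP at time `now`."""
        self._freed_at[ip] = now

    def take(self, now: datetime) -> Optional[str]:
        """Hand out the longest-idle IP that has passed the reuse wait."""
        for ip, freed in sorted(self._freed_at.items(), key=lambda kv: kv[1]):
            if now - freed >= self.reuse_wait:
                del self._freed_at[ip]
                return ip
        return None

    def gc(self, now: datetime, free_after: timedelta) -> List[str]:
        """Drop and return IPs idle for at least `free_after`."""
        stale = [ip for ip, freed in self._freed_at.items()
                 if now - freed >= free_after]
        for ip in stale:
            del self._freed_at[ip]
        return stale


if __name__ == "__main__":
    start = datetime(2018, 1, 1)
    registry = FreeIPRegistry()
    registry.release("10.0.0.5", start)
    print(registry.take(start + timedelta(seconds=30)))  # still inside the wait
    print(registry.take(start + timedelta(seconds=90)))  # eligible for reuse
```

In this model, an IP that no Pod picks up stays in the registry until a garbage collection pass finds it idle for longer than the chosen interval, which matches the role of the 15-minute --free-after value in the sample unit.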
This plugin ships a CLI tool which can be useful to inspect the state of the system or perform certain actions (such as provisioning an adapter at instance cloud-init time).
Run cni-ipvlan-vpc-k8s-tool --help for a complete listing of options.
NAME:
   cni-ipvlan-vpc-k8s-tool - Interface with ENI adapters and CNI bindings for those
USAGE:
   cni-ipvlan-vpc-k8s-tool [global options] command [command options] [arguments...]
VERSION:
   v-next
COMMANDS:
   new-interface             Create a new interface
   remove-interface          Remove an existing interface
   deallocate                Deallocate a private IP
   allocate-first-available  Allocate a private IP on the first available interface
   free-ips                  List all currently unassigned AWS IP addresses
   eniif                     List all ENI interfaces and their setup with addresses
   addr                      List all bound IP addresses
   subnets                   Show available subnets for this host
   limits                    Display limits for ENI for this instance type
   bugs                      Show any bugs associated with this instance
   vpccidr                   Show the VPC CIDRs associated with current interfaces
   vpcpeercidr               Show the peered VPC CIDRs associated with current interfaces
   registry-list             List all known free IPs in the internal registry
   registry-gc               Free all IPs that have remained unused for a given time interval
   help, h                   Shows a list of commands or help for one command
GLOBAL OPTIONS:
   --help, -h     show help
   --version, -v  print the version
COPYRIGHT:
   (c) 2017-2018 Lyft Inc.
In Kubernetes, pods and kubelets are assumed to have static IP addresses that are assigned for the lifetime of the object. However, the EC2 IAM permissions required by cni-ipvlan-vpc-k8s enable authorized principals to manipulate network interfaces and IP addresses, which could be used to remap IP addresses and "take over" the IP address of an existing pod or kubelet. Such an IP address takeover could allow impersonation of a pod or kubelet at the network layer, and disrupt the availability of your Kubernetes cluster.
IP address takeovers are possible in the following situations:
cni-ipvlan-vpc-k8s with the required IAM permissions.
Consider taking the following actions to reduce the likelihood and impact of IP takeover attacks: