lyft / cni-ipvlan-vpc-k8s

AWS VPC Kubernetes CNI driver using IPvlan
Apache License 2.0

L2 mode versus L3 mode, masquerading and security groups #21

Closed: lbernail closed this issue 6 years ago

lbernail commented 6 years ago

Hi,

I've been looking in detail at your plugin and it looks really great.

I have a few questions on the design:

- The plugin relies on L2 mode (efficient and simple) but would probably work great in L3 mode (even if it would require some ip rules to force traffic through the appropriate ENI). Is there a particular reason for this choice?
- Traffic not going to the VPC CIDR block is masqueraded on the main interface. Is that necessary? Having the pod IP address in VPC Flow Logs could be interesting, and it could also be interesting over VPC peerings (of course that would involve NAT to access the internet).
- In addition, it would be great if different ENIs could have different security groups and pods could be assigned to ENIs based on security requirements (using annotations for instance).

Thanks again for this plugin, it looks really promising.

Laurent

paulnivin commented 6 years ago

I've been looking in detail at your plugin and it looks really great.

Thanks! Sorry for the lag in following up. Replying to each section:

The plugin relies on L2 mode (efficient and simple) but would probably work great in L3 mode (even if it would require some ip rules to force traffic through the appropriate ENI). Is there a particular reason for this choice?

When requesting a secondary private IP address on an ENI, the only guarantee is that the IP will be within the subnet assigned to the ENI. All traffic that leaves the ENI must have the same MAC address. There's no ability to assign a contiguous / routable block of IP addresses suitable for using L3 mode on an ENI, unless you dedicate an entire VPC subnet to a single ENI. If you do that, you'll quickly run out of subnets.
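
For reference, this is roughly what the allocation looks like with the AWS CLI (the ENI ID below is just a placeholder): you can ask for a count, or for specific free addresses in the ENI's subnet, but nothing that would give you a contiguous, routable block.

# Request one additional secondary private IP on an ENI; the address that
# comes back is any free IP in the ENI's subnet, with no contiguity guarantee.
aws ec2 assign-private-ip-addresses \
    --network-interface-id eni-0123456789abcdef0 \
    --secondary-private-ip-address-count 1

# List the private IPs now attached to the ENI.
aws ec2 describe-network-interfaces \
    --network-interface-ids eni-0123456789abcdef0 \
    --query 'NetworkInterfaces[0].PrivateIpAddresses[].PrivateIpAddress'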

Traffic not going to the VPC CIDR block is masqueraded on the main interface. Is that necessary? Having the pod IP address in VPC Flow Logs could be interesting, and it could also be interesting over VPC peerings (of course that would involve NAT to access the internet).

Traffic not going to the VPC CIDR block is routed back over the default namespace so we can make use of Amazon’s Public IPv4 addressing attribute feature when egressing over the primary private IP of the boot ENI. I'll file a bug to add VPC peering routes to the Pod route table so we'll use the IPvlan interface for that traffic. That should be a simple change.
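
For illustration, that change would boil down to adding something like this to the Pod's namespace when it is set up (the peered CIDR 10.1.0.0/16, the gateway and the interface name are placeholders, not what the plugin actually uses):

# Hypothetical example: send traffic for a peered VPC's CIDR over the Pod's
# IPvlan interface instead of falling back to the default route through the
# host namespace. Addresses and names are placeholders.
sudo ip netns exec pod-namespace \
    ip route add 10.1.0.0/16 via 172.29.0.1 dev eth0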

In addition, it would be great if different ENIs could have different security groups and pods could be assigned to ENIs based on security requirements (using annotations for instance)

We use security groups to segment the control plane, which always runs on the boot ENI, from Pods running on non-boot ENIs. We don't currently have a use case for scheduling against different ENIs, although we wouldn't be opposed to a PR if there's a straightforward way to handle that case. We're planning to add NetworkPolicy support to our CNI stack using netfilter rules and let the kernel enforce restrictions between Pods instead of relying on security groups to control Pod-to-Pod communication.
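
As a rough sketch of what that kernel enforcement could look like (not the actual implementation; the namespace name and addresses are made up), ingress restrictions applied inside a Pod's network namespace might be:

# Allow return traffic for connections the Pod initiated.
sudo ip netns exec pod-namespace iptables -A INPUT \
    -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Only let 172.29.0.50 reach this Pod; drop everything else.
sudo ip netns exec pod-namespace iptables -A INPUT -s 172.29.0.50/32 -j ACCEPT
sudo ip netns exec pod-namespace iptables -A INPUT -j DROP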

Let us know if you have any other questions.

lbernail commented 6 years ago

Sorry, I also lagged in answering back. Here is the kind of L3 mode setup I had in mind:

# Create the two test namespaces first
sudo ip netns add test1
sudo ip netns add test2

# Create two ipvlan sub-interfaces on eth1 in L3 mode and move them into the namespaces
sudo ip link add link eth1 name ipvltest1 type ipvlan mode l3
sudo ip link add link eth1 name ipvltest2 type ipvlan mode l3
sudo ip link set dev ipvltest1 netns test1
sudo ip link set dev ipvltest2 netns test2

# Address each sub-interface and add a default route inside its namespace
sudo ip netns exec test1 ip addr add 172.29.0.100/24 dev ipvltest1
sudo ip netns exec test1 ip link set ipvltest1 up
sudo ip netns exec test1 ip route add default via 172.29.0.1

sudo ip netns exec test2 ip addr add 172.29.0.101/24 dev ipvltest2
sudo ip netns exec test2 ip link set ipvltest2 up
sudo ip netns exec test2 ip route add default via 172.29.0.1

# Dedicated routing table for the additional ENI (eth1)
echo "1 eni1" | sudo tee -a /etc/iproute2/rt_tables
sudo ip route add 172.29.0.1/32 scope link dev eth1 table eni1
sudo ip route add default via 172.29.0.1 dev eth1 table eni1

# Policy routing rules so traffic sourced from the container addresses uses the eni1 table
sudo ip rule add from 172.29.0.100/32 table eni1
sudo ip rule add from 172.29.0.101/32 table eni1

Where test1 and test2 are two different container namespaces, and 172.29.0.100 / 172.29.0.101 are secondary IP addresses associated with the additional ENI (eth1).

This setup is very similar to the L2 mode one, but requires the addition of the ip rules to force traffic from the container namespaces through the additional ENI. Traffic between namespaces works even if it is not routed by the IPvlan master interface, because both sub-interfaces are in the same subnet.
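
For what it's worth, a quick way to check that behaviour with the setup above:

# Same-subnet traffic between the two sub-interfaces should work directly,
# without going through the gateway.
sudo ip netns exec test1 ping -c 1 172.29.0.101

# The policy rules and the dedicated table that steer traffic sourced from
# the container addresses out of eth1:
sudo ip rule show
sudo ip route show table eni1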