@polarbizzle Was this on c5 / m5 hardware or a different instance family?
We generally prefer metadata lookups, as they don't require any account scopes and spread any rate-limiting behavior of the AWS APIs out across the instances. However, a blank routes entry is very much a problem. Perhaps we should error in this case and let the CNI driver retry creating the namespace?
Sorry, didn't see #37 - this approach sounds good to me for handling issues with the metadata service. On c5/m5, metadata population and VPC bringup are handled asynchronously from instance boot, which does lead to errors like this (as well as to not being able to use the network interface at all).
@theatrus thanks for the response; that makes sense to me. We're actually using m4.xlarge instances for our nodes.
I installed the plugin in our dev cluster, and after much whack-a-mole rescheduling of pods that couldn't talk to other pods, things settled down. The problem pods kept coming back when new nodes were added to the cluster. We use kops, and its rolling updates cause a lot of pod rescheduling. Eventually I looked closer and found that the trouble pods didn't have any routes for the VPC ranges at all. I also noticed that those pods were always the first pod assigned to a new ENI, whether that was on a new node or just the next ENI needed on a busy node.
Looking at the code, it seems that this lag between a new ENI appearing and the EC2 metadata service results being fully populated is a known thing. The IPAM plugin depends on the ranges from the vpc-ipv4-cidr-blocks section of the metadata to set up the correct routes for the veth interfaces. However, parsing the vpc-ipv4-cidr-blocks response doesn't return an error if no CIDRs are found, which results in an interface with no routes to the VPC.
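To illustrate what I mean (this is not the plugin's actual code, just a minimal sketch; the helper name is made up and the path follows the standard EC2 instance metadata layout), a lookup like this happily returns an empty slice with no error when the metadata hasn't been populated yet:

```go
package metadata

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"
)

// metadataBase is the standard EC2 instance metadata endpoint.
const metadataBase = "http://169.254.169.254/latest/meta-data"

// vpcIPv4CIDRBlocks fetches the vpc-ipv4-cidr-blocks entry for a given
// interface MAC. Note: an empty response body parses into an empty slice
// without any error, which is the gap described above.
func vpcIPv4CIDRBlocks(mac string) ([]string, error) {
	url := fmt.Sprintf("%s/network/interfaces/macs/%s/vpc-ipv4-cidr-blocks", metadataBase, mac)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	// The metadata service returns one CIDR per line.
	var cidrs []string
	for _, line := range strings.Split(strings.TrimSpace(string(body)), "\n") {
		if line != "" {
			cidrs = append(cidrs, line)
		}
	}
	return cidrs, nil // no error even when cidrs is empty
}
```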
So one approach to solving that could be simply checking the length of the vpc-ipv4-cidr-blocks slice and returning an error if it's 0. That would cause retries against the metadata service until it returns a result. I'm somewhat concerned about that solution because I wonder whether there is an intermediate state where some but not all of the VPC ranges are returned, in which case we'd be back where we started with required routes missing.
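Roughly, building on the hypothetical helper above, the check I have in mind would look something like this:

```go
cidrs, err := vpcIPv4CIDRBlocks(mac)
if err != nil {
	return err
}
// Treat an empty result as an error so the caller (and ultimately the CNI
// runtime) retries instead of configuring an interface with no VPC routes.
if len(cidrs) == 0 {
	return fmt.Errorf("metadata returned no vpc-ipv4-cidr-blocks for %s; metadata may not be populated yet", mac)
}
```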
Another way of solving it would be to use the DescribeVpcs API to get the VPC ranges all the time instead of relying on the metadata service. It appears there is some desire not to use the DescribeVpcs API call to get the VPC CIDRs, and I'd like more context from those who know about that preference. We already need quite a few IAM permissions to make the plugin work, so DescribeVpcs doesn't seem that onerous as an addition to the list. Maybe it's just to have one less place config data is extracted from?
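For reference, a rough sketch of that alternative using aws-sdk-go's DescribeVpcs (again, the function name and wiring are my own, not the plugin's code, and it assumes the node role is granted ec2:DescribeVpcs):

```go
package metadata

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// vpcCIDRsFromAPI looks up the VPC's CIDR blocks via the EC2 API instead of
// the instance metadata service.
func vpcCIDRsFromAPI(vpcID string) ([]string, error) {
	sess, err := session.NewSession()
	if err != nil {
		return nil, err
	}
	svc := ec2.New(sess)

	out, err := svc.DescribeVpcs(&ec2.DescribeVpcsInput{
		VpcIds: []*string{aws.String(vpcID)},
	})
	if err != nil {
		return nil, err
	}
	if len(out.Vpcs) == 0 {
		return nil, fmt.Errorf("no VPC found for %s", vpcID)
	}

	// Collect every CIDR associated with the VPC, not just the primary one.
	var cidrs []string
	for _, assoc := range out.Vpcs[0].CidrBlockAssociationSet {
		if assoc.CidrBlock != nil {
			cidrs = append(cidrs, *assoc.CidrBlock)
		}
	}
	return cidrs, nil
}
```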