kubernetes / cloud-provider-aws

Cloud provider for AWS
https://cloud-provider-aws.sigs.k8s.io/
Apache License 2.0

Multiple ENIs is confusing cloud-provider-aws controller #890

Open MadJlzz opened 3 months ago

MadJlzz commented 3 months ago

What happened:

I am deploying a Kubernetes cluster using Cluster API, with amazon-vpc-cni as the cluster's network plugin.

During my tests I observed a pretty strange behaviour of the cloud-provider-aws controller.

The Kubernetes Node object's internal IP changed from the private IP of the instance's original ENI (the one provisioned when the EC2 instance was created) to the private IP of the ENI provisioned by the AWS VPC CNI controller. During my tests, this behaviour also appeared to be random.

This leads to many problems. For example, commands such as kubectl logs or kubectl exec fail to return results, because kube-apiserver forwards those requests to the node hosting the pod using the internal IP taken from the Node resource.

What I cannot explain, though, is why this secondary private IP, attached to the same instance, does not answer those calls properly even though the firewall allowed all traffic from any source.

I worked around the issue by looking up the node's primary IP at runtime and passing it to the kubelet via the --node-ip flag before starting it.

To make sure cloud-provider-aws doesn't override this, I also set the --allocate-node-cidrs=false flag.
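For readers who want to reproduce the workaround, here is a minimal Python sketch of the idea. It is not the script I actually ran. It assumes the node can reach the EC2 instance metadata service (IMDSv2) and that the `local-ipv4` metadata key returns the primary private IPv4 address of the instance. The function names `fetch_primary_ipv4` and `kubelet_node_ip_args` are hypothetical.

```python
import ipaddress
import urllib.request

# Link-local address of the EC2 instance metadata service.
IMDS = "http://169.254.169.254/latest"


def fetch_primary_ipv4(timeout=2.0):
    """Ask IMDSv2 for the instance's primary private IPv4 address."""
    # IMDSv2 requires a short-lived session token, obtained with a PUT.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    ip_req = urllib.request.Request(
        f"{IMDS}/meta-data/local-ipv4",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(ip_req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def kubelet_node_ip_args(fetch=fetch_primary_ipv4):
    """Build the extra kubelet argument that pins the node IP.

    `fetch` is injectable so the function can be exercised off-instance.
    """
    ip = fetch()
    # Raises ValueError if the metadata service returned something odd.
    ipaddress.ip_address(ip)
    return [f"--node-ip={ip}"]
```

On an instance, the result of `kubelet_node_ip_args()` would be appended to the kubelet's arguments before the kubelet is started. How those arguments are wired in (systemd drop-in, kubeadm configuration, bootstrap script) depends on the image and is not shown here.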

What you expected to happen:

Once the Node object's internal IP is set, it should not be replaced by the IP of another ENI. Alternatively, if using the other IP is supposed to work, then this is a networking problem for the CNI team.

Anything else we need to know?:

Here's a screenshot that shows the behaviour. The top pane shows the initial state; the second pane shows the changed IPs after I deployed aws-vpc-cni and the cloud-provider-aws controller.

Screenshot from 2024-03-28 16-25-41

Environment:

image:
    tag: v1.29.1

args:
  - --v=2
  - --allocate-node-cidrs=true
  - --cloud-provider=aws
  - --cluster-name="k993aws"
  - --cluster-cidr="10.0.0.0/16"
  - --configure-cloud-routes=false

/kind bug

kmala commented 1 month ago

> Once the Node object internal IP is set ; it should not be replaced by the one of the other ENI.

I don't think the IP is replaced. kubectl shows only one IP, but the node object should have all the IPs, as that is the default behavior. As of cloud provider release 1.29.3 they should be ordered by interface number: https://github.com/kubernetes/cloud-provider-aws/blob/cea2af6c7a3a6b546b62577636415e70459e1fc5/pkg/providers/v1/aws.go#L735-L757. Can you upgrade and test?

The node controller makes sure that the node object's addresses always match the addresses of the instance: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/controllers/node/node_controller.go#L193-L197.
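To illustrate what "ordered based on the interface number" means, here is a rough Python sketch. It is not the Go code linked above. It assumes ENI records shaped like the `NetworkInterfaces` entries of an EC2 DescribeInstances response (`Attachment.DeviceIndex`, `PrivateIpAddresses[].PrivateIpAddress`, `Primary`). The function name and the sample addresses are made up, and whether the real code also lists secondary addresses per ENI is not something this sketch claims.

```python
def ordered_internal_ips(network_interfaces):
    """Return private IPs ordered by ENI device index, primary address first."""

    def device_index(eni):
        return eni["Attachment"]["DeviceIndex"]

    ips = []
    for eni in sorted(network_interfaces, key=device_index):
        # sorted() is stable and False sorts before True, so the primary
        # address of each ENI comes first and the rest keep their order.
        addrs = sorted(eni["PrivateIpAddresses"], key=lambda a: not a["Primary"])
        ips.extend(a["PrivateIpAddress"] for a in addrs)
    return ips


# Hypothetical instance: the API returns the CNI-attached ENI first.
enis = [
    {
        "Attachment": {"DeviceIndex": 1},
        "PrivateIpAddresses": [
            {"PrivateIpAddress": "10.0.1.20", "Primary": True},
        ],
    },
    {
        "Attachment": {"DeviceIndex": 0},
        "PrivateIpAddresses": [
            {"PrivateIpAddress": "10.0.1.11", "Primary": False},
            {"PrivateIpAddress": "10.0.1.10", "Primary": True},
        ],
    },
]

addresses = ordered_internal_ips(enis)
```

With this ordering, the address of the instance's original interface (device index 0) is always first, regardless of the order in which the API returns the ENIs.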

> What I cannot explain though, is why this secondary private IP attached to the same instance is not answering properly those calls even though the firewall was allowing any kind of traffic from any source.

What is the error you are facing? Did you check the apiserver logs for the reason? It could also be caused by certificate verification.

cartermckinnon commented 1 month ago

Do you pass the --node-ip flag to kubelet?

MadJlzz commented 1 month ago

> What is the error you are facing? Did you check the apiserver logs for the reason? It could also be caused by certificate verification.

It's been quite some time, so I'll have to dig back into it to get more details. I had problems getting results back from commands like kubectl logs or kubectl exec, which the api-server proxies to the correct node's kubelet.

> Do you pass the --node-ip flag to kubelet?

I had to do that as a workaround, yes. The IP I set is the primary IP of the EC2 instance's initial network interface.

As soon as I have time, I'll try to gather more information and post it here.

cartermckinnon commented 1 month ago

There have been some recent discussions about --node-ip and how the external CCM should handle it. At this point, passing --node-ip to kubelet is the right thing to do, for AWS at least. Here's how we do it for the AL2-based EKS AMI: https://github.com/awslabs/amazon-eks-ami/blob/e50acfb7e6be088dde823dc80b21c50651e71b01/templates/al2/runtime/bootstrap.sh#L490-L495

More: https://github.com/kubernetes/kubernetes/pull/125337