Closed raonitimo closed 1 year ago
After restarting kubelet, it shows:
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598146 23137 server.go:413] "Kubelet version" kubeletVersion="v1.25.11-eks-a5565ad"
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598158 23137 server.go:415] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598213 23137 feature_gate.go:245] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCer
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598295 23137 feature_gate.go:245] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCer
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: W0804 04:54:52.598391 23137 plugins.go:132] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is depr
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598411 23137 aws.go:1268] Get AWS region from metadata client
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598739 23137 aws.go:1313] Zone not specified in configuration file; querying AWS metadata service
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.600302 23137 aws.go:1353] Building AWS cloudprovider
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.811489 23137 tags.go:80] AWS cloud filtering on ClusterID: release-xds-0
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.811514 23137 server.go:555] "Successfully initialized cloud provider" cloudProvider="aws" cloudConfigFile=""
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.811526 23137 server.go:993] "Cloud provider determined current node" nodeName="ip-10-145-250-1.ec2.internal"
It still fails to register the node. I see the error:
Aug 04 05:15:59 ip-10-145-250-1.ec2.internal kubelet[23137]: E0804 05:15:59.270406 23137 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes is forbidden: User \"system:node:\" cannot create resource \"nodes\" in API group \"\" at the cluster scope: unknown node for user \"system:node:\"" node="ip-10-145-250-1.ec2.internal"
It looks like the kubelet is not able to use the correct user, which should be system:node:{{EC2PrivateDNSName}}.
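For anyone trying to confirm they are hitting the same thing, here is a minimal sketch that scans kubelet log lines for this registration error where the user is a bare "system:node:" with no node name after the prefix. The function name and the regular expression are my own illustration, and they assume the escaped-quote format shown in the log line above.

```python
import re

# Captures the user name inside the escaped quotes of the kubelet error,
# e.g.  User \"system:node:\"  (the backslashes are literal in the journal output).
USER_RE = re.compile(r'User \\"(system:node:[^"\\]*)\\"')


def empty_node_user_lines(lines):
    """Return the log lines whose node user has nothing after 'system:node:'."""
    hits = []
    for line in lines:
        # Only look at the node registration failure, not other kubelet errors.
        if "Unable to register node with API server" not in line:
            continue
        match = USER_RE.search(line)
        if match and match.group(1) == "system:node:":
            hits.append(line)
    return hits
```

The input could be the saved output of `journalctl -u kubelet`, read line by line.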
I ran the bootstrap.sh script again and restarted the kubelet again, then the node joined the cluster.
An AWS support engineer linked me to this: https://github.com/kubernetes/kubernetes/pull/118421 . So, I guess the in-tree code is still used in 1.25?
Anyway, is the code being kept in-sync with the fixes?
So, I guess the in-tree code is still used in 1.25?
Yes, that's true. The switch happens with 1.27.
Anyway, is the code being kept in-sync with the fixes?
Can you explain what you meant by it?
Hey @kmala , thanks for responding.
Anyway, is the code being kept in-sync with the fixes?
Can you explain what you meant by it?
Yes, sure. I meant to ask: if this PR is merged into the in-tree code in 1.26, should we expect the same fix to be applied to this plugin and to work on version 1.27?
Looks legit.
/triage accepted
should we expect to have the same fix applied to this plugin and working on version 1.27 ?
Yes, it would be merged in this repo as far as I can tell. @cartermckinnon can comment otherwise.
That PR wouldn’t help here, because kubelet doesn’t use this code. It has to be merged to the legacy in-tree AWS cloud provider in versions prior to 1.27. I haven’t gotten much traction on that PR, so please bump it if this is a blocker for you. 😌 I’ll go ahead and get this patched in the EKS kubelet builds, at least, because we’ll be supporting 1.26 for a while.
We’re handling the PrivateDnsName quirks in 1.27+ with a hostname override: https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh#L536-537
But we still need to address the eventual consistency issue. I’ll put up a PR for that.
The proper fix will be in the aws-iam-authenticator, I think.
Hey @cartermckinnon, thanks for responding.
We’re handling the PrivateDnsName quirks in 1.27+ with a hostname override: https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh#L536-537
IIUC, this override won't fix this particular issue because the DescribeInstances call doesn't fail. It just returns an empty string.
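To make the eventual consistency point concrete, here is a minimal sketch of polling until the name is non-empty instead of accepting the first answer. It is not the kubelet or EKS patch. `lookup` is a hypothetical stand-in for whatever performs the DescribeInstances call and returns PrivateDnsName, and the timeout and interval defaults are placeholders, since the right timeout value is still being worked out in this thread.

```python
import time
from typing import Callable, Optional


def wait_for_private_dns_name(
    lookup: Callable[[], str],
    timeout_s: float = 30.0,   # placeholder value, not a recommendation
    interval_s: float = 1.0,   # placeholder value, not a recommendation
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Call lookup() until it returns a non-empty name or the timeout expires.

    An empty string is treated as "not visible yet" rather than as a final
    answer, because the call succeeds even when the name is missing.
    """
    deadline = clock() + timeout_s
    while True:
        name = lookup()
        if name:
            return name
        if clock() >= deadline:
            return None  # caller decides how to fail; do not register with an empty name
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.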
But we still need to address the eventual consistency issue. I’ll put up a PR for that.
The proper fix will be in the aws-iam-authenticator, I think.
Nice. I don't understand how the aws-iam-authenticator is related. Keen to see the PR and understand it. Please link it here.
Thanks!
IIUC, this override won't fix this particular issue because the DescribeInstances call doesn't fail. It just returns an empty string.
Correct -- I just meant to point out how we're achieving the behavior (Node name == PrivateDnsName) on kubelets that no longer use the in-tree AWS cloud provider. We still need to address the eventual consistency problem; I intended to do so when there was some consensus on the issue upstream.
I don't understand how the aws-iam-authenticator is related.
On EKS, the aws-iam-authenticator is where the PrivateDnsName requirement comes from, i.e. entries in your configmap/aws-auth like system:node:{{EC2PrivateDNSName}}.
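As an illustration of why an empty PrivateDnsName matters for that mapping, here is a small sketch using plain string substitution. It is not the aws-iam-authenticator's actual implementation, and `render_node_username` is a hypothetical helper. It only shows that an empty value leaves a bare "system:node:", which would be consistent with the user name in the error above.

```python
PLACEHOLDER = "{{EC2PrivateDNSName}}"


def render_node_username(template: str, private_dns_name: str) -> str:
    """Substitute the placeholder in an aws-auth style username template."""
    return template.replace(PLACEHOLDER, private_dns_name)


# With a populated name the user identifies a specific node.
populated = render_node_username("system:node:" + PLACEHOLDER, "ip-10-145-250-1.ec2.internal")

# With an empty name nothing follows the prefix, so no node is identified.
empty = render_node_username("system:node:" + PLACEHOLDER, "")
```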
Hey @cartermckinnon, this problem appeared again. Do you have any rough timeline for the fix? Or any pointers on how this should be fixed so someone can contribute?
kubelet builds for Kubernetes 1.23-1.26. Those patches will appear in EKS-D in the next couple weeks: https://github.com/aws/eks-distro/tree/main/projects/kubernetes/kubernetes. The patched kubelets will land in an upcoming EKS AMI release, but I can't guarantee it'll be the same release as 1. I'll reach out to the Bottlerocket folks to see what a fix looks like on their end for 1.27+.
Edit: looks like we'll need some handling here: https://github.com/bottlerocket-os/bottlerocket/blob/dea2c11949a95e914b3c72be6456606e945e0e16/sources/api/pluto/src/main.rs#L316-L332
@raonitimo I want to make sure we choose the right timeout value, so I need to track down a recent occurrence of this issue in the EC2 backend. Can you share some instance IDs? If you want to open a case with AWS Support, I can track it down 👍 .
Sorry @cartermckinnon, I haven't got a recent instance ID. When I get one, I'll raise a case with support and ping you.
We've patched in handling for this in the EKS kubelet builds, so going to close this. I think a proper fix is to remove usage of the PrivateDnsName altogether, which I'm scoping for a future EKS release.
/close
@cartermckinnon: Closing this issue.
What happened:
Some nodes fail to join the cluster; the kubelet logs show
Node events show
No other errors logged.
What you expected to happen:
Kubelet would correctly figure out the node name.
How to reproduce it (as minimally and precisely as possible):
It doesn't happen all the time and I can't correlate it with anything. It has happened on different EKS clusters in different AWS accounts.
I can see two DescribeInstances API calls in the CloudTrail event history within the same second at "2023-08-03T19:00:56Z", just like an instance that successfully joined the cluster.
Anything else we need to know?:
Environment:
Kubernetes version (kubectl version): 1.25.11
Kernel (uname -a): 5.10.184-175.731.amzn2.x86_64
Happy to provide more context and logs. The instance is still around.
/kind bug