kubernetes / cloud-provider-aws

Cloud provider for AWS
https://cloud-provider-aws.sigs.k8s.io/
Apache License 2.0
381 stars 300 forks source link

Finding the ProviderID fails when using a custom domain in the dhcp options #799

Closed cl-rf closed 7 months ago

cl-rf commented 8 months ago

The provider ID fails when using a custom domain in the dhcp options for a vpc.

What happened: I'm using a custom domain name in my dhcp options and it does not work when using either the resource id or ip name.

Resource name: I1223 03:54:46.332933 1 node_controller.go:427] Initializing node i-01bc69e5e8f27e8af.aws.example.com with cloud provider I1223 03:54:46.333105 1 log_handler.go:27] AWS request: ec2 DescribeInstances I1223 03:54:46.686666 1 log_handler.go:32] AWS API Send: ec2 DescribeInstances &{DescribeInstances POST / 0xc000338f00 <nil>} { InstanceIds: ["i-01bc69e5e8f27e8af.aws.example.com"] } I1223 03:54:46.686737 1 log_handler.go:37] AWS API ValidateResponse: ec2 DescribeInstances &{DescribeInstances POST / 0xc000338f00 <nil>} { InstanceIds: ["i-01bc69e5e8f27e8af.aws.example.com"] } 400 Bad Request E1223 03:54:46.687021 1 node_controller.go:236] error syncing 'i-01bc69e5e8f27e8af.aws.example.com': failed to get provider ID for node i-01bc69e5e8f27e8af.aws.example.com at cloudprovider: failed to get instance ID from cloud provider: getInstanceByNodeName failed for "i-01bc69e5e8f27e8af.aws.example.com" with "error listing AWS instances: \"InvalidInstanceID.Malformed: Invalid id: \\\"i-01bc69e5e8f27e8af.aws.example.com\\\"\\n\\tstatus code: 400, request id: 6f3ff90c-0460-45d5-9eef-f386faa37a3a\"", requeuing

IP name: I1223 02:08:08.026285 1 node_controller.go:427] Initializing node ip-172-31-15-250.example.com with cloud provider I1223 02:08:08.026355 1 aws.go:5089] Unable to convert node name "ip-172-31-15-250.example.com" to aws instanceID, fall back to findInstanceByNodeName: node has no providerID I1223 02:08:08.026457 1 log_handler.go:27] AWS request: ec2 DescribeInstances I1223 02:08:08.128615 1 log_handler.go:32] AWS API Send: ec2 DescribeInstances &{DescribeInstances POST / 0xc0002da190 <nil>} { Filters: [{ Name: "private-dns-name", Values: ["ip-172-31-15-250.example.com"] },{ Name: "instance-state-name", Values: [ "pending", "running", "shutting-down", "stopping", "stopped" ] }], MaxResults: 1000 } I1223 02:08:08.128691 1 log_handler.go:37] AWS API ValidateResponse: ec2 DescribeInstances &{DescribeInstances POST / 0xc0002da190 <nil>} { Filters: [{ Name: "private-dns-name", Values: ["ip-172-31-15-250.example.com"] },{ Name: "instance-state-name", Values: [ "pending", "running", "shutting-down", "stopping", "stopped" ] }], MaxResults: 1000 } 200 OK E1223 02:08:08.128918 1 node_controller.go:236] error syncing 'ip-172-31-15-250.example.com': failed to get provider ID for node ip-172-31-15-250.example.com at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing

What you expected to happen: I expect that at least the resource id would work.

How to reproduce it (as minimally and precisely as possible): Setup a vpc with custom dhcp options using a custom domain.

Anything else we need to know?: It looks like the resource id naming would work if just the "i-xxxxxxxx" was extracted from the hostname.

getInstanceByNodeName failed for "i-01bc69e5e8f27e8af.aws.example.com" with "error listing AWS instances:

Something like

diff --git a/pkg/providers/v1/aws.go b/pkg/providers/v1/aws.go

@@ -5124,6 +5124,7 @@ func nodeNameToIPAddress(nodeName string) string {

func (c *Cloud) nodeNameToInstanceID(nodeName types.NodeName) (InstanceID, error) { if strings.HasPrefix(string(nodeName), rbnNamePrefix) { +nodeName = strings.Split(nodeName, ".")[0] return InstanceID(nodeName), nil

Environment: RKE2 - v1.26.10+rke2r2 RHEL 9.3

/kind bug

k8s-ci-robot commented 8 months ago

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
cartermckinnon commented 8 months ago

There’s a few quirks when using a custom domain name in your DHCP options, because it will only be reflected in the instance’s PrivateDnsName (the filter being used to map a Node to an EC2 instance) if one of these DNS attributes of your VPC is disabled: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-support

The PrivateDnsName in the EC2 API is not always the same as the hostname from the OS (which is what kubelet will use as its node name by default).

In general, I would recommend just passing the provider ID to kubelet yourself, instead of relying on the node name -> EC2 instance lookup, as the former is much more efficient.

cartermckinnon commented 8 months ago

/kind support

cl-rf commented 8 months ago

I understand that manually specifying it would work but there is code to try to determine it would having to manually set it. The code assumes that if the hostname starts with "i-", it assumes it is the instance id. It seems like just stripping off the rest of the nodename would solve most issues, when using the the resource name in the subnet settings. Is there a reason that something like the following wouldn't work?

+++ b/pkg/providers/v1/aws.go
@@ -5124,6 +5124,7 @@ func nodeNameToIPAddress(nodeName string) string {
func (c *Cloud) nodeNameToInstanceID(nodeName types.NodeName) (InstanceID, error) {
  if strings.HasPrefix(string(nodeName), rbnNamePrefix) {
+   nodeName = strings.Split(nodeName, ".")[0]
     return InstanceID(nodeName), nil
}
cartermckinnon commented 8 months ago

Hm refreshing my memory -- previously we expected just the instance ID when resource-based naming was used, but @tzneal fixed that in: https://github.com/kubernetes/cloud-provider-aws/commit/dce57f7ca9a537506f1f8a6e069497973dc475de

Shouldn't that cover this scenario? That commit is in 1.27+ currently, feel free to cherry-pick it to 1.26 if that would fix this for you

cl-rf commented 8 months ago

Actually, yeah, I was looking in 1.27.1 and it looks like that was changed between 1.27.1 and 1.272.

cl-rf commented 8 months ago

In the compatibility matrix, I am using 1.26 and it lists 1.26.1 as the version I should be using. For 1.27, it lists 1.27.1. The latest version of cloud-provider-aws for 1.26.9 and does not have that change. If I use 1.27.2, would it still be compatible with kubernetes 1.26? It does not look like it based on the support matrix and does not look like 1.27.2 should be used.

cartermckinnon commented 8 months ago

The matrix is just out of date, sorry about that. You should always use the latest CCM release for a given k8s minor version, I'll get the README updated.

cl-rf commented 8 months ago

Ok, I tested kubernetes 2.6.11 and it does work using 1.27.2 when I create a new cluster. It doesn't match the version though. Is the fix going to be backported to the 1.26.x branch of cloud-provider-aws?

cartermckinnon commented 8 months ago

I think it makes sense to cherry-pick this, we haven't heard of it causing any breakage. Feel free to open a PR 👍

cartermckinnon commented 7 months ago

/close

k8s-ci-robot commented 7 months ago

@cartermckinnon: Closing this issue.

In response to [this](https://github.com/kubernetes/cloud-provider-aws/issues/799#issuecomment-1879265769): >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.