Invalid format for AWS instance

doryer commented 10 months ago

What happened:

We've upgraded to k8s 1.25.15 and installed the AWS CCM in our cluster using kOps. Since the upgrade, we periodically see these error logs from the AWS CCM:

E1102 13:36:37.503965 1 node_lifecycle_controller.go:185] error checking if node <instance-id> is shutdown: Invalid format for AWS instance () }

We see this when a node comes up and joins the cluster. The node becomes Ready and it doesn't seem to affect scheduling on the node, but we're still not sure why we see these errors.

What you expected to happen:

Not seeing errors on those nodes, or a more detailed error log.

How to reproduce it (as minimally and precisely as possible):

Running cloud controller manager with version v1.25.12 in k8s 1.25.15

Environment:

Kubernetes version (use kubectl version): v1.25.15
Cloud provider or hardware configuration: AWS EC2
OS (e.g. from /etc/os-release): Ubuntu 20.04.6
Kernel (e.g. uname -a): 5.15.0-1037-aws
Install tools: kOps
Others:

/kind bug

k8s-ci-robot commented 10 months ago

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

cartermckinnon commented 10 months ago

What does the node's provider ID look like?
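For readers following along, one way to answer this is to read `spec.providerID` from each Node object. The sketch below assumes the output of `kubectl get nodes -o json` is available as a string; the function name `provider_ids` is made up for illustration and is not part of any Kubernetes tooling.

```python
import json
from typing import Dict


def provider_ids(node_list_json: str) -> Dict[str, str]:
    """Map node name -> spec.providerID ("" when the field is not set yet)."""
    node_list = json.loads(node_list_json)
    return {
        item["metadata"]["name"]: item.get("spec", {}).get("providerID", "")
        for item in node_list.get("items", [])
    }

# Example use (assumes kubectl is configured for the cluster):
#   kubectl get nodes -o json > nodes.json
#   print(provider_ids(open("nodes.json").read()))
```

A node that shows an empty value here has not had its provider ID set yet.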

doryer commented 10 months ago

> What does the node's provider ID look like?

aws:///eu-west-1a/

cartermckinnon commented 10 months ago

That's definitely the problem. Do you know what's setting the provider ID on your nodes?

doryer commented 10 months ago

> That's definitely the problem. Do you know what's setting the provider ID on your nodes?

We're running the cluster with kOps, so from the docs it seems the node controller inside kops-controller is the one applying this to the k8s node: https://kops.sigs.k8s.io/architecture/kops-controller/. It also contains the instance ID, so it looks like aws:///eu-west-1a/ Anyway, what should it be to be valid?

cartermckinnon commented 10 months ago

The log message you included looks like the provider wasn't defined at all (name is blank): https://github.com/kubernetes/cloud-provider-aws/blob/a1eb96d8ee3baffa8450e870c7360afa6ca836d2/pkg/providers/v1/instances.go#L82

I don't know how the provider ID is set in a kops cluster.
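To illustrate why a blank or incomplete provider ID ends in this error, here is a simplified sketch of the kind of check involved. It is not the code at the link above (that is Go); the function name `instance_id_from_provider_id` and the instance ID pattern (`i-` followed by hex digits) are assumptions made for the example. It expects the `aws:///<zone>/<instance-id>` shape discussed in this thread.

```python
import re
from urllib.parse import urlparse

# Assumption: EC2 instance IDs look like "i-" followed by hex digits.
_INSTANCE_ID = re.compile(r"^i-[0-9a-f]+$")


def instance_id_from_provider_id(provider_id: str) -> str:
    """Extract the instance ID from aws:///<zone>/<instance-id>.

    Raises ValueError, with a message shaped like the one in the log above,
    when no instance ID can be found.
    """
    parsed = urlparse(provider_id)
    # The last non-empty path segment is expected to be the instance ID.
    segments = [s for s in parsed.path.split("/") if s]
    candidate = segments[-1] if segments else ""
    if parsed.scheme != "aws" or not _INSTANCE_ID.match(candidate):
        raise ValueError(f"Invalid format for AWS instance ({provider_id})")
    return candidate
```

With an empty string the message comes out with empty parentheses, which is consistent with the reading that the name was blank when the check ran.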

/kind support

doryer commented 10 months ago

> The log message you included looks like the provider wasn't defined at all (name is blank):
>
> https://github.com/kubernetes/cloud-provider-aws/blob/a1eb96d8ee3baffa8450e870c7360afa6ca836d2/pkg/providers/v1/instances.go#L82
>
> I don't know how the provider ID is set in a kops cluster.
>
> /kind support

OK, that helped me understand the issue. providerID is a field that kops-controller adds to every node that joins the cluster. For each instance that joined, we saw the error log a moment before it joined the cluster, so the node lifecycle controller is probably checking the node before the providerID has been added. After kops-controller added the providerID to the node, the errors disappeared. Maybe adding retries to the node lifecycle controller's providerID check, for environments managed by kOps, could solve the issue.
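A rough sketch of the retry idea, for illustration only. It is not a patch for the node lifecycle controller, and `wait_for_provider_id`, its parameters, and the `get_provider_id` callback are all hypothetical names. The idea is to poll until the node has a non-empty providerID before running any check that depends on it.

```python
import time
from typing import Callable, Optional


def wait_for_provider_id(
    get_provider_id: Callable[[], str],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the node has a non-empty providerID, or give up.

    get_provider_id is a hypothetical callback that returns the node's
    current spec.providerID ("" while it is still unset).
    """
    for attempt in range(attempts):
        provider_id = get_provider_id()
        if provider_id:
            return provider_id
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None
```

Returning `None` instead of raising lets the caller decide whether a still-missing providerID deserves an error log or only a quieter message while the node is joining.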

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 5 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


yogeek commented 2 months ago

@cartermckinnon is it possible to /reopen this issue, please? We are seeing the same error messages in our clusters deployed with kubeadm, and it seems related to the explanation above (a race condition between the tool setting the providerID on the node, kops or kubelet, and the node controller).

dims commented 2 months ago

@yogeek do you mind opening a fresh issue with exact details from your clusters? (then cross link this issue from there)

kubernetes / cloud-provider-aws

Invalid format for AWS instance #724