Closed. doryer closed this issue 5 months ago.
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
What does the node's provider ID look like?
> What does the node's provider ID look like?
aws:///eu-west-1a/
That's definitely the problem. Do you know what's setting the provider ID on your nodes?
> That's definitely the problem. Do you know what's setting the provider ID on your nodes?
We're running the cluster with kOps, so from the docs it seems the Node controller inside kops-controller is the one applying this to the k8s node: https://kops.sigs.k8s.io/architecture/kops-controller/. It should also contain the instance-id, so it looks like aws:///eu-west-1a/ Anyway, what should it be to be valid?
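For reference, a valid AWS providerID appears to take the shape aws:///&lt;availability-zone&gt;/&lt;instance-id&gt;, and the value quoted above is missing the instance-id segment. A minimal sketch of such a format check follows; the regex is an illustration based on that assumed shape, not cloud-provider-aws's actual parsing code:

```python
import re

# Assumed shape of a valid AWS providerID: aws:///<availability-zone>/<instance-id>.
# This pattern is illustrative only, not the provider's real parser.
PROVIDER_ID_RE = re.compile(r"^aws:///(?P<az>[a-z0-9-]+)/(?P<instance_id>i-[0-9a-f]+)$")

def parse_provider_id(provider_id: str):
    """Return (az, instance_id) if the providerID looks valid, else None."""
    m = PROVIDER_ID_RE.match(provider_id)
    return (m.group("az"), m.group("instance_id")) if m else None

# A complete providerID parses; one with a missing instance id (as quoted
# above) does not, which matches the "Invalid format for AWS instance ()" error.
print(parse_provider_id("aws:///eu-west-1a/i-0123456789abcdef0"))
print(parse_provider_id("aws:///eu-west-1a/"))
```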
The log message you included looks like the provider wasn't defined at all (name is blank): https://github.com/kubernetes/cloud-provider-aws/blob/a1eb96d8ee3baffa8450e870c7360afa6ca836d2/pkg/providers/v1/instances.go#L82
I don't know how the provider ID is set in a kops cluster.
/kind support
Ok, this helped me understand the issue. providerID is a field added by kops-controller to every node that joins the cluster. For each instance that joined the cluster, we saw the error log a moment before it joined, so the node lifecycle controller is probably checking the node before the providerID has been added. After kops-controller adds the providerID to the node, the errors disappear. Maybe adding retries on the providerID check to the node lifecycle controller, for environments managed by kOps, could solve the issue.
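The retry suggested above could look something like the sketch below. This is not the node lifecycle controller's actual code; get_provider_id is a hypothetical callable standing in for a fresh read of the node's spec.providerID, which is empty until kops-controller patches the node:

```python
import time

def wait_for_provider_id(get_provider_id, timeout=30.0, interval=1.0, sleep=time.sleep):
    """Poll a node's providerID until it is set or the timeout expires.

    get_provider_id: hypothetical callable returning the node's current
    providerID string ('' while kops-controller has not patched the node yet).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        provider_id = get_provider_id()
        if provider_id:
            return provider_id
        sleep(interval)
    raise TimeoutError("node providerID was never set")

# Simulate kops-controller patching the node after two empty reads.
values = iter(["", "", "aws:///eu-west-1a/i-0123456789abcdef0"])
print(wait_for_provider_id(lambda: next(values), sleep=lambda _: None))
```

Retrying with a bounded timeout keeps the controller from logging a hard error on a node that simply has not been patched yet, while still surfacing a failure if the providerID never appears.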
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
@cartermckinnon is it possible to /reopen this issue, please? We are seeing the same error messages in our clusters deployed with kubeadm, and it seems related to the explanation above (a race condition between the tool setting the providerID on the node, kops or kubelet, and the node controller).
@yogeek do you mind opening a fresh issue with exact details from your clusters? (then cross link this issue from there)
What happened:
We've upgraded to k8s 1.25.15 and installed the AWS CCM in our cluster using kOps. Since the upgrade, we periodically see these error logs from the AWS CCM:
E1102 13:36:37.503965 1 node_lifecycle_controller.go:185] error checking if node <instance-id> is shutdown: Invalid format for AWS instance () }
We see this when a node comes up and joins the cluster: the node becomes Ready, and it does not seem to affect scheduling, but we are still not sure why we see these errors.
What you expected to happen:
No errors on those nodes, or a more detailed error log.
How to reproduce it (as minimally and precisely as possible):
Running cloud controller manager version v1.25.12 on k8s 1.25.15.
Environment:
- Kubernetes version (use kubectl version): v1.25.15
- Kernel (e.g. uname -a): 5.15.0-1037-aws
/kind bug