Closed: antonipp closed this issue 7 months ago.
This issue is currently awaiting triage.
If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Forgot to mention, but there is a workaround, which is to explicitly set providerID in the Kubelet Configuration to force it to use the proper instance ID.
/kind bug

@bobbypage @jprzychodzen who would be a good person to triage this issue?
/assign
This might be hard to change without breaking something. Does it make more sense to expose the UID in a separate method and for your software to call that instead?
> This might be hard to change without breaking something
Hm, yeah, makes sense. I just had a second look at the codebase and realized that all functions like this which use instanceByProviderID() will need to be updated too in order to account for a different ID... But the scope of changes is not that big either.
> Does it make more sense to expose the UID in a separate method and for your software to call that instead?
Not sure, because the software I was thinking about was the Kubelet, which calls the cloudprovider.GetInstanceProviderID() function in kubelet_node_status.go, and this in turn calls instances.InstanceID() here. This is a generic interface implemented by multiple Cloud Provider libraries, so if we create a new function, we will need to somehow call it from there, and this won't be compatible with other Cloud Provider implementations.
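For context, the relevant part of that generic interface looks roughly like this (an abridged sketch of cloudprovider.Instances from k8s.io/cloud-provider, trimmed to the methods that matter here; see the package itself for the authoritative definition):

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
)

// Abridged sketch of the generic Instances interface that every cloud provider
// implements; only the methods relevant to this discussion are shown.
type Instances interface {
	// InstanceID returns the cloud-provider-specific ID of the given node.
	// The GCE implementation currently derives this from the instance *name*,
	// which is the crux of this issue.
	InstanceID(ctx context.Context, nodeName types.NodeName) (string, error)

	// InstanceExistsByProviderID returns true if the instance with the given
	// provider ID still exists.
	InstanceExistsByProviderID(ctx context.Context, providerID string) (bool, error)

	// ...other methods omitted...
}
```

Since the kubelet only ever talks to providers through this interface, a GCE-only method would not be reachable from the generic registration path.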
One more note -- the instance id (i.e. the UID from the GCE instance) is already exposed on the node object as an annotation, and here is where it is set. Does that work for your use case?
> One more note -- the instance id (i.e. the UID from the GCE instance) is already exposed on the node object as an annotation, and here is where it is set. Does that work for your use case?
This is something that we already rely on, actually... To give a bit more context, we have a custom bash script that runs on each node before the kubelet has started and uses this annotation to check whether the node that is currently booting re-uses a previously used hostname: it compares the current node UID from the IMDS with the UID in the annotation of the node object, if such an object already exists.
I was hoping that we could stop relying on this custom logic and that we could include this check in the kubelet directly, for example here, when the kubelet tries to register the node with the API server. Since a similar node name re-use problem happens to us in AWS, it would've been nice to write a Cloud-Provider-agnostic check directly in the kubelet:
```go
if existingNode.Spec.ProviderID != node.Spec.ProviderID {
	// Delete existing Node
	// Try again
}
```
This would work for AWS, where the ProviderID is set to the actual UID of the instance (e.g. i-0513b0ff1ebf9342d), but unfortunately doesn't work for GCP, where the ProviderID is not unique :/
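To make the difference concrete, the provider IDs look roughly like this (the GCE project/zone/name below are made-up examples, and the exact URI formats should be double-checked against a real cluster):

```
aws:///us-east-1a/i-0513b0ff1ebf9342d          <- AWS: embeds the unique instance ID
gce://my-project/us-central1-a/my-node-name    <- GCE: embeds only project, zone and the (reusable) name
```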
> I was hoping that we could stop relying on this custom logic and that we could include this check in the kubelet directly, for example here, when the kubelet tries to register the node with the API server.
We have somewhat similar logic today, actually... When the kubelet attempts to register, we check if there already exists a node object in the api-server with the same name. If it does, and the instance id differs (based on looking at the annotation), we delete the old node object. See this logic in shouldDeleteNode, which is called as part of the node registration process.
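For anyone skimming, the shape of that check is roughly the following (a simplified sketch only; the annotation key and the function signature here are my assumptions, not the actual controller code):

```go
package example

import v1 "k8s.io/api/core/v1"

// instanceIDAnnotation is assumed to be the annotation under which the GCE
// numeric instance ID is stored on the Node object.
const instanceIDAnnotation = "container.googleapis.com/instance_id"

// shouldDeleteNode sketches the idea: if a Node object with the same name
// already exists but belongs to a different underlying VM (different instance
// ID), the stale Node should be deleted before the new kubelet registers.
func shouldDeleteNode(existing *v1.Node, currentInstanceID string) bool {
	previousID, ok := existing.Annotations[instanceIDAnnotation]
	if !ok {
		// Nothing to compare against; keep the existing Node.
		return false
	}
	return previousID != currentInstanceID
}
```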
Yeah, I already came across this code but I really wanted to implement a cloud-agnostic solution directly in the kubelet since we have a very similar problem in AWS... That code is too GCP-specific and is also part of a separate controller which not everybody necessarily runs.
/cc @cezarygerard
Outside of GCP/AWS this field may not be set at all and the contents of this field aren't strongly formatted or guaranteed for external users AFAIK.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
The InstanceID function defined in gce_instances.go does not return the actual "Instance ID" as it is defined by GCP (https://cloud.google.com/compute/docs/instances/get-instance-id#api). Instead, the function returns the "Node Name" which, as described in nodename.go, is the "Name of an Instance object in the GCE API" and is the same as the hostname.

However, this name cannot be used as a unique identifier for an instance. One example of where this breaks is the Managed Instance Group "Auto-heal / Auto-repair" feature: when a VM fails, the MIG will automatically re-create it with the same name, but it will not actually be the same instance (the local data will be gone, the IP will change, etc.).
An easy way to reproduce it: create a VM in a Managed Instance Group, note its numeric instance ID, then delete the VM and let the MIG re-create it. Note that the ID of the re-created VM has changed whereas the name stayed the same!
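For completeness, both values can also be read from the metadata server on the node itself, e.g. with the cloud.google.com/go/compute/metadata client (a small illustrative sketch, not part of any cloud provider code):

```go
package main

import (
	"fmt"
	"log"

	"cloud.google.com/go/compute/metadata"
)

func main() {
	// Numeric, globally unique instance ID: changes when the MIG re-creates the VM.
	id, err := metadata.InstanceID()
	if err != nil {
		log.Fatal(err)
	}
	// Instance name: stays the same across a MIG auto-heal / re-create.
	name, err := metadata.InstanceName()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("instance id: %s, instance name: %s\n", id, name)
}
```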
This proved to be quite problematic in our Kubernetes setup because Kubernetes will still think that the instance is the same, even though it has been replaced under the hood.
So, in order to be able to actually distinguish the instances, my proposal would be to modify the code to something like this:
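Roughly along these lines (this is a sketch of the idea rather than an exact diff; the lookup helper and the numeric ID field are assumptions about the gce package internals):

```go
// Sketch: build the instance identifier from the numeric, globally unique
// instance ID instead of the (reusable) instance name.
// mapNodeNameToInstanceName, getInstanceByName and the ID field are assumed
// internals of the gce package, shown here only to illustrate the direction.
func (g *Cloud) InstanceID(ctx context.Context, nodeName types.NodeName) (string, error) {
	instanceName := mapNodeNameToInstanceName(nodeName)
	inst, err := g.getInstanceByName(instanceName)
	if err != nil {
		return "", err
	}
	// The numeric ID is never reused: a re-created VM gets a brand new ID even
	// if it keeps the old name.
	return strconv.FormatUint(inst.ID, 10), nil
}
```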
This would allow us to reliably detect when these events happen by using the actual unique instance id instead of the name, which is not unique.
What do you think?