chess-knight commented 6 months ago

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened: This issue follows a discussion in https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1605, as a request to automatically label nodes with underlying hostId information, so e.g. workload can be scheduled on different physical hosts. It can be used as a topology differentiator when all other topology labels are the same.

Anything else we need to know?: Issue https://github.com/kubernetes/cloud-provider/issues/67 is closed and potentially resolved by https://github.com/kubernetes/kubernetes/pull/123223 so now based on comment https://github.com/kubernetes/cloud-provider/issues/67#issuecomment-2003176563, AdditionalLabels with hostId information can be added to InstanceMetadata: https://github.com/kubernetes/cloud-provider-openstack/blob/dab0f067249eea1051e3bb57ed05d2cc3e326aa8/pkg/openstack/instancesv2.go#L136-L142 This information should be available in the server struct AFAIK, so there should not be too much work I think now.

One potential issue that I see is live migration, e.g. see https://github.com/kubernetes/cloud-provider-openstack/issues/1801, where occm will have to update the label because hostId will change.

Environment:

openstack-cloud-controller-manager(or other related binary) version:
OpenStack version:
Others:

dulek commented 6 months ago

This seems like a valid request from my point of view. @mdbooth, what do you think?

@chess-knight: Are you planning to contribute implementation?

dulek commented 6 months ago

@gryf, @stephenfin: This sounds like a low hanging fruit you can grab to get up to speed with CPO code.

zetaab commented 5 months ago

is hostid really available to normal user in openstack? I do not have access now to openstack, but if I remember correctly normal user cannot see that?

chess-knight commented 5 months ago

According to our research in https://github.com/SovereignCloudStack/issues/issues/540, hostId should be available for all users. You can see e.g. in the nova code, that it is supported from API version 2.62 https://opendev.org/openstack/nova/commit/c2f7d6585818c04e626aa4b6c292e5c2660cb8b3. hostId is different from host. The host is available only to admins but hostId is a hash of (project_id + host) so it can be available to all.

zetaab commented 5 months ago

actually I can see both hostId and host_id which has same value. Perhaps that label could be added if it exists. This can be added in https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/openstack/instancesv2.go#L136-L142 as AdditionalLabels https://github.com/kubernetes/cloud-provider/blob/d2f5e75a5fd6d31f75b1519fe20879b1ab5347b8/cloud.go#L300

chess-knight commented 5 months ago

actually I can see both hostId and host_id which has same value. Perhaps that label could be added if it exists. This can be added in https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/openstack/instancesv2.go#L136-L142 as AdditionalLabels https://github.com/kubernetes/cloud-provider/blob/d2f5e75a5fd6d31f75b1519fe20879b1ab5347b8/cloud.go#L300

Yes, exactly as I originally wrote in the issue, thanks. As @dulek said, it shouldn't be hard to implement this, but IMO discussion is needed here, on what should/will happen when live migration happens. When we initially set this custom label, it may not be correct later because the underlying host might change. So, should we consider this custom label only as "initial_host_id"? Or is there anything that we can do to reconcile it?

zetaab commented 5 months ago

we are not using live migrations at all, so difficult to say how it works.

mdbooth commented 5 months ago

IMHO the node controller should update node labels to reflect their current reality, i.e. a live migration will trigger node relabelling the next time the node is reconciled. Most things currently running on the Node are unlikely to act on it, but:

Newly scheduled workloads can act on it
Existing workloads at least have the possibility to notice that they're in the wrong place now
We should document the known limitations and recommend an alternative (node reprovisioning)

We should also validate this decision with the cloud-provider folks in case there are caveats we're not aware of.

mdbooth commented 5 months ago

I'm very much in favour of adding a HostID label. I can confirm that HostID is unprivileged. It's an opaque value which can't be used to determine anything about the host including its name.

artificial-intelligence commented 5 months ago

I'm very much in favour of adding a HostID label. I can confirm that HostID is unprivileged. It's an opaque value which can't be used to determine anything about the host including its name.

Hi,

could you - or anybody - please clarify: You say this HostID can't be used to determine anything about the host, but from the name I would suppose it's some kind of unique(?) distinct value per host, no?

So can't I at least infer, that when the HostID changes, the underlying host has changed?

If the answer to the above is "no", so you can't infer anything from this ID, I don't see where adding it would bring any benefit, at least for our use case in the Sovereign Cloudstack project.

So any clarification around this would be highly appreciated, thanks!

mdbooth commented 5 months ago

I'm very much in favour of adding a HostID label. I can confirm that HostID is unprivileged. It's an opaque value which can't be used to determine anything about the host including its name.

Hi,

could you - or anybody - please clarify: You say this HostID can't be used to determine anything about the host, but from the name I would suppose it's some kind of unique(?) distinct value per host, no?

It's a sha224 of project_id and hostname: https://github.com/openstack/nova/blob/7dc4b1ea627d864a0ee2745cc9de4336fc0ba7b5/nova/utils.py#L1028-L1043

So hostID can't be compared between tenants.

So can't I at least infer, that when the HostID changes, the underlying host has changed?

If the answer to the above is "no", so you can't infer anything from this ID, I don't see where adding it would bring any benefit, at least for our use case in the Sovereign Cloudstack project.

So any clarification around this would be highly appreciated, thanks!

@stephenfin may be able to confirm that hostID will change if a VM live migrates, but I'm pretty sure that it would.

In general, k8s isn't going to handle a live migration well because, in general, we don't continuously reconcile the placement of things which have already been scheduled.

I think the value of HostID to a kubernetes cluster is the ability to schedule Pods on different underlying hypervisors. This means that an end-user can ensure their workload can survive a maintenance outage of a single hypervisor.

chess-knight commented 5 months ago

I think the value of HostID to a kubernetes cluster is the ability to schedule Pods on different underlying hypervisors. This means that an end-user can ensure their workload can survive a maintenance outage of a single hypervisor.

I am thinking of the following scenario:

OCCM will add the correct hostId label when the node starts
Live migration happens
hostId label is incorrect now so worker nodes with different hostId can be placed on the same host. And when the user uses this incorrect label, the pods can end up on the same host.

mdbooth commented 5 months ago

I think the value of HostID to a kubernetes cluster is the ability to schedule Pods on different underlying hypervisors. This means that an end-user can ensure their workload can survive a maintenance outage of a single hypervisor.

I am thinking of the following scenario:

OCCM will add the correct hostId label when the node starts

Live migration happens

hostId label is incorrect now so worker nodes with different hostId can be placed on the same host. And when the user uses this incorrect label, the pods can end up on the same host.

I expect it to be updated.

However, live migrating k8s hosts already violates the scheduling constraints of everything which was running on it. Live migrating k8s workers is not a good idea if it can be avoided. Simply draining the node and shutting it down during maintenance is preferrable.

chess-knight commented 5 months ago

I think the value of HostID to a kubernetes cluster is the ability to schedule Pods on different underlying hypervisors. This means that an end-user can ensure their workload can survive a maintenance outage of a single hypervisor.

I am thinking of the following scenario:

OCCM will add the correct hostId label when the node starts

Live migration happens

hostId label is incorrect now so worker nodes with different hostId can be placed on the same host. And when the user uses this incorrect label, the pods can end up on the same host.

However, live migrating k8s hosts already violates the scheduling constraints of everything which was running on it. Live migrating k8s workers is not a good idea if it can be avoided. Simply draining the node and shutting it down during maintenance is preferrable.

I agree with you, that live-migrating k8s node is not a good idea.

I expect it to be updated.

But which controller should be responsible for that? Right now, I am not aware of any.

mdbooth commented 5 months ago

But which controller should be responsible for that? Right now, I am not aware of any.

This PR is requesting that OpenStack CCM sets it, so OpenStack CCM would also be responsible for updating it. IIRC there is now a mechanism for returning arbitrary node labels, but I don't recall what it is.

chess-knight commented 4 months ago

But which controller should be responsible for that? Right now, I am not aware of any.

This PR is requesting that OpenStack CCM sets it, so OpenStack CCM would also be responsible for updating it. IIRC there is now a mechanism for returning arbitrary node labels, but I don't recall what it is.

Do you mean AdditionalLabels in the InstanceMetadata mentioned in this issue or something else? If so, I still don't know how these labels can be updated.

chess-knight commented 3 months ago

I created PR so the discussion can move on. Can someone try it, please? You can use registry.scs.community/occm-rh/openstack-cloud-controller-manager:v1.30.0-8-ga00ee1d8 image.

chess-knight commented 3 months ago

2628 is approved. Should we merge it immediately and close this issue or if someone wants to look at it? I am not able to test migrations where host-id will change. I tried only deletion of the label with kubectl command and this additional label is not reconciled back into the place, so I assume that live migration will have the same effect(wrong host-id label after).

Maybe we can just document this new host-id label with my observations and proceed...

kayrus commented 3 months ago

@chess-knight I agree, the label must be reconciled and updated once the node is live-migrated.

chess-knight commented 3 months ago

@chess-knight I agree, the label must be reconciled and updated once the node is live-migrated.

I am not sure if OCCM is capable of doing that. Maybe after all we should go back into the original issue and implement it in the CAPO, where machine reconciliation happens(I hope so). CAPO can introduce e.g. node.cluster.x-k8s.io/host-id label, set it on the Machine object and it will be propagated to the k8s Node labels. Or just document this limitation, something like that host-id can serve as the "initial" host-id without any guarantees in the future.

mdbooth commented 2 months ago

@chess-knight As well as reconciling new Nodes, the node controller resyncs nodes periodically: https://github.com/kubernetes/kubernetes/blob/03fe89c2339a1582733649faab5f5df471f65f09/staging/src/k8s.io/cloud-provider/controllers/node/node_controller.go#L191-L198

However, it looks like that job:

Ignores tainted nodes. These are handled separately by sync(), which sets all values.
Only sets Addresses

It sounds like if we want to continuously reconcile zone information that should be a discussion with the cloud-provider folks. Maybe @aojea can let us know if this has been discussed before, and if not the best place to start the discussion.

My view: Kubernetes doesn't expect zone information to change, and in general will not respond to changes in zone information. We should advise users that there are alternatives which will give better behaviour. Despite that, zone information can still change, which means it will occasionally change. An example is a managed cloud service where the user has no influence over the migration of workloads. By updating the zone information on the Node when it does change we:

at least make it detectable
allow new workloads to be created in the correct place I believe we should continuously reconcile zone information, including node labels.

For now, this is an edge case. Lets return HostID in the instance metadata as is done by https://github.com/kubernetes/cloud-provider-openstack/pull/2628. This is an immediate win for anybody wanting to schedule with hypervisor anti-affinity. The problem of continuous reconciliation is somewhat independent as it covers more than just the HostID label.

chess-knight commented 2 months ago

Hi @mdbooth, thank you for your review.

@chess-knight As well as reconciling new Nodes, the node controller resyncs nodes periodically: https://github.com/kubernetes/kubernetes/blob/03fe89c2339a1582733649faab5f5df471f65f09/staging/src/k8s.io/cloud-provider/controllers/node/node_controller.go#L191-L198

However, it looks like that job:

Ignores tainted nodes. These are handled separately by sync(), which sets all values.

Only sets Addresses

Interestingly, the comment suggests reconcile the nodes addresses and labels, not only addresses.

It sounds like if we want to continuously reconcile zone information that should be a discussion with the cloud-provider folks. Maybe @aojea can let us know if this has been discussed before, and if not the best place to start the discussion.

My view: Kubernetes doesn't expect zone information to change, and in general will not respond to changes in zone information. We should advise users that there are alternatives which will give better behaviour. Despite that, zone information can still change, which means it will occasionally change. An example is a managed cloud service where the user has no influence over the migration of workloads. By updating the zone information on the Node when it does change we:

at least make it detectable

allow new workloads to be created in the correct place I believe we should continuously reconcile zone information, including node labels.

For now, this is an edge case. Lets return HostID in the instance metadata as is done by #2628. This is an immediate win for anybody wanting to schedule with hypervisor anti-affinity. The problem of continuous reconciliation is somewhat independent as it covers more than just the HostID label.

I agree, that updating labels needs to be discussed. Maybe it can be configurable on/off behaviour. Do you think that I should write also some docs about the HostID label, so users are aware of it?

kubernetes / cloud-provider-openstack

[occm] Add Openstack server hostId as k8s node label #2579