franzudev opened this issue 1 year ago
AFAIK this is the intended design and behaviour.
From the docs:
This hook is called after the Cluster object has been updated with a new spec.topology.version by the user, and immediately before the new version is going to be propagated to the control plane
The topology controller will not update a control plane unless it is in a healthy state, so "immediately before" means only when the version would otherwise be propagated. The wording could definitely be improved to clarify that a healthy control plane is required.
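For reference, on the extension side a handler for this hook looks roughly like the sketch below (based on the Runtime SDK types in exp/runtime/hooks/api/v1alpha1; package paths and field names may differ between releases, and registration with the extension server is omitted). Returning a non-zero RetryAfterSeconds is how an extension blocks the upgrade:

```go
package handlers

import (
	"context"
	"fmt"

	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
)

// DoBeforeClusterUpgrade sketches a BeforeClusterUpgrade handler. The topology
// controller calls it right before the new spec.topology.version is propagated
// to the control plane (and, per this thread, only when the control plane is healthy).
func DoBeforeClusterUpgrade(ctx context.Context, request *runtimehooksv1.BeforeClusterUpgradeRequest, response *runtimehooksv1.BeforeClusterUpgradeResponse) {
	// Run pre-upgrade add-on tasks here; addOnsReady is a hypothetical helper.
	if !addOnsReady(request.FromKubernetesVersion, request.ToKubernetesVersion) {
		// Block the upgrade: the controller keeps retrying the hook until
		// RetryAfterSeconds comes back as 0.
		response.Status = runtimehooksv1.ResponseStatusSuccess
		response.RetryAfterSeconds = 30
		response.Message = fmt.Sprintf("add-ons not ready for upgrade of cluster %s", request.Cluster.Name)
		return
	}
	// Allow the upgrade to proceed.
	response.Status = runtimehooksv1.ResponseStatusSuccess
}

// addOnsReady is a placeholder for extension-specific pre-upgrade checks.
func addOnsReady(fromVersion, toVersion string) bool { return true }
```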
Why do you think this is logically wrong?
/triage accepted
Runtime Extension implementers can use this hook to execute pre-upgrade add-on tasks and block upgrades of the ControlPlane and Workers
Because of this, if I need to apply my custom logic or block the ControlPlane upgrade, I'm not able to. I think the correct path would be:
BeforeClusterUpgrade -> extension handles the request -> CAPI checks node status -> if OK -> upgrade starts
and not
CAPI checks node status -> if OK -> BeforeClusterUpgrade -> extension handles the request -> upgrade starts
because if I try to intercept this kind of hook, I want to be in charge of managing the upgrade and/or blocking it myself, or to be able to apply my add-on tasks to get the node into a healthy state before the upgrade, IMHO.
if I try to intercept this kind of hook, I want to be in charge of managing the upgrade and/or blocking it myself, or to be able to apply my add-on tasks to get the node into a healthy state before the upgrade, IMHO.
In this case the upgrade will always be blocked because the control plane isn't in a healthy state, and that block supersedes any block the hook could apply. Why is the node not in a healthy state prior to the upgrade? Is the node's health somehow related to the upgrade?
In some cases, but I don't think that's the point. The point is: why should I use BeforeClusterUpgrade if I can't apply my custom logic before the upgrade, or can't block it based on my own needs?
The point is: why should I use BeforeClusterUpgrade if I can't apply my custom logic before the upgrade, or can't block it based on my own needs?
The hook is called specifically at the last moment before the upgrade is initiated, so you should be able to block it or apply custom logic before the upgrade. Is the hook being skipped and the upgrade proceeding in this case?
Making sure the control plane is healthy is the role of the control plane provider, so if there's something wrong with a node it may be able to remediate to fix the state of the control plane.
If I'm understanding your case correctly, you would like the hook to be called even though the upgrade will subsequently be blocked by something else in the Cluster topology controller - which is also a confusing outcome for users.
I think the specifics of your case are important here - why is the control plane unhealthy and what needs to be done to fix it?
I'm using the konnectivity service and my nodes are not registered in the API server; the cluster still functions correctly, but CAPI does not. For me it makes no sense to be given control over the lifecycle if I can't really control it. IMHO, the control plane provider needs to check the control plane status after the user's (extension's) intervention and before the upgrade, because what happens if I break a control plane node during the runtime extension's execution?
IMHO, the control plane provider needs to check the control plane status after the user's (extension's) intervention and before the upgrade, because what happens if I break a control plane node during the runtime extension's execution?
In general the role of the runtime extension shouldn't be to fix a control plane if it's broken in some way. The recommended way to resolve problems with nodes is to use MachineHealthChecks.
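As a rough illustration (using the v1beta1 Go types; a plain YAML manifest works just as well, and the names and timeouts here are made up), a MachineHealthCheck targeting the control plane machines could look something like this:

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// controlPlaneMHC returns a MachineHealthCheck asking CAPI to remediate
// control plane machines whose Node stays NotReady or Unknown for too long.
func controlPlaneMHC(clusterName, namespace string) *clusterv1.MachineHealthCheck {
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-control-plane-mhc", // illustrative name
			Namespace: namespace,
		},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: clusterName,
			Selector: metav1.LabelSelector{
				// Label set by CAPI on control plane machines.
				MatchLabels: map[string]string{"cluster.x-k8s.io/control-plane": ""},
			},
			UnhealthyConditions: []clusterv1.UnhealthyCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionUnknown, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
				{Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
			},
		},
	}
}
```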
When you say the nodes aren't registered with the APIServer, do you mean that not all control plane nodes have joined the cluster? Are you trying to fix this with a runtime extension?
I'm using the konnectivity service and my nodes are not registered in the API server; the cluster still functions correctly, but CAPI does not. For me it makes no sense to be given control over the lifecycle if I can't really control it.
What would the hook do in the case where the control plane is unhealthy before upgrading?
Regardless of the status of the nodes, I expect to receive the event. That doesn't happen. If the user updates the version of the Cluster, I need to know it; the control logic behind it doesn't matter to me, but if I opt in to the Runtime SDK I need to get these events.
We might need to update the current documentation; BeforeClusterUpgrade currently assumes that the cluster is healthy, although I can see a use case where one might always want the call to the extension to go through.
We have a couple of options; in general, though, it seems like we either need a different hook or need to rework the current assumption to take unhealthy clusters into account.
/priority backlog
For someone to look into and find a way forward. But I don't think we should mix up runtime extensions with health check/remediation.
/help
@fabriziopandini: This request has been marked as needing help from a contributor.
Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
What steps did you take and what happened?
Using ClusterClass, we found unexpected behavior when trying to upgrade the cluster and intercept the BeforeClusterUpgrade hook using the Runtime SDK. If a ControlPlane isn't in a healthy state, the hook is not propagated to the runtime extension. I think this is logically wrong and contradicts the documentation.
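(For context, the upgrade step above amounts to bumping spec.topology.version on the Cluster. A minimal sketch with the controller-runtime client is below; the same can be done with kubectl edit, and the helper name is illustrative.)

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// bumpTopologyVersion updates spec.topology.version on a ClusterClass-based
// Cluster; this is the change that is expected to trigger BeforeClusterUpgrade.
func bumpTopologyVersion(ctx context.Context, c client.Client, name, namespace, newVersion string) error {
	cluster := &clusterv1.Cluster{}
	if err := c.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, cluster); err != nil {
		return err
	}
	if cluster.Spec.Topology == nil {
		return fmt.Errorf("cluster %s/%s is not using a managed topology (ClusterClass)", namespace, name)
	}
	cluster.Spec.Topology.Version = newVersion
	return c.Update(ctx, cluster)
}
```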
What did you expect to happen?
To receive the BeforeClusterUpgrade request even if the ControlPlane is unhealthy, and to manage the problem during the upgrade phase.
Cluster API version
1.4.2+
Kubernetes version
1.27.2
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.