kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

BeforeClusterUpgrade is not called with cp unavailable #8942

Open franzudev opened 1 year ago

franzudev commented 1 year ago

What steps did you take and what happened?

Using ClusterClass, we found unexpected behavior when we try to upgrade the cluster and intercept the BeforeClusterUpgrade hook using the Runtime SDK. If the ControlPlane isn't in a healthy state, the hook doesn't get propagated to the runtime extension. I think this is logically wrong and in contrast with the documentation.
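For context, the extension registers a handler for the hook roughly like this (a minimal sketch only; the handler name is made up, and I'm assuming the runtimehooksv1 types from sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1):

```go
package example

import (
	"context"

	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
)

// doBeforeClusterUpgrade is a hypothetical handler registered for the
// BeforeClusterUpgrade hook. With an unhealthy ControlPlane this handler
// is never invoked, which is the behavior described above.
func doBeforeClusterUpgrade(ctx context.Context, req *runtimehooksv1.BeforeClusterUpgradeRequest, resp *runtimehooksv1.BeforeClusterUpgradeResponse) {
	// The request carries the Cluster object plus the from/to Kubernetes versions.
	_ = req.FromKubernetesVersion
	_ = req.ToKubernetesVersion

	// Acknowledging with Success lets the upgrade proceed.
	resp.SetStatus(runtimehooksv1.ResponseStatusSuccess)
}
```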

What did you expect to happen?

To receive the BeforeClusterUpgrade request even if the ControlPlane is unhealthy, and to handle the problem during the upgrade phase.

Cluster API version

1.4.2+

Kubernetes version

1.27.2

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug

One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

killianmuldoon commented 1 year ago

AFAIK this is the intended design and behaviour.

From the docs:

This hook is called after the Cluster object has been updated with a new spec.topology.version by the user, and immediately before the new version is going to be propagated to the control plane

The topology controller will not update a control plane unless it is in a healthy state, so "immediately before" means the hook is only called if the version would otherwise be propagated. The wording could definitely be improved to clarify that it requires a healthy control plane.

Why do you think this is logically wrong?

killianmuldoon commented 1 year ago

/triage accepted

franzudev commented 1 year ago

Runtime Extension implementers can use this hook to execute pre-upgrade add-on tasks and block upgrades of the ControlPlane and Workers

Because of this, if I need to apply my custom logic or block the ControlPlane upgrade, I'm not able to. I think the correct path would be

BeforeClusterUpgrade -> extension handles the requests -> CAPI checks nodes status -> if ok -> Upgrade start

and not

CAPI checks nodes status -> if ok -> BeforeClusterUpgrade -> extension handles the requests -> Upgrade start

because if I intercept this kind of hook, I want to be in charge of managing the upgrade and/or blocking it myself, or be able to apply my add-on tasks to get the nodes into a healthy state before the upgrade, imho.
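To make the two orderings concrete, here is a rough pseudocode sketch; the helper names (controlPlaneHealthy, callBeforeClusterUpgrade, startUpgrade) are made up for illustration and are not the actual CAPI internals:

```go
package example

import "context"

// Hypothetical helpers, stubbed only so the sketch compiles.
func controlPlaneHealthy(ctx context.Context) bool                { return false }
func callBeforeClusterUpgrade(ctx context.Context) (blocked bool) { return false }
func startUpgrade(ctx context.Context) error                      { return nil }

// Current behavior (as I understand it): the health gate runs first,
// so BeforeClusterUpgrade never fires while the ControlPlane is unhealthy.
func currentOrdering(ctx context.Context) error {
	if !controlPlaneHealthy(ctx) {
		return nil // upgrade deferred, hook never called
	}
	if callBeforeClusterUpgrade(ctx) {
		return nil // extension blocked the upgrade
	}
	return startUpgrade(ctx)
}

// Proposed behavior: call the hook first so the extension can run its
// pre-upgrade tasks (or block), then let CAPI apply its own health gate.
func proposedOrdering(ctx context.Context) error {
	if callBeforeClusterUpgrade(ctx) {
		return nil // extension blocked the upgrade
	}
	if !controlPlaneHealthy(ctx) {
		return nil // upgrade deferred by CAPI after the hook has run
	}
	return startUpgrade(ctx)
}
```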

killianmuldoon commented 1 year ago

if I intercept this kind of hook, I want to be in charge of managing the upgrade and/or blocking it myself, or be able to apply my add-on tasks to get the nodes into a healthy state before the upgrade, imho.

In this case the upgrade will always be blocked anyway, because the control plane isn't in a healthy state; that health check takes precedence over anything the hook could do. Why is the node not in a healthy state prior to the upgrade? Is the health of the node somehow related to the upgrade?

franzudev commented 1 year ago

In some cases, but I don't think that's the point. The point is: why should I use BeforeClusterUpgrade if I can't apply my custom logic before the upgrade, or can't block it when I need to?

killianmuldoon commented 1 year ago

The point is: why should I use BeforeClusterUpgrade if I can't apply my custom logic before the upgrade, or can't block it when I need to?

The hook is called specifically at the last moment before the upgrade is initiated, so you should be able to block it or apply custom logic before the upgrade - is the hook being skipped and the upgrade proceeding in this case?
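For reference, blocking from the extension side looks roughly like this (a sketch, assuming the v1alpha1 response methods; the function itself is only illustrative):

```go
package example

import (
	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
)

// blockUpgrade shows a blocking response from inside a BeforeClusterUpgrade handler.
func blockUpgrade(resp *runtimehooksv1.BeforeClusterUpgradeResponse) {
	// Ask for a retry: the topology controller keeps re-calling the hook
	// until the extension stops requesting a retry.
	resp.SetRetryAfterSeconds(30)
	resp.SetStatus(runtimehooksv1.ResponseStatusSuccess)
	resp.SetMessage("waiting for pre-upgrade add-on tasks to finish")

	// Alternatively, fail outright, which also prevents the upgrade:
	// resp.SetStatus(runtimehooksv1.ResponseStatusFailure)
}
```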

Making sure the control plane is healthy is the role of the control plane provider, so if there's something wrong with a node it may be able to remediate to fix the state of the control plane.

If I'm understanding your case correctly you would like the hook to be called even though the upgrade will be subsequently blocked by something else in the Cluster topology controller - which is also a confusing outcome for users.

I think the specifics of your case are important here - why is the control plane unhealthy and what needs to be done to fix it?

franzudev commented 1 year ago

I'm using the konnectivity service and my nodes are not registered in the API server; the cluster still functions correctly, but CAPI doesn't see it that way. For me it makes no sense that I get control over the lifecycle if I can't really control it. IMHO, the control plane provider needs to check the ControlPlane status after the user's (extension's) intervention and before the upgrade, because what happens if I break a control plane during the runtime extension's execution?

killianmuldoon commented 1 year ago

IMHO, the control plane provider needs to check the ControlPlane status after the user's (extension's) intervention and before the upgrade, because what happens if I break a control plane during the runtime extension's execution?

In general the role of the runtime extension shouldn't be to fix a control plane if it's broken in some way. The recommended way to resolve problems with nodes is to use MachineHealthChecks.
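For example, remediation of control plane Machines whose Node never becomes Ready could be expressed as a MachineHealthCheck; a minimal sketch in Go using the clusterv1 v1beta1 types (the names, namespace, and timeouts are only examples):

```go
package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// exampleMHC remediates control plane Machines whose Node is NotReady or Unknown
// for more than 5 minutes.
func exampleMHC() *clusterv1.MachineHealthCheck {
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{Name: "cp-unhealthy", Namespace: "default"},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: "my-cluster",
			Selector: metav1.LabelSelector{
				MatchLabels: map[string]string{"cluster.x-k8s.io/control-plane": ""},
			},
			UnhealthyConditions: []clusterv1.UnhealthyCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionUnknown, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
				{Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
			},
		},
	}
}
```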

When you say the nodes aren't registered with the API server, do you mean that not all control plane nodes have joined the cluster? Are you trying to fix this with a runtime extension?

vincepri commented 1 year ago

I'm using the konnectivity service and my nodes are not registered in the API server; the cluster still functions correctly, but CAPI doesn't see it that way. For me it makes no sense that I get control over the lifecycle if I can't really control it.

What would the hook do in the case where the control plane is unhealthy before upgrading?

franzudev commented 1 year ago

Regardless of the status of the nodes, I expect to receive the event, and that doesn't happen. If the user updates the version of the Cluster I need to know about it; the control logic behind it doesn't matter to me, but if I opt in to the Runtime SDK I need to get these events.

vincepri commented 1 year ago

We might need to update the current documentation: BeforeClusterUpgrade currently assumes that the cluster is healthy, although I can see a use case where one might always want the call to the extension to go through.

We have a couple of options; in general, though, it seems like we either need a different hook or need to rework the current assumption to take unhealthy clusters into account.

fabriziopandini commented 7 months ago

/priority backlog

fabriziopandini commented 6 months ago

For someone to look into + find a way forward. But I don't think we should mix up runtime extensions with health check / remediation.

/help

k8s-ci-robot commented 6 months ago

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/8942):

> For someone to look into + find a way forward.
> But I don't think we should mix up runtime extensions with health check / remediation.
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.