Closed randomvariable closed 1 year ago
/area ls
@randomvariable: The label(s) area/ls cannot be applied, because the repository doesn't have them.
/area topology
/milestone Next
/priority important-longterm
There is a question here around automation that needs to be dug into - possibly there's some stronger signal for the workloads that can be automatically consumed somewhere and feed into a lifecycle hook of some sort.
Interested folk may at some point want to work in the CNCF Telco WG and ETSI and define some standards.
/milestone v0.4
Currently the discussions of upgrade strategy suggest that queue placement will be derived from alphanumeric ordering based on name. This may work if there is an alignment between the role of a node and the upgrade strategy.
My feeling is that whilst this could be a sane default we should be able to sequence queue placement based on a customisable selector such as a label which can be specified within the managed topology of a given ClusterClass manifest. We could also include an 'ascending' bool which allows us to enforce the forwards/backwards relationship of an upgrade sequence.
Where there are specific hardware constraints or topological requirements (all the backend, then frontend, then app servers) it should be possible to apply reasonable logic to induce an upgrade without doing something unforeseen.
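To make the proposal concrete, a hypothetical shape for such fields inside the managed topology might look like the sketch below. None of these field names exist in the Cluster API types today; they are purely illustrative of the selector + ordering idea:

```yaml
# Hypothetical API sketch only -- these fields are NOT part of Cluster API.
spec:
  topology:
    workers:
      upgradeSequence:
        # Order MachineDeployments by the value of this (illustrative) label...
        sortByLabel: upgrade.example.com/order
        # ...and allow enforcing a forwards/backwards relationship.
        ascending: true
```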
The requirements in this area can involve very niche hardware constraints, and a dependence on the Containerized Network Functions being up and running (and warmed up, as some require ramping up on traffic etc.) before the old, unupgraded version can be taken offline. The RolloutStrategy here could end up looking like a complicated Machine scheduler if we try to bake all of that into a controller.
Is there an easy-win use case to start building these requirements around? Or would an MVP look like some annotation or some other externally defined field that impacts machine rollout ordering?
I think an MVP with the fields/annotations I mentioned, to enact some level of control, would be enough.
At the application layer it is not uncommon to devise strategies with a combination of helm/kustomize/CloneSets for the subset of use cases where components need to roll out in a more granularly controlled order (though this precludes system components).
> Currently the discussions of upgrade strategy suggest that queue placement will be derived from alphanumeric ordering based on name. This may work if there is an alignment between the role of a node and the upgrade strategy.

A clarification here: the control plane will always be upgraded first, then the machine deployments; alphanumeric ordering is the default strategy used to define a predictable order for upgrading machine deployments (a machine deployment will never be upgraded before the control plane).
> We should be able to sequence queue placement based on a customisable selector such as a label which can be specified within the managed topology of a given ClusterClass manifest.

This is a good suggestion that, as soon as the default option is implemented, should be considered together with the idea of pausing before (or after) each machine deployment upgrade.
> We could also include an 'ascending' bool which allows us to enforce the forwards/backwards relationship of an upgrade sequence.

I'm a bit more conservative about this, because once a way to control the order is provided, the same result can be achieved by that means. Also, introducing ascending/descending gives a relative meaning to the ordering itself, which could be confusing.
> The requirements in this area can rely on very niche hardware constraints and relying on the Containerized Network Functions to be up and running (and warmed up as some require ramping on traffic etc.) before being able to take the old, unupgraded version offline.

This is a slightly more complex use case, and IMO we should defer this topic to the wider discussion about lifecycle hooks, which would provide a means for external controllers to plug into CAPI workflows, including upgrades.
Note: sequencing of MachineDeployments is now defined by the order in Cluster.Topology.Workers.MachineDeployments, so the user has full control of it. The delta to complete this story is to introduce a mechanism for pausing in between MachineDeployment upgrades.
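Concretely, the upgrade order of worker pools follows the list order in the Cluster topology. A minimal sketch (cluster, class, and pool names are illustrative):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster            # illustrative name
spec:
  topology:
    class: my-cluster-class   # illustrative ClusterClass name
    version: v1.28.0          # bumping this triggers the rollout
    workers:
      machineDeployments:
      # MachineDeployments are upgraded in list order,
      # always after the control plane:
      - class: default-worker
        name: md-backend      # upgraded first
      - class: default-worker
        name: md-frontend     # upgraded second
```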
/milestone v1.0
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Mark this issue as rotten with `/lifecycle rotten`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
/remove-lifecycle stale
The same consideration regarding controlled MachineDeployment updates applies to a multi-tenant cluster, where each tenant (team/department/etc.) owns a set of node groups (MachineDeployments) and has its own maintenance window in which its node groups can be updated.

I've asked a question in Slack about this scenario, and right now it is possible by adding the `cluster.x-k8s.io/paused` annotation on each of the MachineDeployments, updating the `version` field in the Cluster spec, and then removing this annotation from each MachineDeployment that has to be updated.
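Assuming the workflow described above, the per-tenant gating hinges on the pause annotation. A sketch of a paused MachineDeployment (the name is illustrative; the annotation key is the real one used by Cluster API):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-tenant-a           # illustrative tenant-owned node group
  annotations:
    # Reconciliation of this MachineDeployment is paused while this
    # annotation is present; remove it during the tenant's maintenance
    # window to let the upgrade proceed.
    cluster.x-k8s.io/paused: "true"
```

After annotating every tenant's MachineDeployments, bump `spec.topology.version` on the Cluster, then remove the annotation from one MachineDeployment at a time.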
/triage accepted
/remove-lifecycle stale
As Killian wrote on the PR, this issue is not related to #7401.
/help
@fabriziopandini: This request has been marked as needing help from a contributor.
Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the `/remove-help` command.
/lifecycle stale
/lifecycle frozen
@fabriziopandini I think this issue has been addressed for now by https://github.com/kubernetes-sigs/cluster-api/issues/8100
WDYT?
/close
Agreed. With https://github.com/kubernetes-sigs/cluster-api/issues/8100 we have a first solution in place, and we can iterate in a new issue as soon as we have some feedback / more detailed requirements.
@fabriziopandini: Closing this issue.
User Story
As a cluster operator, I would like to control the rollout of upgrades to my MachineDeployments when consuming ClusterClass, so that I can verify the operation of my nodes and workloads. My workloads include network function virtualisation, which is tied to particular hardware, requiring careful rollout to meet availability requirements for my telecoms network.
Detailed Description
Just to be clear, the person "I" is a fictional one, but is based on customers I've talked to.
In a cluster serving network functions, machines are not readily swappable even when virtualised, as they are often tied to a particular piece of hardware, e.g. a Radio Access Network card providing connectivity to mobile phones at a cell site, GPUs, and other network accelerators. In this scenario, I will want to roll out one MachineDeployment, ensure everything is running OK, and then initiate the upgrade of the next MachineDeployment.
There is a question here around automation that needs to be dug into - possibly there's some stronger signal for the workloads that can be automatically consumed somewhere and fed into a lifecycle hook of some sort. The NFV software itself may be of varying quality, though, so there could be limits to the amount of health signal it gives, which might at the end of the day require manual intervention to roll out MachineDeployments.
Anything else you would like to add:
/kind feature