kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Improve failure domains #10476

Closed fabriziopandini closed 1 month ago

fabriziopandini commented 6 months ago

Grouping a couple of issues/ideas about failure domains which are not getting attention from the community.

To address this issue we need a proposal that looks into how to handle operations for failure domains (going beyond the initial placement of machines that is currently supported).

https://github.com/kubernetes-sigs/cluster-api/issues/4031

Currently failure domains are assumed to be always available, so during an outage or issues with an AZ a KCP machine would still be created there. The short-term solution is to remove the AZ from the status, but this might be confusing, as someone would see an AZ missing from the list for no apparent reason. As this is a breaking change, we'll likely want to defer it to v1alpha4.
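For reference, this is roughly how failure domains surface today: the infrastructure provider reports them and they are copied into `Cluster.status.failureDomains` (a minimal sketch; the AZ names are illustrative):

```yaml
# Sketch of the current status reported for a cluster with three AZs.
# The short-term workaround above would remove an unhealthy AZ from this map,
# which stops new placements there but gives no hint as to why it disappeared.
status:
  failureDomains:
    az-1:
      controlPlane: true
    az-2:
      controlPlane: true
    az-3:
      controlPlane: true
```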

https://github.com/kubernetes-sigs/cluster-api/issues/5667

As a user/operator of a non-cloud-provider cluster (e.g. bare metal), I would like CAPI to label Nodes with the well-known label that corresponds to the failure domain that was selected by CAPI.

As a user/operator I would like to have more control over how CAPI balances my control-plane and worker nodes across failure domains. For example, one of my failure domains has fewer infra resources than the others; equal distribution, as is done today, would not work well for me.

As an operator who uses (or wants to use) cluster-autoscaler, I want CAPI failure domains and the cluster autoscaler to play nicely together.
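For the node-labelling user story above, the well-known label in question is `topology.kubernetes.io/zone`; a minimal sketch of what a CAPI-labelled bare-metal Node could look like (node name and zone value are made up):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-0
  labels:
    # Today nothing sets this on bare metal; the ask is that CAPI applies the
    # failure domain it selected for the Machine as the Node's zone label.
    topology.kubernetes.io/zone: rack-a
```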

https://github.com/kubernetes-sigs/cluster-api/issues/7417

Define how the system reacts to failure domain changes; this is a separate problem, but it builds on how we can identify that a failure domain has changed, so IMO the first point should be addressed first.

/kind feature

k8s-ci-robot commented 6 months ago

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
mdbooth commented 6 months ago

I have been giving this some thought recently, specifically in the context of CAPO, but also with a view to how it could be implemented more generally. The two principal problems we have with the current implementation are:

In OpenStack specifically, a 'failure domain' can in practice be an arbitrarily complex set of configurations spanning separate configurations for at least compute, storage, and network. In order to use MachineSpec.FailureDomain we would effectively have to make this a reference to some other data structure. This dramatically increases complexity for both developers and users.

As failure domains are arbitrarily complex configurations, they can change over time. There is currently no component which can recognise that a machine is no longer compliant with its failure domain and perform some remediation.

In OpenShift we have the Control Plane Machine Set operator (CPMS). This works well for us, but this is because, being in OpenShift, it can take a number of liberties which are unlikely to be acceptable in CAPI; specifically, the following are baked directly into the controller:

However, this is the extent of the provider-specific code in CPMS. It's quite a simple interface.

I had an idea that we might be able to borrow ideas from CPMS and the kube scheduler to implement something relatively simple but very flexible. What follows is very rough. It's intended for discussion rather than as a concrete design proposal.

The high level overview is that we would add a FailureDomainPolicyRef to MachineSpec. If a Machine has a FailureDomainPolicyRef, the Machine controller will not create an InfrastructureMachine until the MachineSpec also has a FailureDomainRef.

A user might create:

MachineTemplate:

```yaml
spec:
  template:
    spec:
      ...
      failureDomainPolicyRef:
        apiVersion: ...
        kind: DefaultCAPIFailureDomainPolicy
        name: MyClusterControlPlane
```

DefaultCAPIFailureDomainPolicy:

```yaml
metadata:
  name: MyClusterControlPlane
spec:
  spreadPolicy: Whatever
  failureDomains:
    apiVersion: ...
    kind: OpenStackFailureDomain
    names:
    - AZ1
    - AZ2
    - AZ3
```

OpenStackFailureDomain:

```yaml
metadata:
  name: AZ1
spec:
  computeAZ: Foo
  storageAZ: Bar
  networkAZ: Baz
```

If OpenStackFailureDomain is immutable, it can only be 'changed' by creating a new one and updating the failure domain policy.
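Under that immutability assumption, 'changing' a failure domain would look like creating a replacement object and re-pointing the policy at it, e.g. (purely illustrative, reusing the hypothetical types above):

```yaml
# Replacement for AZ1 with a different storage AZ; the policy's
# failureDomains.names list would then swap AZ1 for AZ1-v2, and the policy
# controller could roll Machines onto the new definition.
metadata:
  name: AZ1-v2
spec:
  computeAZ: Foo
  storageAZ: Qux
  networkAZ: Baz
```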

The failure domain policy controller would watch Machines with a failureDomainPolicyRef. It would assign a failureDomain from the list according to the configured policy. It also has the opportunity to notice that a set of Machines is no longer compliant with the policy and remediate by deleting machines so new, compliant machines can replace them.
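Putting the pieces together, a Machine processed by such a controller might end up looking roughly like this (the `failureDomainPolicyRef`/`failureDomainRef` fields are part of this sketch, not an existing API):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: my-cluster-control-plane-x7k2p
spec:
  clusterName: my-cluster
  # Comes from the machine template created by the user.
  failureDomainPolicyRef:
    apiVersion: ...
    kind: DefaultCAPIFailureDomainPolicy
    name: MyClusterControlPlane
  # Assigned by the failure domain policy controller; until it is set,
  # the Machine controller would not create the InfrastructureMachine.
  failureDomainRef:
    apiVersion: ...
    kind: OpenStackFailureDomain
    name: AZ2
```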

Because the failure domain is now a reference to a provider-specific CRD, the infrastructure machine controller can take provider-specific actions to apply the failure domain to an infrastructure machine.

For users who don't need this complexity, the infrastructure cluster controller could create a default policy, much as it does now, which could be applied to a KCP machine template.

A design like this in the MachineSpec would also have the advantage that it could be used without modification for any set of machines. So, for example, users who want to spread a set of workers in an MD across 2 FDs would be able to do that.
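For instance, a MachineDeployment's machine template could point at its own policy that only lists two of the failure domains (again a sketch using the hypothetical types above):

```yaml
metadata:
  name: MyClusterWorkers
spec:
  spreadPolicy: Whatever
  failureDomains:
    apiVersion: ...
    kind: OpenStackFailureDomain
    names:
    - AZ1
    - AZ2
```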

JoelSpeed commented 6 months ago

I believe something like this would also be effective for vSphere, where failure domains are also complex, as one Kubernetes cluster could in theory span multiple vSphere clusters. Not sure exactly how this is handled in CAPV today.

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/10476#issuecomment-2379432189):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.