metal3-io / cluster-api-provider-metal3

Metal³ integration with https://github.com/kubernetes-sigs/cluster-api
Apache License 2.0

Support for Failure Domains in CAPM3 #402

Open · Arvinderpal opened this issue 2 years ago

Arvinderpal commented 2 years ago

User Story

As an operator who has placed their baremetal infrastructure across different failure domains (FDs), I would like CAPM3 to associate Nodes with BMHs from the desired failure domain.

Detailed Description

CAPI supports failure domains for both control-plane and worker nodes (see the CAPI provider contract for the Provider Machine as well as the Provider Cluster types). Here is the general flow:

  1. CAPI will look for the set of FailureDomains in the ProviderCluster.Spec.
  2. The field is copied to Cluster.Status.FailureDomains.
  3. During KCP or MD scale-up events, an FD is chosen from this set and its value is placed in Machine.Spec.FailureDomain. Currently, CAPI tries to balance Machines equally across all FDs.
  4. Providers are expected to use this chosen FD in the Machine.Spec when deciding where to place the provider-specific machine. In the case of metal3, we want CAPM3 to associate the Metal3Machine with a corresponding BMH in the desired FD (a minimal sketch of the CAPI side follows this list).
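As a rough illustration of steps 2 and 3, here is a minimal sketch of the CAPI side of the flow. The field names (Cluster.status.failureDomains, Machine.spec.failureDomain) come from the CAPI v1beta1 API; the failure-domain names ("rack-1", "rack-2") and resource names are hypothetical.

```yaml
# Sketch of the CAPI flow (steps 2-3); names are hypothetical.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
status:
  # Copied by CAPI from the provider cluster's failure domains.
  failureDomains:
    rack-1:
      controlPlane: true
    rack-2:
      controlPlane: true
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: my-cluster-controlplane-0
spec:
  clusterName: my-cluster
  # Chosen from Cluster.status.failureDomains during scale-up; CAPM3 would
  # use this value to pick a BMH in the matching failure domain.
  failureDomain: rack-1
  # (bootstrap and infrastructureRef omitted for brevity)
```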

BMH Selection using Labels.

  1. The operator labels the BMH resource based on the physical location of the host. For example, the following standard label could be used on the BMH: infrastructure.cluster.x-k8s.io/failure-domain=<my-fd-1>
  2. Today, CAPM3's chooseHost() function associates a Metal3Machine with a specific BMH based on the labels specified in Metal3Machine.Spec.HostSelector.MatchLabels. We can expand this capability.
  3. The HostSelector field is used to narrow down the set of available BMHs that meet the selection criteria. When FDs are in use, we can simply insert the above label into HostSelector.MatchLabels, as sketched after this list.
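To make the idea concrete, here is a minimal sketch of what the resources could look like once the FD label is in place, assuming the label from the chosen FD ends up in the Metal3Machine's host selector (whether inserted by CAPM3 or by the operator). The failure-domain value "my-fd-1" and the resource names are hypothetical; only the fields relevant to selection are shown.

```yaml
# Hypothetical example: a BMH labeled with its failure domain, and the
# selection criteria a Metal3Machine would carry once the FD label is
# added to hostSelector.matchLabels.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-bmh-0
  labels:
    infrastructure.cluster.x-k8s.io/failure-domain: my-fd-1
  # (usual BMH spec fields such as bmc and bootMACAddress omitted)
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Machine
metadata:
  name: my-cluster-controlplane-xyz
spec:
  hostSelector:
    matchLabels:
      # Restricts chooseHost() to BMHs in the desired failure domain.
      infrastructure.cluster.x-k8s.io/failure-domain: my-fd-1
  # (image, dataTemplate, and other fields omitted for brevity)
```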

Anything else you would like to add:

Related issues: https://github.com/kubernetes-sigs/cluster-api/issues/5666 https://github.com/kubernetes-sigs/cluster-api/issues/5667

/kind feature

Arvinderpal commented 2 years ago

/assign

Arvinderpal commented 2 years ago

@fmuyassarov @kashifest @furkatgofurov7 @maelk Appreciate your thoughts on this. I would be happy to put together a PR for this.

furkatgofurov7 commented 2 years ago

@Arvinderpal hi! Thanks for taking this up here and sorry for the late reply. The addition looks interesting, and from going through the linked issues and related PRs, it seems some work on improving FD support in CAPI is ongoing. Also, just wondering: what is the situation with other (bare-metal-focused) providers, do they already support this feature?

MaxRink commented 2 years ago

@furkatgofurov7 CAPV does, via https://doc.crds.dev/github.com/kubernetes-sigs/cluster-api-provider-vsphere/infrastructure.cluster.x-k8s.io/VSphereDeploymentZone/v1beta1@v1.0.2 and https://doc.crds.dev/github.com/kubernetes-sigs/cluster-api-provider-vsphere/infrastructure.cluster.x-k8s.io/VSphereFailureDomain/v1beta1@v1.0.2

The MAAS provider also supports them, at least in its spec: https://doc.crds.dev/github.com/spectrocloud/cluster-api-provider-maas

Rozzii commented 2 years ago

/triage accepted

Arvinderpal commented 2 years ago

Sorry about the delay. FDs for control-plane (KCP) nodes are already supported within CAPI, and I believe all providers follow that approach. For worker nodes, there is still some discussion to be had with the broader CAPI community; there is some initial discussion in https://github.com/kubernetes-sigs/cluster-api/issues/5666 and my PR linked within it.

I think we can start with control-plane nodes. Any thoughts on the approach I outlined above in the issue description?

furkatgofurov7 commented 2 years ago

> CAPV does, via https://doc.crds.dev/github.com/kubernetes-sigs/cluster-api-provider-vsphere/infrastructure.cluster.x-k8s.io/VSphereDeploymentZone/v1beta1@v1.0.2 and https://doc.crds.dev/github.com/kubernetes-sigs/cluster-api-provider-vsphere/infrastructure.cluster.x-k8s.io/VSphereFailureDomain/v1beta1@v1.0.2
>
> The MAAS provider also supports them, at least in its spec: https://doc.crds.dev/github.com/spectrocloud/cluster-api-provider-maas

@MaxRink thanks.

> Sorry about the delay. FDs for control-plane (KCP) nodes are already supported within CAPI, and I believe all providers follow that approach. For worker nodes, there is still some discussion to be had with the broader CAPI community; there is some initial discussion in kubernetes-sigs/cluster-api#5666 and my PR linked within it.

Got it, thanks for the info; I went through them some time ago.

> I think we can start with control-plane nodes. Any thoughts on the approach I outlined above in the issue description?

Agree, but I would suggest opening a proposal for community review and discussing the implementation details there, as we usually do for these kinds of new features.

Arvinderpal commented 2 years ago

Thanks @furkatgofurov7 I'll put a proposal together and share it.

Arvinderpal commented 2 years ago

Here is the PR with the proposal: https://github.com/metal3-io/metal3-docs/pull/249

@furkatgofurov7 @MaxRink @Rozzii PTAL

I will bring it up during our next metal3 office hours as well. Thank you

Arvinderpal commented 2 years ago

@furkatgofurov7 @MaxRink @Rozzii PTAL at the proposal: https://github.com/metal3-io/metal3-docs/pull/249. I would appreciate your feedback, thanks!

metal3-io-bot commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

furkatgofurov7 commented 2 years ago

/remove-lifecycle stale

metal3-io-bot commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

furkatgofurov7 commented 2 years ago

/remove-lifecycle stale

@Arvinderpal Hi, the proposal for this feature was merged some time ago, thanks for working on it! Are there plans to implement it in CAPM3 soon?

sf1tzp commented 1 year ago

Hey @furkatgofurov7 @Arvinderpal, I'd like to give this one a shot if I may. I have a draft PR up at the moment, but I still need to familiarize myself with the testing and polishing requirements of this repo. I hope to get time this week to make some more progress on it.

furkatgofurov7 commented 1 year ago

@f1tzpatrick hi, sure go ahead!

/unassign @Arvinderpal
/assign @f1tzpatrick

metal3-io-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sf1tzp commented 1 year ago

Hey, sorry for the delay on this one. It's still on my todo list! Things have been busy for me lately, but I hope to get this tested sometime soon.

I'll keep you posted! 😃

furkatgofurov7 commented 1 year ago

/remove-lifecycle stale

furkatgofurov7 commented 1 year ago

/lifecycle active

metal3-io-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Rozzii commented 1 year ago

Hi @f1tzpatrick, is this topic still on your TODO list?

/remove-lifecycle stale

sf1tzp commented 1 year ago

Hey @Rozzii, it is but I'm sorry it keeps getting pushed to the back burner. I made some progress in #793 but could use a hand testing it. The metal3-dev-env is still new to me and I haven't had enough time to really sit down and go through the process.

metal3-io-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Rozzii commented 1 year ago

/remove-lifecycle stale

metal3-io-bot commented 9 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Rozzii commented 9 months ago

/remove-lifecycle stale
/lifecycle frozen

I will move this to frozen; this seems to be a legitimate feature, but it keeps going in and out of stale.

sf1tzp commented 8 months ago

@Rozzii thanks, and sorry for the inconvenience. If I get another chance to return to this in 2024, I'll let you know.