kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

Review and validate failure domain support for placement groups, local and wavelength availability zones #1888

Open randomvariable opened 4 years ago

randomvariable commented 4 years ago

/kind feature

Describe the solution you'd like

CAPA's support for failure domains only takes into consideration Availability Zones.

Inherent in the CAPI model, failure domains are a sort of controller provided property, and don't provide much flexibility for users to define their own.

In AWS, as of August 2020, failure domains include the following dimensions:

* Regions
* Availability Zones
* Placement Groups
  * Cluster: machine colocality for HPC
  * Partition: anti-affinity across logical partitions (i.e. racks inside an AZ)
  * Spread: strict placement of small groups of instances across distinct hardware
* Local Zones: an AZ within a particular city, for metro-network access
* Wavelength Zones: like Local Zones, but tied to a particular cellular network carrier

The current implementation of failure domains only takes AZs within a region into account.
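For context, the distinction between these zone types is visible directly in the EC2 API: DescribeAvailabilityZones reports a ZoneType for every zone once AllAvailabilityZones is set. A minimal sketch (illustration only, not CAPA code) using aws-sdk-go-v2:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		panic(err)
	}
	client := ec2.NewFromConfig(cfg)

	// AllAvailabilityZones=true includes Local Zones and Wavelength Zones,
	// which the default response filters out.
	out, err := client.DescribeAvailabilityZones(context.TODO(), &ec2.DescribeAvailabilityZonesInput{
		AllAvailabilityZones: aws.Bool(true),
	})
	if err != nil {
		panic(err)
	}
	for _, zone := range out.AvailabilityZones {
		// ZoneType is "availability-zone", "local-zone", or "wavelength-zone".
		fmt.Printf("%-20s %s\n", aws.ToString(zone.ZoneName), aws.ToString(zone.ZoneType))
	}
}
```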

Anything else you would like to add:

Environment:

- Cluster-api-provider-aws version:
- Kubernetes version: (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

/milestone next
/priority important-longterm

k8s-ci-robot commented 4 years ago

@randomvariable: The provided milestone is not valid for this repository. Milestones in this repository: [Next, v0.6.0, v0.6.1, v0.6.x]

Use /milestone clear to clear the milestone.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/1888):

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

randomvariable commented 3 years ago

/lifecycle frozen
/milestone v0.7.0

sedefsavas commented 2 years ago

/triage accepted

JoelSpeed commented 2 years ago

Are there any detailed plans formed around this feature request? Since it seems to suggest extending the existing failure domain primitives, I'm wondering how one would use placement groups in conjunction with availability zones.

As far as I understand, the two are not mutually exclusive and so I would expect the failure domain to stay as the availability zone, but also have the ability to specify the name of a placement group as well.

IIUC this is similar to the availability set support in CAPZ, where you can specify an availability set as well as an availability zone if you desire to. Notably as well, I believe CAPZ will create an availability set if it doesn't exist but is specified, and will also delete it once it's no longer required.
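For illustration, a create-if-missing flow analogous to that CAPZ availability-set behaviour could look roughly like the following with aws-sdk-go-v2. The ensurePlacementGroup helper and the partition defaults are hypothetical, not existing CAPA code:

```go
package placement

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// ensurePlacementGroup creates the named placement group if it does not
// already exist, mirroring the create-if-absent behaviour described above.
func ensurePlacementGroup(ctx context.Context, client *ec2.Client, name string) error {
	// Filtering by group-name returns an empty list (rather than an error)
	// when the group does not exist.
	out, err := client.DescribePlacementGroups(ctx, &ec2.DescribePlacementGroupsInput{
		Filters: []types.Filter{{Name: aws.String("group-name"), Values: []string{name}}},
	})
	if err != nil {
		return err
	}
	if len(out.PlacementGroups) > 0 {
		return nil // already exists, nothing to do
	}
	_, err = client.CreatePlacementGroup(ctx, &ec2.CreatePlacementGroupInput{
		GroupName: aws.String(name),
		// Partition strategy with a count of 3 is an illustrative default only.
		Strategy:       types.PlacementStrategyPartition,
		PartitionCount: aws.Int32(3),
	})
	return err
}
```

The symmetric delete-when-unreferenced step CAPZ performs would presumably hang off owner references or cluster teardown; that part is not sketched here.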

I have been working on fleshing out a POC for placement groups within the OpenShift AWS MAPI provider, so would be happy to contribute towards adding placement group support to CAPA as well.

sedefsavas commented 2 years ago

@JoelSpeed There is no active work going on for this. Availability zones, placement groups, and possibly the other dimensions in this issue may all be suitable to group together, since they all relate to instance distribution. I haven't checked what CAPZ is doing yet. It would be great to have a proposal or ADR for this one.

JoelSpeed commented 2 years ago

Just wanted to add a little more colour to the placement group discussion. We've been discussing this quite a bit within OpenShift, in particular how placement groups should be configured.

Originally we had proposed that the configuration would be part of the MachineTemplate and that the group would be created based on the configuration in the template. However, it was identified that if different configurations were present in different templates, placement group creation could be non-deterministic: whichever Machine is processed first would win the config, and later Machines might carry different template values.

It seems to me like we need a separate resource (a new CRD?) to be created to represent the placement group. Alternatively, should this be considered part of the AWSCluster? If we define a list of placement groups as part of the AWSCluster and have the cluster controller reconcile these, then they should be available for the Machines to use as soon as the cluster is set up. Do we have any rules/guidelines for what should/shouldn't become part of the AWSCluster?
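To make the AWSCluster option concrete, a hypothetical API shape (all names invented for discussion; nothing like this exists in CAPA today) might look like:

```go
package v1 // hypothetical API version for illustration

// PlacementGroupSpec is a hypothetical cluster-scoped declaration of a
// placement group, reconciled by the AWSCluster controller so the group
// exists before any Machine references it.
type PlacementGroupSpec struct {
	// Name is the EC2 placement group name that Machines would reference.
	Name string `json:"name"`

	// Strategy is one of cluster, partition, or spread.
	Strategy string `json:"strategy"`

	// PartitionCount applies only when Strategy is partition.
	// +optional
	PartitionCount *int32 `json:"partitionCount,omitempty"`
}

// On AWSClusterSpec, the list would sit alongside the existing fields:
//
//	PlacementGroups []PlacementGroupSpec `json:"placementGroups,omitempty"`
```

Declaring the groups on AWSCluster would keep creation deterministic (a single reconciler owns them) and gives a natural place to hook deletion into cluster teardown, matching the CAPZ availability-set behaviour noted earlier.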

richardcase commented 2 years ago

/remove-lifecycle frozen

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

richardcase commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 8 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

vincepri commented 7 months ago

/lifecycle frozen