kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

Enhance the AWSMachineTemplate to support multiple instance types, enable mixed instances in a MachineDeployment #4118

Open cnmcavoy opened 1 year ago

cnmcavoy commented 1 year ago

/kind feature

Describe the solution you'd like: Enhance the AWSMachineTemplate to support multiple instance types, specifically an ordered list of instance types and spot market options to use when creating the EC2 instance that backs an AWSMachine, in order to handle provisioning failures when the desired instance type is unavailable. The AWSMachine should provision the EC2 instance with the type and spot options of the first item in the list, and only walk the list and use the other configurations if there is insufficient capacity to provision.
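
A rough sketch of what the API shape could look like, as hypothetical Go types (the field names and structure are illustrative, not a final design; SpotMarketOptions refers to the existing CAPA type):

```go
// InstanceDetail describes one candidate configuration for the backing EC2
// instance. Entries are tried strictly in list order.
type InstanceDetail struct {
	// InstanceType is the EC2 instance type, e.g. "m6a.2xlarge".
	InstanceType string `json:"instanceType"`
	// SpotMarketOptions requests a spot instance for this entry; nil means
	// an on-demand instance.
	SpotMarketOptions *SpotMarketOptions `json:"spotMarketOptions,omitempty"`
}

// On AWSMachineSpec, an ordered list of fallbacks:
//   InstanceDetails []InstanceDetail `json:"instanceDetails,omitempty"`
```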

User stories:

Anything else you would like to add: This would enable mixed instance capacities, e.g. an m6a.2xlarge and an m6a.4xlarge in the same MachineDeployment. However, this kind of configuration is unsupported by the Cluster Autoscaler, so we may need to improve documentation to state that mixed capacities are unsupported in CAPI as well.

Environment:

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
Ankitasw commented 1 year ago

Thanks for the issue @cnmcavoy. AFAIK, we already have AWSMachineList, which could be used for this purpose. Could you list out what this proposal offers in addition to what we have today, if AWSMachineList doesn't satisfy your requirements?

Ankitasw commented 1 year ago

cc @richardcase @Skarlso @dlipovetsky to get your views on this. Thinking out loud here about whether to support such configurations: I think it would complicate things for what we have today for AWSMachines. But please feel free to enlighten me if you all feel otherwise.

cnmcavoy commented 1 year ago

AFAIK, we already have AWSMachineList, which could be used for this purpose. Could you list out what this proposal offers in addition to what we have today, if AWSMachineList doesn't satisfy your requirements?

I'd be interested in how I can use that to answer either of these user stories. Are we talking about this resource?

```
$ kubectl explain awsmachinelist

KIND:     AWSMachineList
VERSION:  infrastructure.cluster.x-k8s.io/v1beta2

DESCRIPTION:
     AWSMachineList is a list of AWSMachine
...
```

The context of the problem being solved in both cases is an autoscaling environment, and I'm not sure how an AWSMachineList would work there... Perhaps I need to make this issue's description clearer?

enxebre commented 1 year ago

I'm really interested in supporting these use cases as well. However, to satisfy those stories:

How would scale down work? It would just pick a machine based on the current core strategies random/newest/oldest without accounting for any of the above, which is less than optimal.

Also, to enable both stories, especially the second one, I'd expect the controllers to rely on and delegate to the AWS-native resources that support this, e.g. ASG mixed instances groups with instance weighting: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-mixed-instances-groups-instance-weighting.html. Then, to abstract away that AWS-native behaviour, I question whether MachineDeployment is the best resource, since this would break many of the current and historical assumptions, i.e. Machine homogeneity, which has consequences as stated above for things like scaling down, autoscaling, MHC... In my mind this use case probably belongs to a different abstraction, e.g. AWSMachinePool or something else.

cnmcavoy commented 1 year ago

How would scale down work? It would just pick a machine based on the current core strategies random/newest/oldest without accounting for any of the above, which is less than optimal.

This is a good question. Scale-down logic for picking which Machine to terminate is driven by the MachineSet, which CAPA doesn't have any ownership of. MachineSets have a few possible DeletePolicy configurations, Random, Newest, and Oldest, which would not be aware of this underlying change and would kill Machines regardless of their position in the instanceDetails list. I didn't consider this an initial blocker because the core problem I was focused on was provisioning failures, but I think this is worth thinking about...

One naive approach would be to take advantage of the existing Machine annotation cluster.x-k8s.io/delete-machine and set it on Machines created with a second-order instance type. This would be a simple solution we could do entirely on the CAPA side. MachineSets currently use this annotation mostly for the Cluster Autoscaler to mark nodes for downscaling. The drawback of the existing annotation is that it supersedes the delete policy entirely, causing any Machine with the annotation set to be deleted first, in preference to whatever deletion policy is configured.
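
Roughly, as a minimal sketch (the annotation key is the real CAPI one; fallbackIndex is a hypothetical variable holding the index of the instanceDetails entry that actually satisfied provisioning):

```go
// After the EC2 instance is created, mark Machines backed by a fallback
// instance type for preferential deletion on scale down. The MachineSet
// keys off the annotation's presence, not its value.
if fallbackIndex > 0 {
	if machine.Annotations == nil {
		machine.Annotations = map[string]string{}
	}
	machine.Annotations["cluster.x-k8s.io/delete-machine"] = "true"
}
```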

A better approach might be to expose an integer DeletionPriority field on the instanceDetails list object, so users could specify the order in which these Machines should be deleted (with the value defaulting to the array element index). The AWSMachine controller could annotate the Machine resource with the correct order after the EC2 instance is created. However, this would necessitate changes to CAPI, so that the configured MachineSet DeletePolicy is respected and the order is considered a dimension of that policy.
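
To illustrate (everything here is hypothetical: the DeletionPriority field would be added to the InstanceDetail sketch above, and the annotation key is made up, since the CAPI side doesn't exist yet):

```go
// Propagate the configured priority onto the Machine so a priority-aware
// MachineSet DeletePolicy (which would require CAPI changes) could order
// deletions. index is the Machine's position in the instanceDetails list.
func setDeletionPriority(machine *clusterv1.Machine, detail InstanceDetail, index int) {
	priority := index // default: the array element index
	if detail.DeletionPriority != nil {
		priority = *detail.DeletionPriority
	}
	if machine.Annotations == nil {
		machine.Annotations = map[string]string{}
	}
	machine.Annotations["cluster.x-k8s.io/deletion-priority"] = strconv.Itoa(priority)
}
```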

I'm not sure how necessary making this decision is on a first pass, but I am curious what others think.


Also, to enable both stories, especially the second one, I'd expect the controllers to rely on and delegate to the AWS-native resources that support this, e.g. ASG

ASGs have fundamentally different behavior than MachineDeployments. I took the approach of implementing this because bringing MachinePools up to parity with MachineDeployments is an even larger amount of work. :)

Also, I'm not sure an AWSMachineTemplate needs to describe only one exact type of EC2 machine. They are certainly implemented that way today, but I see that mostly as an artifact of how the Cluster Autoscaler limits node group definitions.

MachinePools do support mixed instances in CAPA already, but we lack ways to define lifecycles around the machines they create.

enxebre commented 1 year ago

the core problem I was focused on was provisioning failures, but I think this is worth thinking about...

If we scope this change to this particular and concise use case, i.e. "provisioning failures", I have no objections to the proposed implementation, as long as we explicitly document all the limitations, e.g. unhandled involuntary interruptions, scale down (fwiw I'd be in favour of not perpetuating delete-machine for new use cases and would rather think of a more atomic solution), autoscaling... so the UX impact of a MachineDeployment resulting in a heterogeneous group of Nodes doesn't come as unexpected for consumers.

For anything more sophisticated, like allocation strategies (capacity-optimized, diversified, capacity-optimized-prioritized, lowest-price, price-capacity-optimized...) or percentages of On-Demand Instances and Spot Instances, I'd expect us to think about and propose a longer-term solution where we delegate decisions to the provider-native capabilities, e.g. the EC2 Fleet API, rather than reimplementing the logic in a Kubernetes controller. This might be backed by a scalable resource of some sort: a special MachineDeployment, a MachinePool, some kind of Karpenter integration, or something else...

cnmcavoy commented 1 year ago

If we scope this change to this particular and concise use case, i.e. "provisioning failures", I have no objections to the proposed implementation, as long as we explicitly document all the limitations, e.g. unhandled involuntary interruptions, scale down (fwiw I'd be in favour of not perpetuating delete-machine for new use cases and would rather think of a more atomic solution), autoscaling... so the UX impact of a MachineDeployment resulting in a heterogeneous group of Nodes doesn't come as unexpected for consumers.

Sounds good, and this is already in line with the proposed implementation: it only handles capacity provisioning failures; other errors block progress in reconciliation the same way they do today. I will take some time and update the attached PR to include this documentation.
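
For reference, the intended fallback behaviour as a minimal sketch (runInstance and isInsufficientCapacity are hypothetical helpers standing in for the real EC2 service call and error matching):

```go
// Try each instanceDetails entry in order; fall through only on
// insufficient-capacity errors, and surface anything else immediately so
// reconciliation blocks the same way it does today.
func provisionWithFallback(ctx context.Context, details []InstanceDetail) (*Instance, error) {
	var lastErr error
	for _, d := range details {
		instance, err := runInstance(ctx, d)
		if err == nil {
			return instance, nil
		}
		if !isInsufficientCapacity(err) {
			return nil, err
		}
		lastErr = err
	}
	return nil, fmt.Errorf("no entry in instanceDetails had capacity: %w", lastErr)
}
```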

For anything more sophisticated, like allocation strategies (capacity-optimized, diversified, capacity-optimized-prioritized, lowest-price, price-capacity-optimized...) or percentages of On-Demand Instances and Spot Instances, I'd expect us to think about and propose a longer-term solution where we delegate decisions to the provider-native capabilities, e.g. the EC2 Fleet API, rather than reimplementing the logic in a Kubernetes controller. This might be backed by a scalable resource of some sort: a special MachineDeployment, a MachinePool, some kind of Karpenter integration, or something else...

Separately, I'm also interested in this, but I agree it's out of scope.

dlipovetsky commented 1 year ago

I consider MachineDeployments to be relatively low-level abstractions.

Thinking out loud, if I need to deploy 5 different instance types, I can create a MachineDeployment for each type. If I need to then scale these, as a group, according to some strategy, I can build a new controller to do that.

If there are reasons that it is impossible (or very impractical) to use multiple MachineDeployments in this way, I would like to understand them. :pray:

I have a similar view on #3203.

cnmcavoy commented 1 year ago

Thinking out loud, if I need to deploy 5 different instance types, I can create a MachineDeployment for each type. If I need to then scale these, as a group, according to some strategy, I can build a new controller to do that.

We use MachineDeployments in this pattern today, with the Cluster Autoscaler as the controller that scales the MachineDeployments. It works, but has serious drawbacks. The Cluster Autoscaler does not get feedback from CAPA that its scale-up cannot be met, so this results in the creation of Machines which never advance past "Provisioning". Then we have to wait for the Cluster Autoscaler's max-node-provision-time to be exceeded and for it to choose another MachineDeployment to scale. This is a frankly slow process (a pending pod may not be satisfied for an hour in a realistic scenario using the default CA settings), whereas it could be nearly instant if we moved the loop into CAPA.

If there are reasons that it is impossible (or very impractical) to use multiple MachineDeployments in this way, I would like to understand them. :pray:

Check out User Story 2, regarding mixed spot and on-demand instances: how would I solve this with CAPA today with multiple MachineDeployments? I would be interested; I'm not aware of how we can solve this at present.
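
For concreteness, under the proposed API that story might be expressed as (illustrative values; see the InstanceDetail sketch in the issue description):

```go
// Prefer a spot m6a.2xlarge; fall back to the same type on-demand if spot
// capacity is unavailable.
details := []InstanceDetail{
	{InstanceType: "m6a.2xlarge", SpotMarketOptions: &SpotMarketOptions{}},
	{InstanceType: "m6a.2xlarge"}, // on-demand fallback
}
```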

If there is a larger story around supporting this kind of use case, can you link the issues for those? I'm not married to this approach; I'm interested in solving this problem in whatever way is best for CAPA.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Skarlso commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

richardcase commented 4 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten