kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

AWS Cluster Autoscaler: Multiple options for estimating size of Mixed Instance ASGs #3217

Closed · otterley closed this issue 3 years ago

otterley commented 4 years ago

When it encounters an EC2 Auto Scaling Group with a Mixed Instances Policy, the AWS cloud provider for Cluster Autoscaler uses the first instance type in the policy to estimate the capacity of the node that will be delivered when scaling out. As pointed out in #2057, if that estimate is too large and the instance actually provisioned is smaller in some dimension than the first instance type in the list, the pending pod may not fit on it.

A simple solution to this problem, also noted in #2057, is to walk the list of instance types and build a node template whose dimensions (CPU, memory, etc.) are each the minimum of the corresponding dimension across all of the types.
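
For illustration, here is a minimal Go sketch of that walk (not the actual cloud provider code); the `instanceSpec` type and the capacity table are hypothetical stand-ins for what the AWS provider derives from EC2 instance data.

```go
package main

import "fmt"

// instanceSpec is a hypothetical, simplified view of an instance type's capacity.
type instanceSpec struct {
	VCPU     int64 // vCPUs
	MemoryMB int64 // memory in MiB
}

// capacities is a placeholder lookup table for the example; the real provider
// builds this information from EC2 instance type data.
var capacities = map[string]instanceSpec{
	"c5.xlarge": {VCPU: 4, MemoryMB: 8192},
	"c5.large":  {VCPU: 2, MemoryMB: 4096},
}

// minTemplate returns a template whose every dimension is the minimum of that
// dimension across all instance types in the Mixed Instances Policy.
func minTemplate(instanceTypes []string) instanceSpec {
	tmpl := capacities[instanceTypes[0]]
	for _, it := range instanceTypes[1:] {
		spec := capacities[it]
		if spec.VCPU < tmpl.VCPU {
			tmpl.VCPU = spec.VCPU
		}
		if spec.MemoryMB < tmpl.MemoryMB {
			tmpl.MemoryMB = spec.MemoryMB
		}
	}
	return tmpl
}

func main() {
	// With [c5.xlarge, c5.large] this prints {VCPU:2 MemoryMB:4096}.
	fmt.Printf("%+v\n", minTemplate([]string{"c5.xlarge", "c5.large"}))
}
```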

One potential drawback of this approach is that it can cause CA to overestimate the number of instances to provision. For example, if the policy contains [c5.xlarge, c5.large] and the estimated node capacity is based on c5.large, CA could provision two c5.xlarge instances where only one is needed. This has a cost impact for the customer, albeit a temporary one, because the extra instance will eventually be terminated for idleness.

We already have a method to override the estimator by applying tags to the Auto Scaling Group: setting the tags k8s.io/cluster-autoscaler/node-template/resources/cpu and k8s.io/cluster-autoscaler/node-template/resources/memory makes the Cluster Autoscaler use those values instead of relying on auto-discovery. This method does require the cluster administrator to remember to update the tags whenever a smaller instance type is added to the Mixed Instances Policy.
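
For concreteness, a sketch of setting those override tags with the AWS SDK for Go (v1); the tag keys are the ones mentioned above, while the ASG name and the resource values are placeholders you would replace with your own.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Helper for building a tag on the (placeholder) ASG "my-mixed-asg".
	tag := func(key, value string) *autoscaling.Tag {
		return &autoscaling.Tag{
			ResourceId:        aws.String("my-mixed-asg"),
			ResourceType:      aws.String("auto-scaling-group"),
			Key:               aws.String(key),
			Value:             aws.String(value),
			PropagateAtLaunch: aws.Bool(false),
		}
	}

	// Values chosen to match the smallest instance type in the policy.
	_, err := svc.CreateOrUpdateTags(&autoscaling.CreateOrUpdateTagsInput{
		Tags: []*autoscaling.Tag{
			tag("k8s.io/cluster-autoscaler/node-template/resources/cpu", "2"),
			tag("k8s.io/cluster-autoscaler/node-template/resources/memory", "4Gi"),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```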

I think it makes sense to revisit the question of whether to change the existing estimation algorithm.

On the one hand, if we leave it alone, we continue to bear the risk of provisioning an instance that is too small to fit the pending pod. If this happens, it can take over 15 minutes (using current defaults) for CA to determine that the pod is still waiting and take some other action. This delay can be unacceptable to many customers.

On the other hand, if we change the estimation algorithm to use the minimum size, customers run the risk of overshooting and provisioning too many instances, which incurs needless cost.

The third option is to make the behavior selectable by the user, either via a command-line argument or an environment variable. If we choose this, we must agree on a sensible default.
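
Purely as a sketch of what "selectable" could look like (the flag and environment variable names below do not exist today and are hypothetical):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// Hypothetical flag: "first" keeps today's behavior, "minimum" uses the
	// minimum-of-dimensions template discussed above.
	strategy := flag.String("aws-mixed-instances-estimation", "minimum",
		"node template strategy for Mixed Instances ASGs: 'first' or 'minimum'")
	flag.Parse()

	// Hypothetical environment variable override.
	if env := os.Getenv("AWS_MIXED_INSTANCES_ESTIMATION"); env != "" {
		*strategy = env
	}

	switch *strategy {
	case "first", "minimum":
		fmt.Println("using estimation strategy:", *strategy)
	default:
		fmt.Fprintln(os.Stderr, "unknown estimation strategy:", *strategy)
		os.Exit(1)
	}
}
```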

I would argue that for most customers, guaranteeing pod fit and avoiding scheduling delays is more important than potential overspend. But I would like to hear the community's opinion here.

Jeffwan commented 4 years ago

/assign @Jeffwan

ari-becker commented 4 years ago

Prior discussion on documenting alternatives for Mixed Instances ASGs that work well with the autoscaler: https://github.com/kubernetes/autoscaler/issues/2786

aermakov-zalando commented 4 years ago

@Jeffwan This is exactly what we did in our fork (alongside a bunch of other changes to make it work). We'd be glad to have a short sync/knowledge exchange with the EKS team, maybe we wouldn't need to maintain our own fork anymore :).

jrake-revelant commented 4 years ago

fyi @jaypipes as we had a short exchange with you on this topic last summer.

Jeffwan commented 4 years ago

@aermakov-zalando @jrake-revelant Yeah, I will reach out to you and let's see if we can land a good solution in upstream CA. We do see similar requests from some other users.

Jeffwan commented 4 years ago

@aermakov-zalando @jrake-revelant

From your docs: "The template node used for simulating the scale-up will use the minimum values for all resources." It seems you use minimum values as well. This can bring up extra nodes, which some users may not want. Did you see any extra cost there?

I think the most proper way might be to have a strategy there that the user can configure to build the template. Any opinions?

aermakov-zalando commented 4 years ago

@Jeffwan the extra nodes will be very quickly scaled down again, so the cost impact is minimal. Not using a minimal node can be even worse cost-wise because the autoscaler can end up in a loop where it'll continuously keep bringing up nodes that can't fit a pod that's bigger than the minimum node.

jrake-revelant commented 4 years ago

@Jeffwan we are still open to having a dedicated session on this topic

linki commented 4 years ago

Another problem with "overshooting", besides additional cost, is instability of workloads. We have a customer that brings up several machine-learning-type jobs at once. They all get scheduled, but since CA heavily overshoots, a lot of the spun-up nodes are subsequently terminated, leaving a lot of jobs unfinished. "Undershooting" would be the more desirable strategy for said customer.

As @aermakov-zalando explains, using the minimum to decide whether an ASG is suitable to host a particular Pod is vital in order not to end up in an endless loop. However, for estimating how many nodes we need, we could use different values, such as the maximum (for drastic undershooting) or, probably better, some sensible average.

For example, if the policy contains [c5.2xlarge, c5.xlarge, c5.large], the estimation of whether a Pod can run at all should be based on c5.large (2 vCPUs). Let's assume each Pod needs 1 vCPU, so it can run on any of those instance types, and we want a total of 100 replicas; that means we need a total of 100 vCPUs. Using the minimum approach (c5.large, 2 vCPUs), CA would scale the ASG to 50. Since AWS would bring up the bigger instance types as well, that would probably be far too many. If we instead based the estimation on c5.xlarge (4 vCPUs) or c5.2xlarge (8 vCPUs), we would bring up 25 or 13 nodes respectively in the first scaling iteration. This might be closer to the final value we want and would avoid the additional cost and instability.
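
A small Go sketch of that arithmetic, assuming 1 vCPU per Pod and 100 replicas as above; the only thing that varies is which instance type the per-node estimate is based on.

```go
package main

import (
	"fmt"
	"math"
)

// nodesNeeded rounds up, since a partial node still means one more instance.
func nodesNeeded(totalVCPUs, vcpusPerNode int) int {
	return int(math.Ceil(float64(totalVCPUs) / float64(vcpusPerNode)))
}

func main() {
	const replicas, vcpusPerPod = 100, 1
	total := replicas * vcpusPerPod // 100 vCPUs in total

	for _, c := range []struct {
		instanceType string
		vcpus        int
	}{
		{"c5.large", 2},   // minimum
		{"c5.xlarge", 4},  // middle
		{"c5.2xlarge", 8}, // maximum
	} {
		fmt.Printf("estimate based on %-10s -> scale ASG by %d nodes\n",
			c.instanceType, nodesNeeded(total, c.vcpus))
	}
	// Prints 50, 25, and 13, matching the numbers in the comment above.
}
```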

ranshn commented 4 years ago

> Another problem with "overshooting", besides additional cost, is instability of workloads. We have a customer that brings up several machine-learning-type jobs at once. They all get scheduled, but since CA heavily overshoots, a lot of the spun-up nodes are subsequently terminated, leaving a lot of jobs unfinished. "Undershooting" would be the more desirable strategy for said customer.
>
> As @aermakov-zalando explains, using the minimum to decide whether an ASG is suitable to host a particular Pod is vital in order not to end up in an endless loop. However, for estimating how many nodes we need, we could use different values, such as the maximum (for drastic undershooting) or, probably better, some sensible average.
>
> For example, if the policy contains [c5.2xlarge, c5.xlarge, c5.large], the estimation of whether a Pod can run at all should be based on c5.large (2 vCPUs). Let's assume each Pod needs 1 vCPU, so it can run on any of those instance types, and we want a total of 100 replicas; that means we need a total of 100 vCPUs. Using the minimum approach (c5.large, 2 vCPUs), CA would scale the ASG to 50. Since AWS would bring up the bigger instance types as well, that would probably be far too many. If we instead based the estimation on c5.xlarge (4 vCPUs) or c5.2xlarge (8 vCPUs), we would bring up 25 or 13 nodes respectively in the first scaling iteration. This might be closer to the final value we want and would avoid the additional cost and instability.

What happens if you have multiple ASGs and use the least-waste expander? For example:

- c5.large, c5a.large, c5d.large, c5ad.large, c5dn.large, c5n.large, c4.large
- c5.xlarge, c5a.xlarge, c5d.xlarge, c5ad.xlarge, c5dn.xlarge, c5n.xlarge, c4.xlarge
- c5.2xlarge, c5a.2xlarge, c5d.2xlarge, c5ad.2xlarge, c5dn.2xlarge, c5n.2xlarge, c4.2xlarge

This way, you're diversifying each ASG to tap into more Spot capacity pools (which is the purpose here, with bonus points for using the capacity-optimized allocation strategy). While I'm not an expert in the least-waste expander code, and I don't have exhaustive experience with testing use cases, I have seen two customer cases where, with this ASG setup, the ASG with the smallest instance types was selected every time a scaling activity could fit on the smallest types, and the ASG did not fail to scale due to capacity issues (which should be rare when diversified this way and at the scale in your example).

So I think two approaches can be identified here: the first is having multiple instance sizes in a single ASG and having CA estimate according to the smallest instance type; the second is this multi-ASG setup with same-sized instance types in each ASG. With the latter, I have a couple of concerns when comparing the approaches:

  1. More ASGs, which is never great.
  2. "Failing over" between node groups (when the node group that was initially selected is unable to scale, mainly due to capacity issues) has notoriously been a bit painful, with lingering bugs and a timeout (max-node-provision-time, I think?) that's hard to configure properly.

I still think this is a really viable approach, and I've seen AWS users adopt it very successfully.

aermakov-zalando commented 4 years ago

@ranshn Unfortunately, as you've said, failover doesn't work that well, especially with the AWS cloud provider. It's still somewhat feasible with a small number of node pools (and even then we've had to fix some things in our fork), but if we can avoid it, that's definitely the better choice. Additionally, AWS has way more visibility into Spot availability in general, so they can avoid giving out instances that will be immediately taken away 5 minutes later, which would be the case if we tried to do this on our side.

ellistarn commented 4 years ago

A few thoughts:

@ranshn is absolutely correct in:

> What happens if you have multiple ASGs and use the least-waste expander? For example:
>
> - c5.large, c5a.large, c5d.large, c5ad.large, c5dn.large, c5n.large, c4.large
> - c5.xlarge, c5a.xlarge, c5d.xlarge, c5ad.xlarge, c5dn.xlarge, c5n.xlarge, c4.xlarge
> - c5.2xlarge, c5a.2xlarge, c5d.2xlarge, c5ad.2xlarge, c5dn.2xlarge, c5n.2xlarge, c4.2xlarge

It's considered a best practice to separate spot capacity into similarly sized ASGs. This does result in multiple ASGs, which can have scalability limitations. If these limitations are becoming a blocker, I can see combining these, but as many of you have noted, it violates a core CA assumption and results in side effects.

I'll also echo the statement that using the minimum value for each resource dimension is essential to avoid bringing up nodes that can't actually fit the pods. Thus, I'm weakly in the camp that this is the right option in all cases.

With regard to the overshooting comment:

> Another problem with "overshooting", besides additional cost, is instability of workloads. We have a customer that brings up several machine-learning-type jobs at once. They all get scheduled, but since CA heavily overshoots, a lot of the spun-up nodes are subsequently terminated, leaving a lot of jobs unfinished. "Undershooting" would be the more desirable strategy for said customer.

It's possible to annotate these pods with cluster-autoscaler.kubernetes.io/safe-to-evict=false, which you should be doing regardless to help protect your expensive-to-evict workloads.

Given this, the only downside I see with the minimum resource approach is the potential for extra cost if a larger node is spun up. This side effect is fundamental to the decision of using MixedInstancePolicies with instance types of different sizes, so I see this as a (necessary) acceptable tradeoff.

@Jeffwan do you have any other concerns?

ranshn commented 4 years ago

> @ranshn Unfortunately, as you've said, failover doesn't work that well, especially with the AWS cloud provider. It's still somewhat feasible with a small number of node pools (and even then we've had to fix some things in our fork), but if we can avoid it, that's definitely the better choice.

I think this is really workload dependent. Waiting 10 minutes for an ASG to fail to fulfill capacity before moving to the next one might not work for an e-commerce website, where being under-provisioned could be costing money. But I was replying to a user describing provisioning nodes for machine learning training jobs (AFAIU), and I believe that in many cases these time-insensitive, batch/job-style workloads should be OK with possibly, in rare cases, waiting for this failover.

> Additionally, AWS has way more visibility into Spot availability in general, so they can avoid giving out instances that will be immediately taken away 5 minutes later, which would be the case if we tried to do this on our side.

I might be misunderstanding this part of your comment, but as it stands, what you described here is not how Spot, ASG, or the capacity-optimized allocation strategy work. An ASG configured with multiple instance types and the capacity-optimized allocation strategy will work to provision instances from the most-available capacity pools, but it can't do anything to avoid provisioning instances that will be taken X minutes later; there's no guaranteed run time for Spot in any way. I'd love to understand your claim about why an ASG with multiple instance sizes is better in that regard than multiple same-sized, diversified ASGs.

aermakov-zalando commented 4 years ago

@ranshn According to multiple AWS support articles, for example this one, capacity-optimized pools will provide instances of instance types with higher availability, and thus the probability of disruption will be lower. It even says that right there: “This works well for workloads such as big data and analytics, image and media rendering, machine learning, and high performance computing that may have a higher cost of interruption. By offering the possibility of fewer interruptions, the capacity-optimized strategy can lower the overall cost of your workload.”

ranshn commented 4 years ago

> @ranshn According to multiple AWS support articles, for example this one, capacity-optimized pools will provide instances of instance types with higher availability, and thus the probability of disruption will be lower. It even says that right there: “This works well for workloads such as big data and analytics, image and media rendering, machine learning, and high performance computing that may have a higher cost of interruption. By offering the possibility of fewer interruptions, the capacity-optimized strategy can lower the overall cost of your workload.”

This is true, but it's not how you described it: "avoid giving out instances that will be immediately taken away 5 minutes later". All I'm saying is, for the sake of accuracy, there's no guaranteed minimum run time, even though one could have been inferred from your comment; I'm merely addressing the way you phrased the capabilities of this feature. Also, I still don't get why, in the context of using capacity-optimized, an ASG with different instance sizes would be better than multiple ASGs that each have diversified same-sized instance types.

aermakov-zalando commented 4 years ago

> Also, I still don't get why, in the context of using capacity-optimized, an ASG with different instance sizes would be better than multiple ASGs that each have diversified same-sized instance types.

Because if you have multiple spot pools, where some are close to unavailable and others aren't, it's possible that CA will always choose the problematic ones, and will continue getting instances that would be immediately terminated. With a single capacity-optimized pool AWS will presumably provide better instances in this case.

ranshn commented 4 years ago

I think this is the first time that decreasing Spot interruptions with capacity-optimized has been discussed in this thread, and I think we're still not on the same wavelength, so let me take a step back and clarify what I mean. When you say that CA chooses problematic pools, I suspect you're referring to single-instance-type ASGs, which is definitely not what I'm suggesting, and I agree that if we point CA to multiple ASGs that each have a single instance type, we could be at risk of having a bad experience with Spot Instances because we're not leveraging the capacity-optimized allocation strategy.

What I'm actually suggesting is multiple ASGs, each with different instance types that are all the same size (e.g. c5.xlarge, c5a.xlarge, c5d.xlarge, etc.). Then, on each scaling activity, CA chooses one of the ASGs (according to the expander), and within that ASG the scaling activity happens from the most-available pool. We've seen customers cut down their interruptions like this while still achieving their desired scale.

So now the comparison that I'm actually making is:

  1. Pointing CA to a single ASG with, for example, m4.2xlarge, m4.4xlarge, m5.2xlarge, m5.4xlarge, r4.2xlarge, r5.2xlarge (this example is from Henning's blog post, but I don't know if it's representative of what you guys are typically doing), and having CA scale according to the minimum hardware dimensions, which is what the fork is doing and what is suggested in this thread. When the ASG's desired capacity is increased, it will choose the instances from the most-available pools in each AZ, which is great for decreasing interruption rates and the chances of instance thrashing.
  2. Pointing CA to multiple same-sized ASGs, for example (I'm working off the same instance types example because I assume that the minimum pod requirements or the initially qualified instance type is m*.2xlarge):
     - m4.2xlarge, m5.2xlarge, m5a.2xlarge, m5d.2xlarge, m5ad.2xlarge, m5n.2xlarge, m5dn.2xlarge
     - m4.4xlarge, m5.4xlarge, m5a.4xlarge, m5d.4xlarge, m5ad.4xlarge, m5n.4xlarge, m5dn.4xlarge

     Now, except for waiting a few minutes for a failover in case one ASG is unable to fulfill any Spot capacity (which should be a rare occurrence or non-existent for many customers, depending on the scale, region, etc.), and having a larger number of ASGs, I'm hard pressed to understand the disadvantage of using this approach, which already works great with current upstream CA on AWS.

aermakov-zalando commented 4 years ago

Yes, it will obviously work better than single-type ASGs. However, because of numerous issues with failover, especially in upstream CA and AWS cloud provider, it's still possible that a shortage of instances will leave the pods in the Pending state for a significant amount of time (half an hour, maybe more, depending on CA configuration and the setup). Since we've already seen severe instance shortages across multiple types even with on-demand in the region we're primarily using, I don't think this is that unlikely.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/3217#issuecomment-761560727):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.