/assign @Jeffwan
Prior discussion on documenting alternatives for Mixed Instances ASGs that work well with the autoscaler: https://github.com/kubernetes/autoscaler/issues/2786
@Jeffwan This is exactly what we did in our fork (alongside a bunch of other changes to make it work). We'd be glad to have a short sync/knowledge exchange with the EKS team, maybe we wouldn't need to maintain our own fork anymore :).
fyi @jaypipes as we had a short exchange with you on this topic last summer.
@aermakov-zalando @jrake-revelant Yeah, I will reach out to you and let's see if we can land a good solution in upstream CA. We do see similar requests from some other users.
@aermakov-zalando @jrake-revelant
From your docs: "The template node used for simulating the scale-up will use the minimum values for all resources."
It seems you use minimum values as well. This brings up extra nodes, which some users may not want. Did you see any extra cost there?
I think the cleanest approach might be a strategy there that the user can configure for building the template. Any opinions?
@Jeffwan the extra nodes will be scaled down again very quickly, so the cost impact is minimal. Not using a minimal node can be even worse cost-wise, because the autoscaler can end up in a loop where it keeps bringing up nodes that can't fit a pod that's bigger than the minimum node.
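A minimal sketch of the check described above (not CA's actual code; the instance names and vCPU counts are illustrative):

```go
// Sketch: why the scale-up feasibility check should use the smallest instance
// in the Mixed Instances Policy. Not CA's actual code; sizes are illustrative.
package main

import "fmt"

type instance struct {
	name string
	cpu  int // vCPUs
}

func main() {
	policy := []instance{{"c5.2xlarge", 8}, {"c5.xlarge", 4}, {"c5.large", 2}}
	podCPU := 3 // a pod request larger than the smallest instance type

	// Template based on the first/largest type: the pod appears to fit, so CA
	// scales the ASG up. If AWS delivers a c5.large, the pod stays Pending and
	// CA scales up again, looping forever.
	first := policy[0]
	fmt.Printf("template %s: pod fits = %v (actual node may still be too small)\n",
		first.name, podCPU <= first.cpu)

	// Template based on the smallest type: the pod is rejected up front, so CA
	// never picks this node group for it and no loop occurs.
	smallest := policy[len(policy)-1]
	fmt.Printf("template %s: pod fits = %v\n", smallest.name, podCPU <= smallest.cpu)
}
```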
@Jeffwan we are still open to having a dedicated session on this topic
Another problem with "overshooting", besides additional cost, is instability of workloads. We have a customer that brings up several machine-learning-type jobs at once. They all get scheduled, but since CA heavily overshoots, a lot of the spun-up nodes are subsequently terminated, leaving a lot of jobs unfinished. "Undershooting" would be the more desirable strategy for said customer.
As @aermakov-zalando explains, using the minimum to decide whether an ASG is suitable to host a particular Pod is vital in order not to end up in an endless loop. However, for estimating how many nodes we need, we could use different values, such as the maximum (for drastic undershooting) or, probably better, some sensible average.
For example, if the policy contains [c5.2xlarge, c5.xlarge, c5.large], the estimation of whether a Pod can run at all should be based on c5.large (2 vCPUs). Let's assume each Pod needs 1 vCPU, hence can run on any of those instance types, and we want a total of 100 replicas. That means we need a total of 100 vCPUs. Using the minimum approach (c5.large, 2 vCPUs), CA would scale the ASG to 50. Since AWS would bring up the bigger instance types as well, that would probably be way too many. If we instead based the estimation on c5.xlarge (4 vCPUs) or c5.2xlarge (8 vCPUs), we would bring up 25 or 13 nodes respectively in the first scaling iteration. This might be closer to the final value we want and avoids additional cost and instability.
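A quick sketch of that arithmetic (assuming 2/4/8 vCPUs for c5.large/xlarge/2xlarge and 100 pods requesting 1 vCPU each):

```go
// Node-count estimation for 100 pods x 1 vCPU under different template choices.
package main

import (
	"fmt"
	"math"
)

// nodesNeeded returns how many nodes of the given vCPU size cover the total demand.
func nodesNeeded(totalVCPU, nodeVCPU float64) int {
	return int(math.Ceil(totalVCPU / nodeVCPU))
}

func main() {
	totalVCPU := 100.0 // 100 replicas * 1 vCPU each

	fmt.Println("minimum, c5.large (2 vCPUs):  ", nodesNeeded(totalVCPU, 2)) // 50
	fmt.Println("average, c5.xlarge (4 vCPUs): ", nodesNeeded(totalVCPU, 4)) // 25
	fmt.Println("maximum, c5.2xlarge (8 vCPUs):", nodesNeeded(totalVCPU, 8)) // 13
}
```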
What happens if you have multiple ASGs and use the least-waste expander? For example:
- ASG 1: c5.large, c5a.large, c5d.large, c5ad.large, c5dn.large, c5n.large, c4.large
- ASG 2: c5.xlarge, c5a.xlarge, c5d.xlarge, c5ad.xlarge, c5dn.xlarge, c5n.xlarge, c4.xlarge
- ASG 3: c5.2xlarge, c5a.2xlarge, c5d.2xlarge, c5ad.2xlarge, c5dn.2xlarge, c5n.2xlarge, c4.2xlarge
This way, you're diversifying each ASG to tap into more Spot capacity pools (which is the purpose here; bonus points if you use the capacity-optimized allocation strategy). While I'm not an expert in the least-waste expander code and don't have exhaustive testing experience, I have seen two customer cases with this ASG setup where the ASG with the smallest instance types was selected every time a scaling activity could fit on the smallest types, and the ASG did not fail to scale due to capacity issues (which should be rare when diversified this way and at the scale in your example).
So I think two approaches can be identified here: the first is having multiple instance sizes in a single ASG and having CA select according to the smallest instance type; the second is this multi-ASG setup with same-sized instance types in each ASG. Comparing the approaches, there are a couple of concerns I have with the latter:
I still think this is a really viable approach, and I've seen AWS users adopt it very successfully.
@ranshn Unfortunately as you've said failover doesn't work that well, especially with the AWS cloud provider. It's still somewhat feasible with a small number of node pools (and even then we've had to fix some things in our fork), but if we can avoid it it's definitely the better choice. Additionally, AWS has way more visibility into Spot availability in general, so they can avoid giving out instances that will be immediately taken away 5 minutes later, which would be the case if we try to do this on our side.
A few thoughts:
@ranshn is absolutely correct in
> What happens if you have multiple ASGs and use the least-waste expander? For example:
> - ASG 1: c5.large, c5a.large, c5d.large, c5ad.large, c5dn.large, c5n.large, c4.large
> - ASG 2: c5.xlarge, c5a.xlarge, c5d.xlarge, c5ad.xlarge, c5dn.xlarge, c5n.xlarge, c4.xlarge
> - ASG 3: c5.2xlarge, c5a.2xlarge, c5d.2xlarge, c5ad.2xlarge, c5dn.2xlarge, c5n.2xlarge, c4.2xlarge
It's considered a best practice to separate Spot capacity into ASGs of similarly sized instance types. This does result in multiple ASGs, which can run into scalability limitations. If those limitations become a blocker, I can see combining them, but as many of you have noted, doing so violates a core CA assumption and results in side effects.
I'll also echo the statement that using the minimum value for each resource dimension is essential to avoid the case of bringing up extra nodes that can't actually fit the pods. Thus, I'm weakly in the camp that this is the right option in all cases.
With regard to the overshooting comment:
> Another problem with "overshooting", besides additional cost, is instability of workloads. We have a customer that brings up several machine-learning-type jobs at once. They all get scheduled, but since CA heavily overshoots, a lot of the spun-up nodes are subsequently terminated, leaving a lot of jobs unfinished. "Undershooting" would be the more desirable strategy for said customer.
It's possible to annotate these pods with cluster-autoscaler.kubernetes.io/safe-to-evict=false, which you should be doing regardless to help protect your expensive-to-evict workloads.
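A minimal sketch (assuming the k8s.io/api types; the Deployment itself is just a placeholder) of setting that annotation on a pod template:

```go
// Mark a Deployment's pods so Cluster Autoscaler won't evict them during scale-down.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func markSafeToEvictFalse(d *appsv1.Deployment) {
	if d.Spec.Template.Annotations == nil {
		d.Spec.Template.Annotations = map[string]string{}
	}
	// This is a pod annotation (set on the pod template), not a label.
	d.Spec.Template.Annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] = "false"
}

func main() {
	d := &appsv1.Deployment{}
	markSafeToEvictFalse(d)
	fmt.Println(d.Spec.Template.Annotations)
}
```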
Given this, the only downside I see with the minimum resource approach is the potential for extra cost if a larger node is spun up. This side effect is fundamental to the decision of using MixedInstancePolicies with instance types of different sizes, so I see this as a (necessary) acceptable tradeoff.
@Jeffwan do you have any other concerns?
> @ranshn Unfortunately as you've said failover doesn't work that well, especially with the AWS cloud provider. It's still somewhat feasible with a small number of node pools (and even then we've had to fix some things in our fork), but if we can avoid it it's definitely the better choice.
I think this is really workload dependent. Waiting 10 minutes for an ASG to fail to fulfill capacity before moving on to the next one might not work for an e-commerce website, where being under-provisioned could be costing money. But I was replying to a user describing provisioning nodes for machine learning training jobs (AFAIU), and I believe that in many cases these time-insensitive, batch/job-style workloads should be OK with possibly waiting for this failover in rare cases.
> Additionally, AWS has way more visibility into Spot availability in general, so they can avoid giving out instances that will be immediately taken away 5 minutes later, which would be the case if we try to do this on our side.
I might be misunderstanding this part of your comment, but as it stands, what you described here is not how Spot, ASG, or the capacity-optimized allocation strategy work. An ASG configured with multiple instance types and the capacity-optimized allocation strategy will work to provision instances from the most-available capacity pools, but it can't do anything to avoid provisioning instances that will be taken away X minutes later; there's no guaranteed run time for Spot in any way. I'd love to understand your claim about why an ASG with multiple instance sizes is better in that regard than multiple same-sized diversified ASGs.
@ranshn According to multiple AWS support articles, for example this one, the capacity-optimized strategy will provide instances from pools with higher capacity availability, and thus the probability of interruption will be lower. It even says that right there: "This works well for workloads such as big data and analytics, image and media rendering, machine learning, and high performance computing that may have a higher cost of interruption. By offering the possibility of fewer interruptions, the capacity-optimized strategy can lower the overall cost of your workload."
This is true, but it's not how you described it: "avoid giving out instances that will be immediately taken away 5 minutes later". All I'm saying is, for the sake of accuracy, that there's no guaranteed minimum run time, which is what could have been understood from your comment. I'm merely addressing the way you phrased the capabilities of this feature. Also, I still don't get why, in the context of using capacity-optimized, an ASG with different instance sizes would be better than multiple ASGs that each have diversified same-sized instance types.
Because if you have multiple Spot pools, where some are close to unavailable and others aren't, it's possible that CA will always choose the problematic ones and will keep getting instances that are immediately terminated. With a single capacity-optimized ASG, AWS will presumably provide better instances in this case.
I think this is the first time that decreasing Spot interruptions with capacity-optimized has been discussed in this thread, and I think we're still not on the same wavelength. Let me take a step back and clarify what I mean. When you say that CA chooses problematic pools, I suspect you're referring to single-instance-type ASGs, which is definitely not what I'm suggesting, and I agree that if we point CA to multiple ASGs that each have a single instance type, we could be at risk of a bad experience with Spot Instances because we're not leveraging the capacity-optimized allocation strategy.
What I'm actually suggesting is multiple ASGs, each with different instance types of the same size (e.g. c5.xlarge, c5a.xlarge, c5d.xlarge, etc.). Then, on each scaling activity, CA chooses one of the ASGs (according to the expander), and within that ASG, the scaling activity happens from the most-available pool. We've seen customers cut down their interruptions like this while still achieving their desired scale.
So now the comparison that I'm actually making is:
Yes, it will obviously work better than single-type ASGs. However, because of numerous issues with failover, especially in upstream CA and the AWS cloud provider, it's still possible that a shortage of instances will leave pods in the Pending state for a significant amount of time (half an hour, maybe more, depending on CA configuration and the setup). Since we've already seen severe instance shortages across multiple types, even with on-demand, in the region we primarily use, I don't think this is that unlikely.
@fejta-bot: Closing this issue.
When encountering an EC2 Auto Scaling Group that contains a Mixed Instances Policy, the AWS cloud provider for Cluster Autoscaler uses the first instance type in the policy to determine the size of the instance that the Auto Scaling Group will deliver when scaling out. In #2057 it was pointed out that if the estimation is too large, and the actual instance provisioned is smaller in some dimension than the first instance type in the list, it is possible for the pending pod not to fit on it.
A simple solution to this problem, also noted in #2057, is to walk through the list of instance types and build a template whose dimensions (CPU, memory, etc.) are each the minimum of the corresponding dimension across all of the types.
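A minimal sketch of that approach (not the cloud provider's actual code; the instance sizes below are illustrative):

```go
// Build a template node from the element-wise minimum of each resource
// dimension across all instance types in a Mixed Instances Policy.
package main

import "fmt"

type instanceType struct {
	Name     string
	VCPU     int64
	MemoryMB int64
}

// minimumTemplate assumes a non-empty list and takes the minimum per dimension,
// so the result may not match any single real instance type.
func minimumTemplate(types []instanceType) instanceType {
	tmpl := instanceType{Name: "template", VCPU: types[0].VCPU, MemoryMB: types[0].MemoryMB}
	for _, t := range types[1:] {
		if t.VCPU < tmpl.VCPU {
			tmpl.VCPU = t.VCPU
		}
		if t.MemoryMB < tmpl.MemoryMB {
			tmpl.MemoryMB = t.MemoryMB
		}
	}
	return tmpl
}

func main() {
	policy := []instanceType{
		{"c5.xlarge", 4, 8192},
		{"c5.large", 2, 4096},
		{"r5.large", 2, 16384}, // a memory-optimized type in the mix
	}
	// Prints {Name:template VCPU:2 MemoryMB:4096}: a pod that fits this template
	// fits whichever type the ASG actually launches.
	fmt.Printf("%+v\n", minimumTemplate(policy))
}
```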
One potential drawback of this approach is that it can cause CA to overestimate the number of instances to provision. For example, if the policy contains [c5.xlarge, c5.large] and the estimated node capacity is based on c5.large, then CA could provision two c5.xlarge instances where only one is needed. This has a cost impact on the customer, albeit a temporary one, because the extra instance will eventually be terminated for idleness.
We have an existing method to override the estimator by applying tags to the Auto Scaling Group. For example, applying the tag k8s.io/cluster-autoscaler/node-template/resources/cpu or k8s.io/cluster-autoscaler/node-template/resources/memory makes the Autoscaler use those values instead of relying on auto-discovery. This method does require that the cluster administrator remember to update the tags if they modify the Mixed Instances Policy to add a smaller instance type.
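For completeness, a minimal sketch of applying those override tags programmatically with aws-sdk-go (v1); the ASG name and the resource values here are just examples:

```go
// Apply Cluster Autoscaler node-template override tags to an ASG.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Hypothetical ASG name; the tag values should reflect the smallest
	// instance type in the Mixed Instances Policy.
	asgName := "my-mixed-instances-asg"

	_, err := svc.CreateOrUpdateTags(&autoscaling.CreateOrUpdateTagsInput{
		Tags: []*autoscaling.Tag{
			{
				ResourceId:        aws.String(asgName),
				ResourceType:      aws.String("auto-scaling-group"),
				Key:               aws.String("k8s.io/cluster-autoscaler/node-template/resources/cpu"),
				Value:             aws.String("2"),
				PropagateAtLaunch: aws.Bool(false),
			},
			{
				ResourceId:        aws.String(asgName),
				ResourceType:      aws.String("auto-scaling-group"),
				Key:               aws.String("k8s.io/cluster-autoscaler/node-template/resources/memory"),
				Value:             aws.String("4Gi"),
				PropagateAtLaunch: aws.Bool(false),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("node-template override tags applied to", asgName)
}
```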
I think it makes sense to revisit the question of whether to change the existing estimation algorithm.
On the one hand, if we leave it alone, we continue to bear the risk that an instance too small to fit the pending pod could be provisioned. If this happens, it can take over 15 minutes (using current defaults) for CA to determine that the pod is still waiting and take some other action. This delay can be unacceptable to many customers.
On the other hand, if we change the estimation algorithm to minimum-size, then a customer runs the risk of over-shooting and provisioning too many instances, which can incur some needless cost.
The third option is to make the behavior selectable by the user, either via a command-line argument or an environment variable. If we choose this, we must agree on a sensible default.
I would argue that for most customers, guaranteeing pod fit and avoiding scheduling delays is more important than potential overspend. But I would like to hear the community's opinion here.