kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Cluster Autoscaler support for AWS EC2 attribute-based instance selection #5580

Closed youwalther65 closed 1 month ago

youwalther65 commented 1 year ago

Which component are you using?: Cluster Autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.: AWS EC2 has a rich set of instance types. AWS attribute-based instance selection, described here, provides an easy way to specify instance selection for an Auto Scaling Group by providing, for example, the required number of vCPUs and amount of memory. The following is a Terraform example using this in an EKS module self-managed node group:

    smng-mixed = {
      name = "smng-mixed"
…
      use_mixed_instances_policy = true
      mixed_instances_policy = {
        instances_distribution = {
          on_demand_base_capacity                  = 0
          on_demand_percentage_above_base_capacity = 0
          spot_allocation_strategy                 = "price-capacity-optimized"
          # SpotInstancePools option is only available with the lowest-price allocation strategy
          #spot_instance_pools                      = 2
        }

        # does not work with Cluster Autoscaler because it can't build a proper template node :-(
        # this is a list so commas are mandatory
        override = [
          {
            # attribute-based instance selection
            # this is a map so commas are optional
            instance_requirements = {
                vcpu_count = {
                  min = 4
                  max = 4
                },
                memory_mib = {
                  min = 16384
                  max = 16384
                },
                burstable_performance = "excluded",
                excluded_instance_types = ["d*","g*","x*","z*"],
            }
          },
…

Describe the solution you'd like.: At the moment, Cluster Autoscaler is not able to create a node template and raises the following error in the leader's logs:

$ k logs -n kube-system cluster-autoscaler-aws-cluster-autoscaler-xxx
…
E0308 15:38:05.516606       1 mixed_nodeinfos_processor.go:151] Unable to build proper template node for smng-mixed-2023022810062797600000002d: ASG "smng-mixed-2023022810062797600000002d" uses the unknown EC2 instance type ""
E0308 15:38:05.516615       1 static_autoscaler.go:290] Failed to get node infos for groups: ASG "smng-mixed-2023022810062797600000002d" uses the unknown EC2 instance type ""

Describe any alternative solutions you've considered.: Develop a way to build a proper template node by either:

Additional context.: N/A

bwagner5 commented 1 year ago

This should have been implemented in this PR https://github.com/kubernetes/autoscaler/pull/4588

What version of CAS are you using?

bwagner5 commented 1 year ago

Linking this issue, since the feature appears to be working in @bpineau's case based on the PR's testing:

https://github.com/kubernetes/autoscaler/pull/5550

youwalther65 commented 1 year ago

@bwagner5 Looking at the Terraform code, the instance requirements are in the ASG LT override, not in the LT itself. Could this be the reason? This is the easy way to use the EKS module's self-managed node group. If only the LT is queried, then one has to use either a custom LT or AWS provider resources instead.

bwagner5 commented 1 year ago

It should work for both in the LT and as an LT ASG Override. Are you able to try it with an LT instead of an LT override though just to see?
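
One way to check where the instance requirements actually live is to query the LT directly. A minimal sketch, assuming the launch template ID shown in the ASG output later in this thread; the query returns null when the requirements exist only in the ASG override:

$ # Show InstanceRequirements stored in the launch template itself
$ aws ec2 describe-launch-template-versions \
    --launch-template-id lt-090929890da8f991b \
    --versions 1 \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.InstanceRequirements'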

youwalther65 commented 1 year ago

@bwagner5 I saw that your PR was merged into CAS 1.25 and most probably was not backported to older versions of CAS; can you please confirm? I just checked, and it seems I still had the v1.24.0 image of CAS (I only recently switched to EKS 1.25 after its release). Now I have switched to the CAS image v1.25.0 and the error is no longer visible.

Here is the data:

Latest Helm chart:

$ helm list -n kube-system | grep cluster-autoscaler
cluster-autoscaler              kube-system     4               2023-03-10 07:26:44.405431457 +0000 UTC deployed        cluster-autoscaler-9.26.0                      1.24.0

Image:

$ k get deploy -n kube-system cluster-autoscaler-aws-cluster-autoscaler -o yaml |  yq e '.spec.template.spec.containers[0].image'
registry.k8s.io/autoscaling/cluster-autoscaler:v1.25.0

EKS version:

$ k version --short
Server Version: v1.25.6-eks-48e63af

ASG info:

$ aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names smng-mixed-2023022810062797600000002d
{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "smng-mixed-2023022810062797600000002d",
            "AutoScalingGroupARN": "arn:aws:autoscaling:eu-west-1:<redacted>:autoScalingGroup:<redacted>:autoScalingGroupName/smng-mixed-2023022810062797600000002d",
            "MixedInstancesPolicy": {
                "LaunchTemplate": {
                    "LaunchTemplateSpecification": {
                        "LaunchTemplateId": "lt-090929890da8f991b",
                        "LaunchTemplateName": "smng-mixed-20230228100627260600000022",
                        "Version": "1"
                    },
                    "Overrides": [
                        {
                            "InstanceRequirements": {
                                "VCpuCount": {
                                    "Min": 4,
                                    "Max": 4
                                },
                                "MemoryMiB": {
                                    "Min": 16384,
                                    "Max": 16384
                                },
                                "ExcludedInstanceTypes": [
                                    "g*",
                                    "d*",
                                    "z*",
                                    "x*"
                                ],
                                "BurstablePerformance": "excluded"
                            }
                        }
                    ]
                },
...
bwagner5 commented 1 year ago

Yes, that is correct.

spr-mweber3 commented 1 year ago

Unfortunately, I'm running into the same issue, even with the latest chart version and the latest 1.26.1 release of the autoscaler. I did upgrade from 1.24.0 to see if the problem is gone now, but unfortunately that doesn't seem to be the case.

In my case I'm also using attribute-based selection of EC2 instance types.

static_autoscaler.go:290] Failed to get node infos for groups: ASG "eks1-euc1-stg-etc" uses the unknown EC2 instance type ""
mixed_nodeinfos_processor.go:151] Unable to build proper template node for ...

This issue only occurs if the ASG is scaled to 0 when the autoscaler is starting up. As soon as I scale up to 1 and restart the autoscaler, it will work.
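
A minimal sketch of that workaround, assuming the ASG name from the error above and the deployment name used elsewhere in this thread:

$ # Temporarily scale the ASG to 1 so CAS can build the template from a live node
$ aws autoscaling set-desired-capacity \
    --auto-scaling-group-name eks1-euc1-stg-etc \
    --desired-capacity 1
$ # Restart CAS so it picks up the now-populated node group
$ kubectl -n kube-system rollout restart deployment/cluster-autoscaler-aws-cluster-autoscaler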

Someone else also raised a question here: https://devops.stackexchange.com/questions/16833/cluster-autoscaler-crash-unable-to-build-proper-template-node

youwalther65 commented 1 year ago

@spr-mweber3 It worked for me even on 1.25.0. I used a self-managed node group with taints and tolerations and added those as ASG tags, as required for scale-from-0. For managed node groups, CAS just needs the eks:DescribeNodegroup IAM permission to recognize labels and taints.
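
For reference, a sketch of such scale-from-0 tags with hypothetical label and taint values (the node-template tag keys are the ones documented in the CAS AWS provider README):

$ # Tag the ASG so CAS can infer labels and taints while the group is at 0 nodes
$ aws autoscaling create-or-update-tags --tags \
    "ResourceId=smng-mixed-2023022810062797600000002d,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/workload,Value=batch,PropagateAtLaunch=false" \
    "ResourceId=smng-mixed-2023022810062797600000002d,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/dedicated,Value=batch:NoSchedule,PropagateAtLaunch=false"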

CAS leader log excerpt:
...
I0315 12:57:46.750288       1 expiration_cache.go:103] Entry smng-mixed-2023022810062797600000002d: {name:smng-mixed-2023022810062797600000002d instanceType:m4.xlarge} has expired
...
I0315 13:04:11.967399       1 scale_up.go:477] Best option to resize: smng-mixed-2023022810062797600000002d
I0315 13:04:11.967414       1 scale_up.go:481] Estimated 1 nodes needed in smng-mixed-2023022810062797600000002d
I0315 13:04:11.967440       1 scale_up.go:601] Final scale-up plan: [{smng-mixed-2023022810062797600000002d 0->1 (max: 3)}]
I0315 13:04:11.967461       1 scale_up.go:700] Scale-up: setting group smng-mixed-2023022810062797600000002d size to 1
I0315 13:04:11.967485       1 auto_scaling_groups.go:248] Setting asg smng-mixed-2023022810062797600000002d size to 1
I0315 13:04:11.967780       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"12800f36-344b-4f5e-8e32-1f39380d60db", APIVersion:"v1", ResourceVersion:"21128420", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group smng-mixed-2023022810062797600000002d size to 1 instead of 0 (max: 3)
I0315 13:04:12.118636       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"12800f36-344b-4f5e-8e32-1f39380d60db", APIVersion:"v1", ResourceVersion:"21128420", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group smng-mixed-2023022810062797600000002d size set to 1 instead of 0 (max: 3)
I0315 13:04:12.125654       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"inflate-multi-az-system-comp-6b67d44f55-86jnd", UID:"344075cb-9762-4594-9f73-e9a3171b53f7", APIVersion:"v1", ResourceVersion:"21128440", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{smng-mixed-2023022810062797600000002d 0->1 (max: 3)}]
I0315 13:04:12.133217       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"inflate-multi-az-system-comp-6b67d44f55-gcxgr", UID:"64c3a596-3b19-45e1-91c2-26a049d64473", APIVersion:"v1", ResourceVersion:"21128436", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{smng-mixed-2023022810062797600000002d 0->1 (max: 3)}]

The SMNG uses the Terraform code I showed in the initial comment.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Joldnine commented 9 months ago

I have the same issue. Is there any update on this thread? Is it resolved in later releases?

Shubham82 commented 6 months ago

/remove-lifecycle rotten

Shubham82 commented 6 months ago

/area provider/aws

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/5580#issuecomment-2154647342):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.