kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Provisioner falls back to lower weight when near limit #734

Status: Closed (2 weeks ago). Opened by FernandoMiguel.

FernandoMiguel commented 1 year ago

Version

Karpenter Version: v0.20.0
Kubernetes Version: v1.24.0
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.7-eks-fb459a0", GitCommit:"c240013134c03a740781ffa1436ba2688b50b494", GitTreeState:"clean", BuildDate:"2022-10-24T20:36:26Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}

Expected Behavior

I have two provisioners: one that allows a smaller set of instance types, and one that is much broader. The first one has a weight of 50.

I've just spun up 20 pods/nodes and noticed Karpenter was using both provisioners, even though it picked instances that would have fit in the first one. I'm trying to understand why that happened.

Should the weight be bigger?
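
For context, only the first provisioner sets a weight; the full specs are under Resource Specs and Logs below, but a trimmed sketch of the relevant part looks like this:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: fernando-in-bull-9e58ab3c10185819   # the smaller, preferred pool
spec:
  weight: 50         # higher weight is preferred; the second provisioner sets no weight
  limits:
    resources:
      cpu: "100"     # both provisioners carry the same 100-cpu limit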

Actual Behavior

N/A

Steps to Reproduce the Problem

N/A

Resource Specs and Logs

sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.453Z   DEBUG   controller.provisioner  96 out of 599 instance types were excluded because they would breach provisioner limits {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.466Z   DEBUG   controller.provisioner  101 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.480Z   DEBUG   controller.provisioner  164 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.488Z   DEBUG   controller.provisioner  220 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.495Z   DEBUG   controller.provisioner  287 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.500Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.503Z   DEBUG   controller.provisioner  43 out of 599 instance types were excluded because they would breach provisioner limits {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.513Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.516Z   DEBUG   controller.provisioner  96 out of 599 instance types were excluded because they would breach provisioner limits {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.544Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.547Z   DEBUG   controller.provisioner  101 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.568Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.571Z   DEBUG   controller.provisioner  164 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.591Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.600Z   DEBUG   controller.provisioner  215 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.612Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.615Z   DEBUG   controller.provisioner  287 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.625Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.629Z   DEBUG   controller.provisioner  432 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.644Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.646Z   DEBUG   controller.provisioner  relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-zhg2b"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.647Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.649Z   DEBUG   controller.provisioner  relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-lxct6"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.650Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.651Z   DEBUG   controller.provisioner  relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-gz7zp"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.652Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.654Z   DEBUG   controller.provisioner  relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-qhq7w"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.655Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.656Z   DEBUG   controller.provisioner  relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-nfn6p"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.657Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.659Z   DEBUG   controller.provisioner  relaxing soft constraints for pod since it previously failed to schedule, removing: spec.topologySpreadConstraints = {"maxSkew":1,"topologyKey":"topology.kubernetes.io/zone","whenUnsatisfiable":"ScheduleAnyway","labelSelector":{"matchLabels":{"app":"inflate"}}}   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-nc6rn"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.660Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.662Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.664Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.666Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.669Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.673Z   DEBUG   controller.provisioner  506 out of 599 instance types were excluded because they would breach provisioner limits    {"commit": "f60dacd"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z   ERROR   controller.provisioner  Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.k8s.aws/instance-generation In [6 7], karpenter.k8s.aws/instance-category NotIn [a t], karpenter.k8s.aws/instance-memory Exists <130000, karpenter.k8s.aws/instance-cpu Exists <17, kubernetes.io/arch In [amd64 arm64], kubernetes.io/os In [linux], karpenter.k8s.aws/instance-family NotIn [z1d], karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-size NotIn [metal]; all available instance types exceed provisioner limits   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-zhg2b"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z   ERROR   controller.provisioner  Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements karpenter.k8s.aws/instance-family NotIn [z1d], karpenter.k8s.aws/instance-memory Exists <130000, karpenter.k8s.aws/instance-category NotIn [a t], karpenter.k8s.aws/instance-generation In [6 7], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-size NotIn [metal], karpenter.k8s.aws/instance-hypervisor In [nitro], kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand spot], kubernetes.io/arch In [amd64 arm64], karpenter.k8s.aws/instance-cpu Exists <17; all available instance types exceed provisioner limits   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-lxct6"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z   ERROR   controller.provisioner  Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-category NotIn [a t], karpenter.k8s.aws/instance-size NotIn [metal], karpenter.k8s.aws/instance-generation In [6 7], karpenter.k8s.aws/instance-cpu Exists <17, karpenter.k8s.aws/instance-family NotIn [z1d], karpenter.k8s.aws/instance-memory Exists <130000, karpenter.sh/capacity-type In [on-demand spot], karpenter.k8s.aws/instance-hypervisor In [nitro], kubernetes.io/os In [linux], kubernetes.io/arch In [amd64 arm64], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819]; all available instance types exceed provisioner limits   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-gz7zp"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z   ERROR   controller.provisioner  Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements kubernetes.io/os In [linux], karpenter.k8s.aws/instance-cpu Exists <17, karpenter.k8s.aws/instance-memory Exists <130000, topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-generation In [6 7], karpenter.k8s.aws/instance-family NotIn [z1d], kubernetes.io/arch In [amd64 arm64], karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.k8s.aws/instance-category NotIn [a t], karpenter.k8s.aws/instance-size NotIn [metal]; all available instance types exceed provisioner limits   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-qhq7w"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z   ERROR   controller.provisioner  Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements kubernetes.io/os In [linux], karpenter.k8s.aws/instance-family NotIn [z1d], karpenter.k8s.aws/instance-size NotIn [metal], karpenter.k8s.aws/instance-cpu Exists <17, kubernetes.io/arch In [amd64 arm64], karpenter.sh/capacity-type In [on-demand spot], karpenter.k8s.aws/instance-generation In [6 7], karpenter.k8s.aws/instance-category NotIn [a t], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], karpenter.k8s.aws/instance-memory Exists <130000, topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-hypervisor In [nitro]; all available instance types exceed provisioner limits   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-nfn6p"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z   ERROR   controller.provisioner  Could not schedule pod, incompatible with provisioner "fernando-in-bull-9e58ab3c10185819", no instance type satisfied resources {"cpu":"3","pods":"1"} and requirements karpenter.k8s.aws/instance-memory Exists <130000, kubernetes.io/arch In [amd64 arm64], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.sh/provisioner-name In [fernando-in-bull-9e58ab3c10185819], karpenter.k8s.aws/instance-category NotIn [a t], topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c us-east-1d us-east-1f], karpenter.k8s.aws/instance-size NotIn [metal], karpenter.k8s.aws/instance-generation In [6 7], karpenter.sh/capacity-type In [on-demand spot], karpenter.k8s.aws/instance-cpu Exists <17, karpenter.k8s.aws/instance-family NotIn [z1d], kubernetes.io/os In [linux]; all available instance types exceed provisioner limits   {"commit": "f60dacd", "pod": "pause/inflate-6886cd9c5f-nc6rn"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z   INFO    controller.provisioner  found provisionable pod(s)  {"commit": "f60dacd", "pods": 18}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.675Z   INFO    controller.provisioner  computed new node(s) to fit pod(s)  {"commit": "f60dacd", "nodes": 12, "pods": 12}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.676Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, c6gn.xlarge, m6g.xlarge, m6i.xlarge, c6a.xlarge and 54 other(s)    {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.678Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types m6g.xlarge, m6i.xlarge, m6id.xlarge, c6gd.xlarge, r6gd.xlarge and 41 other(s)   {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.679Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6g.xlarge, m6i.xlarge, m6g.xlarge, m6id.xlarge, c7g.xlarge and 41 other(s) {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.680Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6g.xlarge, m6i.xlarge, m6g.xlarge, m6id.xlarge, c7g.xlarge and 41 other(s) {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.682Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, m6g.xlarge, m6i.xlarge, m6in.xlarge, m6idn.xlarge and 54 other(s)  {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.687Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, m5.xlarge, m5n.xlarge, c6gn.xlarge, m6g.xlarge and 132 other(s)    {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.690Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types m5n.xlarge, m5.xlarge, m6g.xlarge, m6i.xlarge, r5.xlarge and 119 other(s)   {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.694Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, c6gd.xlarge, c5a.xlarge, m5n.xlarge, c6gn.xlarge and 113 other(s)  {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.698Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, c6gd.xlarge, c5a.xlarge, m5n.xlarge, c6gn.xlarge and 113 other(s)  {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.702Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6in.xlarge, m5.xlarge, m5n.xlarge, c6gn.xlarge, m6g.xlarge and 132 other(s)    {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.705Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types c6g.xlarge, m5n.xlarge, m5.xlarge, m6i.xlarge, m6g.xlarge and 116 other(s)  {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.706Z   INFO    controller.provisioner  launching node with 1 pods requesting {"cpu":"3125m","pods":"4"} from types m5n.xlarge, m6g.xlarge, m5.xlarge, m6i.xlarge, r5.xlarge and 41 other(s)    {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.949Z   DEBUG   controller.provisioner.cloudprovider    discovered new ami  {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819", "ami": "ami-0607aea1f8780fc6c", "query": "/aws/service/bottlerocket/aws-k8s-1.24/arm64/latest/image_id"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:40.989Z   DEBUG   controller.provisioner.cloudprovider    discovered launch template  {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819", "launch-template-name": "Karpenter-fernando-in-bull-11069147462930180006"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:41.139Z   DEBUG   controller.provisioner.cloudprovider    created launch template {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819", "launch-template-name": "Karpenter-fernando-in-bull-17861520481371219300", "launch-template-id": "lt-08017f24bc95d402a"}
sre-karpenter-system/karpenter-6b7c8c86d6-rz5ch[controller]: 2022-12-23T14:29:41.297Z   DEBUG   controller.provisioner.cloudprovider    created launch template {"commit": "f60dacd", "provisioner": "fernando-in-bull-9e58ab3c10185819-bigger-hw-pool", "launch-template-name": "Karpenter-fernando-in-bull-3445210251536975458", "launch-template-id": "lt-0d54ee3ae6743def9"}

p1 (the higher-weighted provisioner):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  creationTimestamp: "2022-12-23T11:58:09Z"
  generation: 1
  name: fernando-in-bull-9e58ab3c10185819
  resourceVersion: "44773"
  uid: 8caa2779-65ca-405d-b2a2-005c9a9eaab2
spec:
  consolidation:
    enabled: true
  limits:
    resources:
      cpu: "100"
  providerRef:
    name: fernando-in-bull-9e58ab3c10185819
  requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-1a
    - us-east-1b
    - us-east-1c
    - us-east-1d
    - us-east-1f
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
    - on-demand
  - key: karpenter.k8s.aws/instance-category
    operator: NotIn
    values:
    - a
    - t
  - key: karpenter.k8s.aws/instance-family
    operator: NotIn
    values:
    - z1d
  - key: karpenter.k8s.aws/instance-size
    operator: NotIn
    values:
    - metal
  - key: karpenter.k8s.aws/instance-hypervisor
    operator: In
    values:
    - nitro
  - key: karpenter.k8s.aws/instance-generation
    operator: In
    values:
    - "6"
    - "7"
  - key: karpenter.k8s.aws/instance-cpu
    operator: Lt
    values:
    - "17"
  - key: karpenter.k8s.aws/instance-memory
    operator: Lt
    values:
    - "130000"
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
    - arm64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  ttlSecondsUntilExpired: 2592000
  weight: 50
status:
  resources:
    attachable-volumes-aws-ebs: "507"
    cpu: "58"
    ephemeral-storage: 1341467764Ki
    memory: 192902248Ki
    pods: "1089"
    vpc.amazonaws.com/pod-eni: "99"

p2 (the bigger pool, no weight set):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  creationTimestamp: "2022-12-23T11:58:09Z"
  generation: 1
  name: fernando-in-bull-9e58ab3c10185819-bigger-hw-pool
  resourceVersion: "44888"
  uid: 7c8d8ddf-9c0c-42db-81bc-a5e13cb5028a
spec:
  consolidation:
    enabled: true
  limits:
    resources:
      cpu: "100"
  providerRef:
    name: fernando-in-bull-9e58ab3c10185819-bigger-hw-pool
  requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-1a
    - us-east-1b
    - us-east-1c
    - us-east-1d
    - us-east-1f
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
    - on-demand
  - key: karpenter.k8s.aws/instance-category
    operator: NotIn
    values:
    - a
    - t
  - key: karpenter.k8s.aws/instance-family
    operator: NotIn
    values:
    - z1d
  - key: karpenter.k8s.aws/instance-size
    operator: NotIn
    values:
    - metal
  - key: karpenter.k8s.aws/instance-hypervisor
    operator: In
    values:
    - nitro
  - key: karpenter.k8s.aws/instance-generation
    operator: NotIn
    values:
    - "1"
    - "2"
  - key: karpenter.k8s.aws/instance-cpu
    operator: Lt
    values:
    - "17"
  - key: karpenter.k8s.aws/instance-memory
    operator: Lt
    values:
    - "130000"
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
    - arm64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  ttlSecondsUntilExpired: 2592000
status:
  resources:
    attachable-volumes-aws-ebs: "312"
    cpu: "32"
    ephemeral-storage: 825518624Ki
    memory: 63705340Ki
    pods: "620"
    vpc.amazonaws.com/pod-eni: "90"

pod (the inflate Deployment):

kind: Deployment
apiVersion: apps/v1
metadata:
  name: inflate
  namespace: pause
  uid: c713003b-cb1c-48de-a495-c8b8e955321f
  resourceVersion: "45492"
  generation: 2
  creationTimestamp: "2022-12-23T14:27:17Z"
  labels:
    app: inflate
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 20
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: inflate
    spec:
      containers:
      - name: inflate
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
        resources:
          requests:
            cpu: "3"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 0
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      securityContext: {}
      schedulerName: default-scheduler
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: inflate
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: inflate
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 2
  replicas: 20
  updatedReplicas: 20
  readyReplicas: 20
  availableReplicas: 20
  conditions:
  - type: Progressing
    status: "True"
    lastUpdateTime: "2022-12-23T14:28:22Z"
    lastTransitionTime: "2022-12-23T14:27:17Z"
    reason: NewReplicaSetAvailable
    message: ReplicaSet "inflate-6886cd9c5f" has successfully progressed.
  - type: Available
    status: "True"
    lastUpdateTime: "2022-12-23T14:31:04Z"
    lastTransitionTime: "2022-12-23T14:31:04Z"
    reason: MinimumReplicasAvailable
    message: Deployment has minimum availability.



jonathan-innis commented 1 year ago

Applying this on my own cluster, it looks like the higher-weighted provisioner is too restrictive on instance types. If I add the following requirement to the first provisioner, I can't provision with it at all:

  requirements:
  - key: topology.kubernetes.io/zone
    operator: NotIn
    values:
    - us-east-1a
    - us-east-1d

implying that the instance types available through the higher-weighted provisioner exist only in these AZs. What's happening with two provisioners is that topology spread causes the scheduler to add a zone requirement drawn from the provisioner's topology domains, so the launched node is pinned to a specific AZ (like us-east-1b). Since the first provisioner has no instance types that satisfy that requirement, scheduling has to move to the lower-weighted provisioner.

I would suggest opening up the higher-weighted provisioner a bit if you are trying to avoid having it fall back to the second provisioner too often.
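
One illustrative way to open it up (not a confirmed fix, just mirroring a requirement the second provisioner already uses) would be to relax the first provisioner's instance-generation requirement from In [6, 7] to the NotIn form:

  - key: karpenter.k8s.aws/instance-generation
    operator: NotIn          # allow everything except generations 1 and 2, as the second provisioner does
    values:
    - "1"
    - "2"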

jonathan-innis commented 1 year ago

From the perspective of the provisioner weighting mechanism, this is performing how I would expect: it's choosing the highest-weighted provisioner in the zones where it can and falling back in the zones where it can't.

FernandoMiguel commented 1 year ago

The 1st one still allows a very large number of instance types, and all the instances spun up from the 2nd could have fit in the 1st. Sure, topology is at play here, but both continue to spin up nodes in all 5 AZs.

And the pod has ScheduleAnyway:

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway

So I remain unsure what difference picking the 2nd provisioner made, @jonathan-innis.

jonathan-innis commented 1 year ago

and the pod has ScheduleAnyway

We still try to satisfy the topology constraint across all provisioners before relaxing it, even if it's a preferred constraint; so the fact that it's a ScheduleAnyway constraint doesn't change whether we fall back between provisioners.

The 1st one still allows a very large number of instance types, and all the instances spun up from the 2nd could have fit in the 1st.

That's a good point. It looks like I may be working under constrained capacity pools right now. Will take another look, since you're right: both provisioners launch with the same instance type in the same AZ in your example.

jonathan-innis commented 1 year ago

My other guess is that we are hitting provisioner limits and falling back to the second provisioner. During scheduling, we assume worst-case among the proposed instance types when subtracting from the limits so that we don't overshoot the limits after launch.

If you bump up the cpu limit or remove it entirely, do you see the same fallback behavior?
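
As a sketch of that experiment, the limit on both provisioners could be raised well above anything the workload can reach (or spec.limits could be omitted entirely to remove the cap):

spec:
  limits:
    resources:
      cpu: "10000"   # raised from "100" for the test; leaving out spec.limits disables the cap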

FernandoMiguel commented 1 year ago

I'll re-try with 10k CPUs... but none of the provisioners ever hit 100 CPUs, according to the status reported on each of them. Even the total of both is less than 100.

jonathan-innis commented 1 year ago

During scheduling, we assume worst-case among the proposed instance types when subtracting from the limits so that we don't overshoot the limits after launch.

but none of the provisioners ever hit 100 CPUs

This means that even if we don't hit the limits after launch, we may still consider a provisioner to have hit its limits prior to launch, since we take the largest possible instance as the representative until we actually know what was launched.
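
A rough illustration with numbers from the specs and logs above (the exact bookkeeping is internal to Karpenter, so treat this as an approximation):

cpu limit per provisioner:    100
worst-case candidate size:     16 vCPU   (instance-cpu Lt "17" allows 16-vCPU types)
actually launched size:         4 vCPU   (the xlarge types in the launch logs)

During scheduling each planned node is charged at the worst case, so simulated usage grows by 16 vCPU per node; the first provisioner can therefore look "full" after only a handful of simulated launches, even though the 4-vCPU nodes that actually launch stay well under the 100-cpu limit.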

FernandoMiguel commented 1 year ago

Interesting to learn.

jonathan-innis commented 1 year ago

@FernandoMiguel Did removing the provisioner limits fix the issue?

FernandoMiguel commented 1 year ago

@jonathan-innis I haven't been able to test in detail, but I haven't seen this happen with the limit set to 10k instead of 100.

jonathan-innis commented 1 year ago

We're picking up a fix for this issue, so I'm going to re-open it for tracking.

jonathan-innis commented 1 year ago

I think the proposed solution here was that we can exit scheduling early when, during capacity simulation, we recognize that one of our provisioners might exceed its limit.

github-actions[bot] commented 8 months ago

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Bryce-Soghigian commented 5 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/karpenter/issues/734#issuecomment-2305110923):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.