kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Mega Issue: Karpenter doesn't support custom resource requests/limits #751

Open prateekkhera opened 2 years ago

prateekkhera commented 2 years ago

Version

Karpenter: v0.10.1

Kubernetes: v1.20.15

Expected Behavior

Karpenter should be able to trigger a scale-up for the pending pod

Actual Behavior

Karpenter isn't able to trigger a scale-up

Steps to Reproduce the Problem

We're using Karpenter on EKS. We have pods that declare a custom resource in their requests/limits: smarter-devices/fuse: 1. Karpenter does not seem to recognize this resource and fails to scale up, so the pod remains in a Pending state.

Resource Specs and Logs

Provisioner spec

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  limits:
    resources:
      cpu: "100"
  provider:
    launchTemplate: xxxxx
    subnetSelector:
      xxxxx: xxxxx
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - m5.large
    - m5.2xlarge
    - m5.4xlarge
    - m5.8xlarge
    - m5.12xlarge
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30
status:
  resources:
    cpu: "32"
    memory: 128830948Ki

pod spec

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fuse-test
  labels:
    app: fuse-test
spec:
  replicas: 1
  selector:
    matchLabels:
      name: fuse-test
  template:
    metadata:
      labels:
        name: fuse-test
    spec:
      containers:
      - name: fuse-test
        image: ubuntu:latest
        ports:
          - containerPort: 8080
            name: web
            protocol: TCP
        securityContext:
          capabilities:
            add:
              - SYS_ADMIN
        resources:
          limits:
            cpu: 32
            memory: 4Gi
            smarter-devices/fuse: 1  # Custom resource
          requests:
            cpu: 32
            memory: 2Gi
            smarter-devices/fuse: 1  # Custom resource
        env:
        - name: S3_BUCKET
          value: test-s3
        - name: S3_REGION
          value: eu-west-1

karpenter controller logs:

controller 2022-06-06T15:59:00.499Z ERROR controller no instance type satisfied resources {"cpu":"32","memory":"2Gi","pods":"1","smarter-devices/fuse":"1"} and requirements kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand], kubernetes.io/hostname In [hostname-placeholder-3403], node.kubernetes.io/instance-type In [m5.12xlarge m5.2xlarge m5.4xlarge m5.8xlarge m5.large], karpenter.sh/provisioner-name In [default], topology.kubernetes.io/zone In [eu-west-1a eu-west-1b], kubernetes.io/arch In [amd64];

njtran commented 2 years ago

Looks like you're running purely into the CPU resources here. I added the feature label as it looks like you're requesting to be able to add custom resources into the ProvisionerSpec.Limits?

ellistarn commented 2 years ago

@njtran , this is the bit:

smarter-devices/fuse: 1 # Custom resource

ellistarn commented 2 years ago

As discussed on slack:

@Todd Neal and I were recently discussing a mechanism to allow users to define extended resources that karpenter isn't aware of. Right now, we are aware of the extended resources on specific EC2 instance types, which is how we binpack them. One option would be to enable users to define a configmap of [{instancetype, provisioner, extendedresource}] that karpenter could use for binpacking.
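A mapping of that shape might look roughly like this (a hypothetical sketch only; the ConfigMap name and key layout are illustrative, not an implemented API):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-extended-resources   # illustrative name, not a real Karpenter ConfigMap
  namespace: karpenter
data:
  mappings: |
    - instanceType: m5.large
      provisioner: default
      extendedResources:
        smarter-devices/fuse: "1"
    - instanceType: m5.2xlarge
      provisioner: default
      extendedResources:
        smarter-devices/fuse: "1"
```

With something like this, binpacking could treat the listed extended resources as part of each instance type's capacity.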

prateekkhera commented 2 years ago

Thanks @ellistarn - the proposed solution looks good. Sorry for asking, but is there any ETA on this? We're currently unable to use Karpenter because of it.

CodeBooster97 commented 2 years ago

I'm having the same issue with vGPU.

parmeet-kumar commented 2 years ago

@ellistarn Hope you are doing well! I encountered the same issue while working with Karpenter, so I wanted to know whether this has been implemented in any existing PR?

ellistarn commented 2 years ago

This isn't currently being worked on -- we're prioritizing consolidation and test/release infrastructure at the moment. If you're interested in picking up this work, check out https://karpenter.sh/v0.13.1/contributing/

universam1 commented 2 years ago

For us this is a blocking issue with Karpenter. Our use case involves fuse and snd devices that are exposed as custom device resources by smarter-device-manager.

As a simpler workaround @ellistarn @tzneal, why not just ignore resources that Karpenter is unaware of? Instead of having to create a ConfigMap as a whitelist, Karpenter could filter down to well-known resources and act upon those, but ignore resources it has no idea of. It can't do anything useful about those anyway...

Taking this error message:

Failed to provision new node, incompatible with provisioner "default", no instance type satisfied resources {....smarter-devices/fuse":"2"} ...

it looks like Karpenter already has the information it needs to distinguish "manageable" resources from those that are not?
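The suggested filtering could be sketched like this in Go (illustrative only; this is not Karpenter's actual code, and the well-known set below is an assumption):

```go
package main

import "fmt"

// wellKnown lists resource names the autoscaler can reason about.
// Illustrative only; Karpenter's real set comes from cloud-provider data.
var wellKnown = map[string]bool{
	"cpu":            true,
	"memory":         true,
	"pods":           true,
	"nvidia.com/gpu": true,
}

// filterRequests drops resource names the autoscaler has no knowledge of,
// so binpacking only considers resources it can actually compare against
// instance-type capacity, instead of failing outright.
func filterRequests(requests map[string]string) map[string]string {
	out := map[string]string{}
	for name, qty := range requests {
		if wellKnown[name] {
			out[name] = qty
		}
	}
	return out
}

func main() {
	pod := map[string]string{
		"cpu":                  "32",
		"memory":               "2Gi",
		"smarter-devices/fuse": "1", // unknown: ignored rather than fatal
	}
	fmt.Println(filterRequests(pod))
}
```

The trade-off is that Karpenter would launch a node and simply trust that a device-plugin daemonset makes the unknown resource appear afterwards.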

ghost commented 1 year ago

I'm having the same issue with hugepages

universam1 commented 1 year ago

https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/3717

james-callahan commented 1 year ago

We also need this, for nitro enclaves.

lzjqsdd commented 1 year ago

We also need this when using the "fuse" device plugin resource. Here is what we ran into and how we're currently working around this issue: #308

bryantbiggs commented 1 year ago

If Karpenter were able to support https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ in its compute scheduling, would that satisfy the different devices listed on this thread?

Note: This is only an alpha feature in 1.27 so still early days - but it does look like the "correct" avenue from a Kubernetes perspective

james-callahan commented 1 year ago

If Karpenter were able to support https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ in its compute scheduling, would that satisfy the different devices listed on this thread?

Note: This is only an alpha feature in 1.27 so still early days - but it does look like the "correct" avenue from a Kubernetes perspective

I think so? In that with some effort all custom resources could be re-written as dynamic resource allocations.

This is probably a good fit for nitro enclaves; but probably a bad fit for e.g. hugepages.

Likely Karpenter will need to gain support for both.

project-administrator commented 10 months ago

We tried enabling hugepages on all nodes with sysctl "vm.nr_hugepages" = "2048" and transparent_hugepage = ["always"].

After this, Karpenter went crazy, spinning up 50 new worker nodes for one of the existing pods. That pod has nothing related to hugepages, just a RAM request of 3GB. Karpenter spins up a new node with 8GB of RAM, then the scheduler is unable to place the pod on the new node (because part of the RAM is reserved for hugepages). Karpenter then spins up another node (again with 8GB of RAM), and once again the scheduler can't place the pod.

It looks like the hugepages Linux option interferes with Karpenter's ability to calculate memory resources properly.

Even without any custom resource requests/limits set, the mere existence of some custom resources can be enough to break Karpenter's behavior.
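The arithmetic behind that loop, as a sketch (the 2 MiB default page size and the round node sizes are assumptions for illustration):

```go
package main

import "fmt"

// allocatableMiB returns the regular memory left after the kernel reserves
// nrHugepages pages of pageSizeMiB each. Hugepages are carved out of RAM at
// boot, so the kubelet advertises them separately from ordinary memory.
func allocatableMiB(nodeMemoryMiB, nrHugepages, pageSizeMiB int) int {
	return nodeMemoryMiB - nrHugepages*pageSizeMiB
}

func main() {
	// vm.nr_hugepages = 2048 with the default 2 MiB page size reserves 4 GiB.
	left := allocatableMiB(8*1024, 2048, 2)
	fmt.Printf("regular memory left on an 8 GiB node: %d MiB\n", left)
	// A 3 GiB request nominally fits in what remains, but kubelet/system
	// reservations come out of the same pool, so the pod may still fail to
	// schedule, and Karpenter launches yet another identical node.
	fmt.Println("3 GiB request fits before reservations:", 3*1024 < left)
}
```

If Karpenter's simulation assumes the full 8 GiB is allocatable while the real node only exposes a fraction of it, every launch decision looks valid up front and fails after the node joins, which matches the behavior described above.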

chomatdam commented 8 months ago

We're facing the same issue with KubeVirt. Given that it's been ongoing for a while, it might be good to consider both a short-term solution to unblock and a long-term solution?

I noticed PR https://github.com/kubernetes-sigs/karpenter/pull/603 mentioning the deprecated Karpenter config map and a Slack conversation started here. As an alternative, I created a fork using the same approach but sourcing configuration from options (arg or environment variable). Would this be an interesting direction to explore? Or is the current state of this issue more "not a priority, maintain your forks until we have a better design / long-term approach for it"?

garvinp-stripe commented 6 months ago

Bringing in some context on hugepages, which I think is more problematic than just "defining custom allocatable". Hugepage capacity is essentially user-configurable, a mix of instance type and the user's needs. That means you could have different hugepage allocatable even within the same instance type, depending on what the node is used for. To add to the problem, hugepages are pre-allocated at boot and set at the Linux level, so at best they can be set at the NodeClass level and passed through the node's startup script. But because a NodeClass can be used by different instance types, the NodeClass itself cannot be relied upon to know ahead of time how much hugepage resource will be available.

What does this all mean? Implementation would be difficult, because for this to work Karpenter would need to know ahead of time a mapping of every permutation of instance type and possible hugepage configuration. That means the user must provide an instance-type-to-hugepage-resource mapping in the NodePool, in addition to the instance-type-to-hugepage configuration in the NodeClass that controls how instances come up.

I am likely missing a few pieces of this puzzle, but this is what I think needs to be solved for hugepages.

jonathan-innis commented 6 months ago

I am likely missing a few pieces of this puzzle but this what I think needs to be solved for hugepages

I think there are a couple simplifications that we could do here to support hugepages if we wanted to:

  1. Consider the entire available memory to be used for hugepages. Add up all of the hugepages into the resource requests for the NodeClaim and then launch an instance, configuring the startup script to start with that many hugepages so that all of the pods can schedule
  2. Allow users to configure a percentage of the memory to be allocated to hugepages. We would calculate the hugepages during the GetInstanceTypes() call and then use that for scheduling. If we allowed this, this would most likely be a setting on the NodeClass and then we would just pass it down through the GetInstanceTypes() call from the CloudProvider.
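Option 2 might look roughly like this (a sketch; hugepagesFromPercent and the hugePageMemoryPercent setting are invented names for illustration, not Karpenter's API):

```go
package main

import "fmt"

// hugepagesFromPercent splits an instance type's memory into a hugepages
// pool and the remaining regular memory, given a NodeClass-style percentage.
// The shape is illustrative, not Karpenter's actual GetInstanceTypes() code.
func hugepagesFromPercent(instanceMemoryMiB, percent int64) (hugepagesMiB, memoryMiB int64) {
	hugepagesMiB = instanceMemoryMiB * percent / 100
	// Hugepages come out of the same physical memory, so only the
	// remainder is advertised as regular memory for scheduling.
	memoryMiB = instanceMemoryMiB - hugepagesMiB
	return hugepagesMiB, memoryMiB
}

func main() {
	// e.g. an 8 GiB instance with a hypothetical hugePageMemoryPercent: 25
	h, m := hugepagesFromPercent(8*1024, 25)
	fmt.Printf("hugepages-2Mi: %d MiB, memory: %d MiB\n", h, m)
}
```

The same percentage would then be baked into the node's startup script (vm.nr_hugepages) so the launched node actually matches what the scheduler was told.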
garvinp-stripe commented 6 months ago

Consider the entire available memory to be used for hugepages. Add up all of the hugepages into the resource requests for the NodeClaim and then launch an instance, configuring the startup script to start with that many hugepages so that all of the pods can schedule

One issue here is that hugepages are carved out of memory. For us this doesn't matter, because we actually do want to move to all hugepages, but most users likely have a mix of hugepages and normal memory. If you advertise everything as hugepages, then your nodes technically don't have any regular memory.

Allow users to configure a percentage of the memory to be allocated to hugepages. We would calculate the hugepages during the GetInstanceTypes() call and then use that for scheduling. If we allowed this, this would most likely be a setting on the NodeClass and then we would just pass it down through the GetInstanceTypes() call from the CloudProvider.

That feels reasonable and removes the need to map instance types to hugepages. Once again, that works for us, but I am unsure whether other users have more unusual configurations.

jonathan-innis commented 6 months ago

all huge pages but most users likely have a mix of huge pages and normal memory

Sure, when you calculate the total of all your hugepages, you would just also have to subtract that from the memory requests, because you intuitively know that one takes away from the other.

Once again that works for us but I am unsure if other users have more unique configuration

Yeah, it's a little tough to boil the ocean here without creating wayyyyy too much configuration and making this likely unreasonable to manage for users.

garvinp-stripe commented 6 months ago

Yeah, it's a little tough to boil the ocean here without creating wayyyyy too much configuration and making this likely unreasonable to manage for users.

Agreed. I think this general approach should work, need some time to bake in my head if we would encounter any issues.

uniemimu commented 6 months ago

We also need this feature. Our use case involves a controller that adds extended resources to nodes immediately when a new node is created. Karpenter will not create a node for pods using such extended resources, because it doesn't understand them.

In our case, using node affinity and node selectors together with existing node labels is sufficient to direct Karpenter to pick a good node. The only thing we need is for Karpenter to ignore a list of extended resources when finding the correct instance type. That said, I do have a forked workaround, but forked workarounds are not acceptable where I work, for good reason.

Having ignorable extended resources wouldn't be new in Kubernetes; they also exist in the scheduler.
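For reference, the scheduler's version of this is the NodeResourcesFit plugin's ignoredResources setting, which tells the scheduler not to count a resource when fitting pods to nodes:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      ignoredResources:
      - smarter-devices/fuse
```

Something analogous on the NodePool or as a Karpenter setting is what's being asked for here.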

garvinp-stripe commented 6 months ago

How much appetite is there for simply having an override ConfigMap with per-instance-type resource capacity overrides, used only for Karpenter's scheduling simulation (supporting hugepages and possibly other extended resources)? https://github.com/aws/karpenter-provider-aws/blob/main/pkg/providers/instancetype/types.go#L179

A ConfigMap that maps instance types to resource overrides; if a particular resource isn't overridden, take what the cloud provider reports.

Pin the ConfigMap per NodeClass via a new NodeClass setting, instanceTypeResourceOverride. Note that changes to the ConfigMap won't be reflected on current nodes; we would use drift to reconcile the changes.

This pushes the onus onto users to ensure their overrides are correct. We won't provide any sophisticated pattern matching, and users can build their own generator for producing this map.

apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-instance-type-resource-override-config
  namespace: karpenter
data:
  nodepoolexample.overrides: |
    {
      "m5.xlarge": {
        "memory": "4Gi",
        "hugepages-2Mi": "10Gi"
      }
    }
johngmyers commented 6 months ago

Hopefully users wouldn't need to maintain their own list of acceptable instance types in order to handle the "fuse" use case, as fuse doesn't depend on particular instance types.

It's a bit frustrating that the fuse use case is being held up by hugepages. The fuse use case is probably common enough to justify being handled out of the box.

GnatorX commented 6 months ago

I think fuse's use case is not the same as hugepages and shouldn't be tied together. Fuse likely can do https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ?

uniemimu commented 6 months ago

I think fuse's use case is not the same as hugepages and shouldn't be tied together. Fuse likely can do https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ?

In its current form DRA does not work with cluster autoscalers. Some future versions of DRA might work with cluster autoscalers, but such a version isn't available yet.

The current DRA relies on a node-level entity, namely the resource driver kubelet plugin daemonset, which will not deploy before the node is created. Since cluster autoscalers don't know anything about DRA, they will not create a node for a pending pod that requires DRA resource claims. DRA users are in the same limbo as are the extended resource users. The cluster autoscaler can't know whether the new resources will pop up in the node as a result of some controller or daemonset. Maybe they will, maybe they won't.

I'm all for giving the users the possibility to configure the resources for Karpenter in a form of a configmap or CRD or similar. A nice bonus would be if one could also define extended resources which are applied to all instance types, covering in a simple fashion the fuse-case.

GnatorX commented 6 months ago

A nice bonus would be if one could also define extended resources which are applied to all instance types, covering in a simple fashion the fuse-case.

That feels fine also. Let me try to bring this up during working group meeting

Bourne-ID commented 5 months ago

Curious if this could be a configuration on the NodePool; we're already able to add custom requirements to allow Karpenter to schedule when hard affinities or tolerations are defined. Would having an entry that hints to Karpenter "this node pool will satisfy requests/limits for [custom capacity]" be an option?

My use case is smarter-devices/kvm, which can be filtered on a NodePool as metal. I could imagine the same for hugepages or similar: we know which instances have these, so we can filter them using custom NodePools.

By using weighting we can define these after the main NodePools. In my example, I would have spot-for-all at weight 100, on-demand-for-all at weight 90, and then our KVM pool with capacity hints at weight 80.

In the meantime, I'm using an overprovisioner pinned with a hard affinity to metal instances to ensure these pods can be scheduled; it's a trade-off with extra cost, but it lets us use Karpenter exclusively.
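A capacity hint could look something like this (entirely hypothetical; capacityHints is not an existing NodePool field, and the requirement shown is just one way to pin the pool to metal instances):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: kvm-metal
spec:
  weight: 80
  template:
    spec:
      requirements:
      - key: karpenter.k8s.aws/instance-size
        operator: In
        values: ["metal"]
      # Hypothetical field: extended resources this pool's nodes will
      # expose once their device-plugin daemonset starts.
      capacityHints:
        smarter-devices/kvm: "1"
```

During scheduling simulation, Karpenter would count the hinted resources as satisfiable by this pool instead of failing with "no instance type satisfied resources".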

ellistarn commented 5 months ago

I wonder if this is something that might be useful to configure both at the node pool level and at the instance type level. Ultimately, we were leaning away from an InstanceTypeOverride CRD due to the level of effort to configure it, but perhaps supporting both provides an escape hatch as well as the ability to define a simple blanket policy.

We could choose any/all of the following:

  1. CloudProvider automatically knows extended resource values (e.g. GPU)
  2. NodePool (or class) lets you specify a flat resource value per node pool
  3. NodePool (or class) lets you specify scalar resource values (e.g. hugePageMemoryPercent)
  4. InstanceType CRD (or ConfigMap) lets you define per-instance-type resource overrides.
fmuyassarov commented 5 months ago

/cc

Bryce-Soghigian commented 5 months ago

/assign Bryce-Soghigian

AaronFriel commented 5 months ago

I'm running into this as well, and I'd very much like a solution like this:

  nodepoolexample.overrides: |
    {
      "m5.xlarge": {
        "memory": "4Gi",
        "hugepages-2Mi": "10Gi"
      }
    }

Albeit as a NodePool configuration for specifying manual node resources. My reasoning is a bit different from the fuse use case, but I think it explains why it would be important for NodePool to have this capability.

First, the field of accelerators is changing rapidly: NVIDIA Multi-Instance GPU (MIG) resources, for example, are complex and not stable. I don't think cloud providers will keep up to date with what NVIDIA's drivers ship.

Second, as evidenced by the above, resources can be hierarchical, and Kubernetes may eventually adapt to support complex hierarchies like so:

(attached image: hierarchy of NVIDIA MIG GPU profiles)

Users may wish to manually specify one NodePool that provides one 7g.40gb per A100, and another NodePool for smaller models that packs more densely with seven 1g.5gb resources per A100. Allowing manual overrides lets users make better use of cloud resources, as the current accelerator resource labels are too coarse.

Bryce-Soghigian commented 5 months ago

https://docs.google.com/document/d/1vEdd226PYlGmJqs6gWlC2pTyDKhZE8DyCU2SbNB35wM/edit I'm collecting user stories from customers about their extended-resources needs in this doc. Please leave some comments there!

After we feel confident we have captured all the critical usecases, I will go through and propose some RFCs to solve the various dimensions of these problems.

AaronFriel commented 5 months ago

@jmickey has captured my comments above in the doc.

ellistarn commented 3 months ago

For anyone watching this issue, I have a proof of concept to solve this problem here: https://github.com/kubernetes-sigs/karpenter/pull/1305

daverin commented 1 month ago

Searching for a fix.

All I want is for Karpenter to ignore this custom resource.

My current workaround is absolutely hideous:

          resources:
            requests:
              cpu: 4000m
              memory: 16Gi
            limits:
              cpu: 11000m
              memory: 24Gi
              xilinx.com/fpga-xilinx_u30_gen3x4_base_2-0: 1
              # Karpenter will not provision a node while this custom device
              # is present. To provision: comment it out, wait for the node to
              # launch and the daemonset to advertise the device, then
              # uncomment, sync kustomization, and kill the old pod.
Does anyone have a better workaround?

poussa commented 1 week ago

I guess this feature request did not make the v1.0 release. Can someone confirm?