Mega Issue: Manual node provisioning

ellistarn commented 2 years ago

Tell us about your request What do you want us to build?

I'm seeing a number of feature requests to launch nodes separately from pending pods. This issue is intended to broadly track this discussion.

Use Cases:

Create a System pool to run components like karpenter, loadbalancer, coredns, etc
Provision baseline capacity that never scales down
Manually preprovision a set of nodes before a large event

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

ellistarn commented 2 years ago

One design option would be to introduce a Node Group custom resource that maintains a group of nodes with a node template + replica count. This CR would be identical to the Provisioner CR, except TTLs are replaced with replicas.

apiVersion: karpenter.sh/v1alpha5
kind: NodeGroup
metadata:
  name: default
spec:
  replicas: 1
  taints: [...]
  requirements:
    - key: karpenter.k8s.aws/instance-size
      operator: In
      values: ["large"]
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: ["c5", "r5", "m5"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  providerRef:
    name: my-provider
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: my-provider
spec:
  subnetSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}" 
  securityGroupSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"

FernandoMiguel commented 2 years ago

most of us have aws managed node groups on ASG with at least 2 nodes to handle this

ellistarn commented 2 years ago

most of us have aws managed node groups on ASG with at least 2 nodes to handle this

I agree that many of these cases are handled by simply using ASG or MNG. Still worth collating these requests to see if this assumption is bad for some cases.

FernandoMiguel commented 2 years ago

most of us have aws managed node groups on ASG with at least 2 nodes to handle this

I agree that many of these cases are handled by simply using ASG or MNG. Still worth collating these requests to see if this assumption is bad for some cases.

I would love to have karpenter handle if all... But we still need a place to run karpenter from.

And only dirty way I see it doing that is to deploy an ec2, deploy karpenter there with two replicas with anti affinity hostname, karpenter would deploy a second node now managed by karpenter, deploy the second replica, and kill off the first manually deployed vm.

Or we can just have it tagged with something that makes karpenter manage it until it hits its TTL.

gazal-k commented 2 years ago

We're considering running karpenter and coredns on Fargate and karpenter then provisioning capacity for everything else.

I believe there was some documentation about this somewhere. Also, it was an AWS SA who suggested also running coredns on Fargate (we were originally thinking about just running karpenter on Fargate)

gazal-k commented 2 years ago

Manually preprovision a set of nodes before a large event

For this use case, wouldn't changing the minReplicas on desired application HPAs work better? That's what we do, so that there is no delay in spinning up more Pods for a rapid surge in traffic.

FernandoMiguel commented 2 years ago

We're considering running karpenter and coredns on Fargate and karpenter then provisioning capacity for everything else.

I tried that a couple weeks ago and was a very frustrating experience with huge deployment times, and many many timeouts and failures. Not the best ops experience. Until coredns is fargate native without stupid hacks to modify the deployment, I don't believe this is the best path.

gazal-k commented 2 years ago

I tried that a couple weeks ago and was a very frustrating experience with huge deployment times, and many many timeouts and failures. Not the best ops experience. Until coredns is fargate native without stupid hacks to modify the deployment, I don't believe this is the best path.

Does this not work: https://docs.aws.amazon.com/eks/latest/userguide/fargate-getting-started.html?

FernandoMiguel commented 2 years ago

I tried that a couple weeks ago and was a very frustrating experience with huge deployment times, and many many timeouts and failures. Not the best ops experience. Until coredns is fargate native without stupid hacks to modify the deployment, I don't believe this is the best path.

Does this not work: https://docs.aws.amazon.com/eks/latest/userguide/fargate-getting-started.html?

We are a terraform house, so the steps are slightly different. I've had eks clusters with fargate only workloads work a few times, but it's really hit and miss kinda of deployments. Patching coredns is and hard problem.

ermiaqasemi commented 2 years ago

In some cases, especially in critical workloads that always need some node to be up and running like auto-scaling, It will be great if we can set that X minimum node is always available for scheduling nodes.

realvz commented 2 years ago

Another use case is when you'd like to prewarm nodes at scheduled times. Currently, customers have to wait for Karpenter to provision nodes when pods are pending or create dummy pods that trigger scaling before production workload begins. Reactive scaling is slow and the alternative seems a workaround.

Ideally, customers should be able to create a provisioner schedule that creates and deletes nodes based on a defined schedule. Alternatively, Karpenter can have a CRD that customers can manipulate themselves to precreate nodes (without having pending pods).

cove commented 2 years ago

our use case is we need to increase our node count during version upgrades which can take hours/days, during that time we cannot have any scale downs, so being able to have our upgrading app be able to manually control what's going on would be ideal.

(also for context, our case isn't a web app, but an app the maintains a large in memory state that needs to be replicated during an upgrade, before being swapped out.)

mattsre commented 2 years ago

Following from the "reserve capacity" or Ocean's "headroom" issue here: https://github.com/aws/karpenter/issues/987

Our specific use case is we have some vendored controller that polls an API for workloads, and then schedules pods to execute workloads as they come in. The vendored controller checks to see if nodes have the resources to execute the workload before creating a pod for it. Because of this, no pods are ever created once the cluster is considered "full" by the controller. We've put in a feature request to the vendor to enable a feature flag on this behavior, but I still think there could be benefit to having some headroom functionality as described in Ocean docs here: https://docs.spot.io/ocean/features/headroom for speedier scaling

Maybe headroom could be implemented on a per provisioner level? The provisioners already know exactly how much cpu/memory they provision, and with the recent consolidation work I'd assume there's already some logic for knowing how utilized the nodes themselves are.

grosser commented 2 years ago

Fyi we add headroom to our clusters by scheduling low priority pods, but it sounds like that would not work for your case either.

On Sun, Nov 6, 2022, 1:22 PM Matt Conway @.***> wrote:

Following from the "reserve capacity" or Ocean's "headroom" issue here: aws/karpenter#987 https://github.com/aws/karpenter/issues/987

Our specific use case is we have some vendored controller that polls an API for workloads, and then schedules pods to execute workloads as they come in. The vendored controller checks to see if nodes have the resources to execute the workload before creating a pod for it. Because of this, no pods are ever created once the cluster is considered "full" by the controller. We've put in a feature request to the vendor to enable a feature flag on this behavior, but I still think there could be benefit to having some headroom functionality as described in Ocean docs here: https://docs.spot.io/ocean/features/headroom for speedier scaling

Maybe headroom could be implemented on a per provisioner level? The provisioners already know exactly how much cpu/memory they provision, and with the recent consolidation work I'd assume there's already some logic for knowing how utilized the nodes themselves are.

— Reply to this email directly, view it on GitHub https://github.com/aws/karpenter-core/issues/749, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAACYZ6CVCEGUREI4XLWU5DWHAHQRANCNFSM52W45W4A . You are receiving this because you are subscribed to this thread.Message ID: @.***>

grosser commented 2 years ago

FYI here is a rough draft how I think this feature could look like ... basically have a new field on the provisioner and then add fake pods before the scheduler makes it's decisions https://github.com/aws/karpenter-core/pull/62

crazywill commented 1 year ago

We would also hope Karpenter to support warm pool. Right now it takes 6 minutes to spin up a node, which is too long for us. We would like to have a warm pool feature similar to asg.

abebars commented 1 year ago

+1 to warm pool, I opened another issue which is more towards the warm pool options. @crazywill feel free to chime in there if you have a chance

jackfrancis commented 1 year ago

Adding some thoughts here.

We could emulate cluster-autoscaler's min-replica-count and max-nodes-total approach?

If you set a minimum, then the karpenter provisioner will, if necessary, statically scale out to that number of nodes
If you set a maximum, then the karpenter provisioner will not scale beyond that node count
If you set minimum and maximum to the same value, the karpenter provisioner will set the cluster node count to a static number (the value of minimum and maximum)
If you do not provide a minimum configuration, the default is "no minimum" (effectively we give karpenter provisioner permission to scale in to zero nodes)
If you do not provider a maximum configuration, the default is place no node count limit on karpenter provisioner's ability to create more nodes, if necessary

One obvious difference between cluster-autoscaler and karpenter is that by design "number of nodes" is not a first class operational attribute in terms of describing cluster node capacity (because nodes are not homogeneous). So using "minimum number of nodes" to express desired configuration for solving some of the stories here (specifically the warm pool story) isn't sufficient by itself: you would also need "type/SKU of node". With both "number" + "SKU" you can deterministically guarantee a known capacity, and now you're sort of copying the cluster-autoscaler approach.

However, the above IMO isn't really super karpenter-idiomatic. It would seem better to express "guaranteed minimum capacity" in a way that was closer to the operational reality that actually informs the karpenter provisioner. Something like:

minCPURequestsAvailable
minMemRequestsAvailable

etc.

Basically, some sufficient amount of input that karpenter could use to simulate a "pod set" to then sort of "dry run" into the scheduler:

Take my existing Ready nodes and assume that nothing is actually running on them
Do said Ready nodes fulfill the "dry run" configuration? (e.g., could I schedule 500 CPU cores, 1TB memory?, whatever)

It gets trickier when you consider the full set of critical criteria that folks use in the pod/scheduler ecosystem: GPU, confidential compute, etc. But I think it's doable.

ellistarn commented 1 year ago

Thanks for writing this up! Bouncing some thoughts.

So using "minimum number of nodes" to express desired configuration for solving some of the stories here (specifically the warm pool story) isn't sufficient by itself: you would also need "type/SKU of node".

Running a wide open provisioner with a fixed replica count is definitely a weird UX. One of the interesting things I've learned from Karpenter customers is that most aren't running with wide open Provisioner requirements, they've struck a middle ground between rigid single-dimension pools and the full flexibility possible with Karpenter.

It's not uncommon for users to run with { family = c,m,r; size = 2xlarge,4xlarge }. In this case, if you ran with replicas = 3, the semantic makes much more sense. "Give me the cheapest 3 nodes, within these families and sizes, whatever is available". In an unconstrained market, you'd get the same (cheapest) every time. But it's especially valuable for spot markets, where the prices may shift over time, and you're happy to get whatever capacity is available. The semantic ends up as something like "each node must have at least this much capacity". FWIW, customers can also configure flexible node groups within ASG and EKS Node Groups for a similar semantic.

The main difference in my mind, is that this semantic is reasonable for static capacity, but for dynamic capacity, Karpenter's current approach is able to be much more flexible without the overhead of many groups. This makes me think that the spec.requirements for a replica driven vs pending pod driven Provisioner would be fundamentally different.

We could emulate cluster-autoscaler's min-replica-count and max-nodes-total approach?

Because of the above, this hybrid mode is where the awkwardness comes in. You'd want tight requirements to bound minReplicas, but then loose requirements for any additional capacity. Instead, would it make more sense to just define two provisioners? One to define the static capacity, and another to define the dynamic capacity?

ellistarn commented 1 year ago

However, the above IMO isn't really super karpenter-idiomatic. It would seem better to express "guaranteed minimum capacity" in a way that was closer to the operational reality that actually informs the karpenter provisioner. Something like:

minCPURequestsAvailable minMemRequestsAvailable

The challenge with this is that it neither has the determinism of the replica based approach, nor the critical context of pod scheduling information. There's no way to control whether or not you get 64 x 1 core machine vs 1 x 64 core machine. You can make this decision much more effectively once you add a set of pods and their scheduling constraints to the mix, but once you have that information, you no longer need to know min, since you can sum the pods' resource requests.

The way I think about this is that there are two personas (or maybe two use cases for one persona) when configuring capacity.

Infrastructure driven capacity (node replica counts + requirements)
Application driven capacity (pod scheduling constraints)

The former is more traditional and manual, but works beautifully for simple use cases like small clusters or capacity pools for system components.

jackfrancis commented 1 year ago

@ellistarn it sounds like what you're suggesting is that we turn the node provisioner into a sort of set of precedence-ordered provisioners (in this case 2 provisioners):

First order precedence: if exists, ensure that infra capacity minimums are fulfilled (a smaller set of appropriate family/size configuration criteria) as defined by "minimum capacity provisioner"
Second order precedence: once the first order precedence item is fulfilled, dynamically scale according to criteria defined in "dynamic capacity provisioner"

There are probably more karpenter-idiomatic ways to classify aws/karpenter#1 and aws/karpenter#2 than "minimum capacity provisioner" and "dynamic capacity provisioner", but hopefully the point is made.

ellistarn commented 1 year ago

There are probably more karpenter-idiomatic ways to classify https://github.com/aws/karpenter/pull/1 and https://github.com/aws/karpenter/pull/2 than "minimum capacity provisioner" and "dynamic capacity provisioner", but hopefully the point is made.

This is all fresh ideation, so the terms aren't settled, but I've been using "Static Capacity" and "Dynamic Capacity" in my head to reason about these.

once #\1 is fulfilled dynamically scale

Yes -- which also supports the case where #\1 isn't defined, where you're scaling from 0 in all cases. It also supports the case where you may have different pools of capacity (protected by taints, etc) where some are static+dynamic, others are just static or just dynamic.

Happy to brainstorm more on slack if easier :)

FernandoMiguel commented 1 year ago

for warm nodes, the hacky approach of having pause pods deployed seems to be the best for each practitioner, cause each can request the closer resources and affinity they need for their workloads, instead of some static CPU capacity.

jackfrancis commented 1 year ago

@FernandoMiguel you seem to be describing a scenario of mixed node capabilities (from the use of the term affinity), and that each capability will have different overhead requirements, and that specific "empty" pods w/ that node affinity metadata are the best way to ensure that the node overhead is present across all node types prior to the business cycle

Is that more or less what you're doing?

FernandoMiguel commented 1 year ago

@jackfrancis ww currently aren't implementing warm nodes... Luckily

ellistarn commented 1 year ago

Just adding this to the brainstorming pipe after a chat with @jackfrancis

apiVersion: karpenter.sh/v1
kind: Headroom
spec:
  policy: Baseline | Headroom | Proportional
  // scheduling constraints
  tolerations:
  nodeAffinity:
  affinity:
  replicas:

jackfrancis commented 1 year ago

@ellistarn the convo w/ @FernandoMiguel above suggests if we consider a route like that we'd want to accommodate sets of Headrooms and make them additive, to be able to sum mutually-exclusive configuration singletons (if that makes sense).

sjmiller609 commented 1 year ago

Coming from this issue.

Our reasons a configurable headroom:

When we get a spot interrupt, fast relocation. some of our workloads are not HA so this period is downtime (for example, Postgres with 1 replica). While some downtime is acceptable, best to minimize it.
Startup time is end-user facing. In a platform as a service model, an end user clicks a button, and the platform deploys workloads for them. Karpenter does great at consolidation, but this also results in high frequency of new deployments requiring new nodes. It would be better to prepare as much as possible ahead of time for our users (for example, start node, have images already pulled).

The ideal situation for us would be to have the headroom configurable as a percentage of the cluster, where it's calculated as:

PercentOverhead is configurable NumberOfNodesOverhead = ceil( number of nodes in cluster * PercentOverhead) ResourcesOverhead = resources_of_largest_x_nodes( NumberOfNodesOverhead )

For examples:

If we have 10% overhead configured, and we have 3 nodes over varying sizes, this would allow for any 1 node to be interrupted but its resources are still available in the cluster.

ceil(3 * .10) = 1
Find the largest node
Deploy nodes such that the resources of the largest node are available in cluster headroom

If we have 10% overhead configures, and we have 11 nodes in the cluster, then any 2 can be interrupted and there is sufficient overhead:
- ceil(11 * .10) = 2
- Find the largest 2 nodes
- Deploy nodes such that the resources of the largest 2 nodes are available in cluster headroom

cest-pas-faux commented 1 year ago

Hello,

Adding my suggestions from aws/karpenter#4409 :

Combine consolidation with a "remaining resources" awareness.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # [...]
  consolidation: 
    enabled: true
    keepExtraResources:
      cpu: 3
      memory: 5Gi

spec:
  # [...]
  consolidation: 
    enabled: true
    ensureAvailable:
      nodes: 1
      resources:
        cpu: 3
        memory: 5Gi

sftim commented 1 year ago

https://github.com/aws/karpenter/issues/987 was closed in favor of this issue, so I'll mention what I'd like to see: a priority cutoff.

My scenario is a cluster operator who wants some reserve capacity but only at a good price.

Let's say I schedule some negative-priority Pods that add up to 12 vCPU, 2GiB of RAM per AZ. I set a pod priority of -1000000 for those Pods to make it clear that anything else should run ahead of these. The idea is that these run a pause container to take up reserved space; as soon as more important work comes in, those placeholder Pods get preempted and the real work can start on an already-running node.

Now, I also configure two provisioners. A spot provisioner and an on-demand provisioner. The spot provisioner is preferred and somehow, mainly out of scope of this comment, I mark that it should never pay more than the on-demand price (ignore reserved instances for now).

I mark the on-demand provisioner with a priority cut-off of -5000. What this would mean is that the on-demand provisioner should ignore pending Pods if they have a lower priority than that number. If there's no spot capacity, the negative-priority pods stay Pending. If the consolidation is active, the cluster gets shrunk to fit - and that might mean draining a node where the placeholder Pods are running, even when the replacement would have nowhere to go.

sftim commented 1 year ago

Also see aws/karpenter#3820

hgrant-ebsco commented 1 year ago

Not sure if this is the proper thread, but it seems to generally fit in with my company's use case. Currently with our cluster autoscaler config, we utilize the scheduled scaling feature of the autoscaling groups on development, and test clusters to scale them down nights and weekends for cost savings. No need to run infra that isn't being used. .

As we look to move from CAS, to Karpenter, we would currently lose this capability since we are relying on the ASG auto scheduling feature to do so. It would be beneficial to be able to set a schedule as part of a provisioner that would delete all karpenter managed nodes for that provisioner at a given time, and then re-enable the provisioning of those nodes at a follow up time.

sftim commented 1 year ago

Currently with our cluster autoscaler config, we utilize the scheduled scaling feature of the autoscaling groups on development, and test clusters to scale them down nights and weekends for cost savings. No need to run infra that isn't being used. .

As we look to move from CAS, to Karpenter, we would currently lose this capability since we are relying on the ASG auto scheduling feature to do so. It would be beneficial to be able to set a schedule as part of a provisioner that would delete all karpenter managed nodes for that provisioner at a given time, and then re-enable the provisioning of those nodes at a follow up time.

If you have Karpenter managing nodes, you can approximate this scheduled scale-in by adding a controller that modifies the Provisioner. Outside the time that you want capacity to be available, set a restriction on the .spec.limits.resources.cpu, which caps the total resource in that pool that the Provisioner is allowed to manage. For example, set that limit to zero. During hours of operation, remove that restriction or put it back to the value you usually want.

There are some frameworks, notably https://github.com/flant/shell-operator, that will let you build this automation without writing much code.

ccortinhas-pmi commented 1 year ago

We also have a similar use-case to others described here. We wish we could setup Karpenter to include some headroom. Be it in the form of number of instances, or resources. We could make either of them work. Karpenter works pretty well as we all know, but having to react for a sudden demand will inevitably create slower experiences to our customers while new instances are provisioned. I know this is a discussion thread to gather use-cases, thoughts and strategies. May I ask if this "feature" (in any form) will undergo development anytime soon?

cdenneen commented 1 year ago

aws/karpenter#987 was closed in favor of this issue, so I'll mention what I'd like to see: a priority cutoff.

My scenario is a cluster operator who wants some reserve capacity but only at a good price.

Let's say I schedule some negative-priority Pods that add up to 12 vCPU, 2GiB of RAM per AZ. I set a pod priority of -1000000 for those Pods to make it clear that anything else should run ahead of these. The idea is that these run a pause container to take up reserved space; as soon as more important work comes in, those placeholder Pods get preempted and the real work can start on an already-running node.

Now, I also configure two provisioners. A spot provisioner and an on-demand provisioner. The spot provisioner is preferred and somehow, mainly out of scope of this comment, I mark that it should never pay more than the on-demand price (ignore reserved instances for now).

I mark the on-demand provisioner with a priority cut-off of -5000. What this would mean is that the on-demand provisioner should ignore pending Pods if they have a lower priority than that number. If there's no spot capacity, the negative-priority pods stay Pending. If the consolidation is active, the cluster gets shrunk to fit - and that might mean draining a node where the placeholder Pods are running, even when the replacement would have nowhere to go.

@sftim you have examples of these 2 provisioners and the pause? It's definitely a use-case I'd like to implement.

sftim commented 1 year ago

Pause image: registry.k8s.io/pause:3.5

Spot provisioner

apiVersion: karpenter.sh/v42alpha0
kind: Provisioner # or NodePool
metadata:
name: spot-only
spec:
providerRef:
  name: common
  kind: EC2Provider
requirements:
- key: "karpenter.sh/capacity-type"
  operator: In
  values: ["spot"]

On-demand provisioner

apiVersion: karpenter.sh/v42alpha0
kind: Provisioner # or NodePool
metadata:
name: fallback # spot AND on-demand
spec:
# THIS IS THE IMPORTANT DETAIL
# 
# When there are unschedulable Pods, but all of the unschedulable Pods have a priority below -5000,
# don't launch any nodes. In this use case, I prefer reserve capacity but only if I can pay a keen price for it.
minimumPodPriority: -5000
providerRef:
  name: common
  kind: EC2Provider
requirements:
- key: "karpenter.sh/capacity-type"
  operator: In
  values: ["spot","on-demand"]

What'd be neat is if there was some way to say that the spot price also had to be at most 90% of the on-demand price. If spot pricing went above on-demand, that'd be a pain. I don't think AWS offers a guarantee that spot prices will always be lower than on-demand.

allamand commented 1 year ago

Hi @sftim normally just going with a single provider spot and on-demand should prioritize spot instance so with better price. AWS specify that When your Spot request is fulfilled, your Spot Instances launch at the current Spot price, not exceeding the On-Demand price. ref: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances-history.html

danielloader commented 1 year ago

One caveat here is I don't want to pay 100% of the on demand price and incur the 2 minute eviction churn.

This likely is why the previous comment set an arbitrary 90% of on demand cost as people will likely have different tolerances for the downsides.

For me some workloads only make sense to incur the node churn if the spot price is say 75% on demand or cheaper due to very large image pulls on machine learning workloads.

ivankatliarchuk commented 1 year ago

Our use case;

We are using Karpenter as our CI/CD job runner provisioner

As a result, we don't fully leverage

Workload consolidation
Resource optimisation

For our CI/CD system, the core requirements are:

Minimizing job start times
Having cost-saving features is a plus, but it should strike a balance between provisioning time and cost savings
Providing an out-of-the-box solution to identify bottlenecks and delays in node provisioning

What is required in our case;

Provision baseline capacity during peak hours, known as 'headroom'. Even better to provide scheduling capabilities.
Reactively scale when there's a spike in jobs, whether it's from 0 to 300 jobs with the same guaranteed resource request or mixed resource requests
Having a warm pool of instances (a provider-specific feature) would be beneficial
Maintain balanced scaling, ensuring an optimal cluster performance by balancing the number of nodes with the pending pods

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

z0rc commented 9 months ago

/remove-lifecycle stale

chamarakera commented 9 months ago

When migrating from Cluster Auto Scaler to Karpenter, we would like Karpenter to provision a node beforehand, before we perform a drain on the old node. It takes time for Karpenter to provision a new node based on the unscheduled workloads, and due to this, the pods are kept in a Pending state for too long.

Bryce-Soghigian commented 9 months ago

@chamarakera would just requesting a node via nodeclaim be enough in this case? No managed solution is requried if you just want to create a node from my testing just applying nodeclaims with a reference to valid nodepool is enough to create a node ahead of time in karpenter.

Note this example uses a instance type size from azure.

k apply -f nodeclaim.yaml

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
  name: temporary-capacity
  labels:
    karpenter.sh/nodepool: general-purpose
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  nodeClassRef:
    name: default
  requirements:
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - Standard_D32_v3
  resources:
    requests:
      cpu: 2310m
      memory: 725280Ki
      pods: "7"
status: {}

njtran commented 9 months ago

@chamarakera how bad is the pod latency you're describing here? Do you have a long bootstrap/startup time on your instance? What would be an acceptable amount of pod latency to not need prewarmed instances ?

chamarakera commented 9 months ago

@Bryce-Soghigian - To use NodeClaims, we would need to generate several NodeClaim resources depending on the cluster's size. I would like to see a configurable option within the NodePool itself to provision nodes prior to the migration. I think, having a way to configure min size/max size parameters in the NodePool itself would be a good solution.

@njtran - Usually it takes for 1 - 2mins for startup, this is ok in non-prod, but in a production I would like to have pending pods scheduled in a node as soon as possible (within few seconds).

cdenneen commented 9 months ago

This is almost like implementing an overprovision pod with low Priority Class attached so it gets bumped but avoids having to wait for resources.

On Thu, Feb 8, 2024 at 12:48 AM Chamara Keragala @.***> wrote:

@Bryce-Soghigian https://github.com/Bryce-Soghigian - To use NodeClaims, we would need to generate several NodeClaim resources depending on the cluster's size. I would like to see a configurable option within the NodePool itself to provision nodes prior to the migration. I think, having a way to configure min size/max size parameters in the NodePool itself would be a good solution.

@njtran https://github.com/njtran - Usually it takes for 1 - 2mins for startup, this is ok in non-prod, but in a production I would like to have all pods scheduled in a node as soon as possible (few seconds).

— Reply to this email directly, view it on GitHub https://github.com/kubernetes-sigs/karpenter/issues/749#issuecomment-1933402880, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFPZYKLMNCMLVQV3LHHWVLYSRRJTAVCNFSM6AAAAAA62KU7WCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZTGQYDEOBYGA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

garvinp-stripe commented 9 months ago

After chatting a bit in Slack on the Karpenter channel with awiesner4 I think i have some thoughts around this problem.

First I want to bring up Karpenter's primary objective then break down Karpenter's current responsibility and maybe this will help drive the design choice. I think Karpenter's main objective is be an efficiency cluster autoscaler so it makes it difficult for leaving around nodes that isn't doing work go against what Karpenter is trying to achieve. I think to add something that works around it would likely be problematic because you would have to work around everything Karpenter is built to do.

However, at the moment is doing more than just autoscaling which is where I think problems and usability issue arises. It autoscales but it also manages nodes. It takes over how nodes are configured and the lifecycle of nodes and it closes the door for other things to manage nodes.

https://github.com/kubernetes-sigs/karpenter/issues/742 https://github.com/kubernetes-sigs/karpenter/issues/688 https://github.com/kubernetes-sigs/karpenter/issues/740 and this issue

What does this mean? I agree with those who are saying nodeclaim should be able to create nodes essentially outside of Karpenter's main autoscaling logic so we don't change what Karpenter is trying to do (save money). I think at this time NodePools does hold the logic and concept of how Karpenter tries to optimize the cluster so I don't think manual provisioning should live there. On the max/min nodes on NodePool, it was pointed out to me that how would a nodepool know what instance types those min nodes should be and in order to protect those min nodes Karpenter would have to cut through most of its disruption logic to support don't drop node count below min.

That isn't the say supporting node management or different autoscaling priority isn't possible but I think the entity that contains that logic should not be node pool in its current form. If Karpenter expands NodePool such that it is extensible, _karpenter_nodepool_provider_N, that allows users to group nodeclaims with different intention that differs from the primary objective of Karpenter. Users can create nodepools variants where keeping a min make sense. Where scheduling logic is different and so on.

From my view, I think its important to keep Karpenter's main focus clear and clean because its complicated enough. But if we allow for more extension on the base concepts, we may be able to support use cases to fall out of what Karpenter is trying to do.

Bryce-Soghigian commented 9 months ago

However, at the moment is doing more than just autoscaling which is where I think problems and usability issue arises.

There are mentions of node autohealing, using budgets to manage upgrades, etc. Seems its moving to be a node lifecycle management tool. It has more value than just autoscaling. So I am for a static provisioning CR that helps manage lifecycle of static pools.

SatishReddyBethi commented 9 months ago

Hi. I am really looking forward to this feature too. Is there an open PR for it? Or is it still in the discussions phase?

cloudwitch commented 7 months ago

We have 9 minimum nodes in our ASG for a batch job workload that gets kicked off by users through a UI.

The users find the EC2 spin-up time unacceptable and expect their pod to spin up quicker than the EC2 can start.

We must have a way to run a minimum number of nodes in a nodepool.

kubernetes-sigs / karpenter

Mega Issue: Manual node provisioning #749

Community Note