kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

CAPA / CAPI support and documentation questions #747

Closed Skarlso closed 3 months ago

Skarlso commented 1 year ago

Hello! 👋

Getting right to it there is this doc: https://github.com/aws/karpenter/blob/main/designs/aws-launch-templates-options.md#capi-integration

This says that CAPI integration is discussed in a different doc. Can someone please point me to that doc? :) That would be awesome.

I'm trying to get integration with CAPA started with Karpenter and I was wondering what elements/objects/components CAPA could/should/shouldn't manage with Karpenter. Any help is much appreciated. Cheers!

ellistarn commented 1 year ago

Hey there! This hasn't been worked on yet, but many folks have expressed interest. The high-level thinking so far for a CAPI provider for Karpenter has been:

Skarlso commented 1 year ago

Thanks for all the info @ellistarn! I'll keep an eye open on this and try to achieve something in the meantime with what we have. :)

paurosello commented 1 year ago

Hello! Is there any ongoing effort to find the best way to integrate Karpenter with CAPA right now?

I have been able to make Karpenter boot nodes and join them to the cluster. The part that is currently missing is registering the Machines somehow in the Management Cluster and handling upgrades (currently the only way I found was to set a low node TTL that will eventually roll the whole cluster).
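The low-TTL workaround can be expressed with Karpenter's Provisioner API (a minimal sketch, assuming the v1alpha5 API that aws/karpenter shipped at the time; the name and TTL value are illustrative):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Expire (and thereby replace) every node after 1 hour, forcing a
  # gradual roll of the whole cluster -- the workaround described above.
  ttlSecondsUntilExpired: 3600
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
```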

jackfrancis commented 1 year ago

My initial thoughts on integrating with Cluster API (CAPI) are focused on how we deal with CAPI's idea of machines being replicas (the MachineDeployment resource is CAPI's analogue to k8s's Deployment resource). The canonical use-case for CAPI is that you define a MachineDeployment and then scale it out or in via the replicas field. The spec for a Machine doesn't strictly forbid the notion of heterogeneous VM offerings/SKUs:
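To ground the MachineDeployment/replicas model, a minimal illustrative manifest might look like the following (names and versions are placeholders, not taken from the thread):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0
spec:
  clusterName: my-cluster
  replicas: 3            # scale out or in by editing this field
  template:
    spec:
      clusterName: my-cluster
      version: v1.27.3
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: md-0
      # The provider-specific "recipe" for the machine lives behind this ref.
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: md-0
```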

Where things get interesting is in the actual provider implementation of a machine (note the InfrastructureRef field in CAPI's MachineSpec, which references a provider-specific machine implementation "template" or "recipe"). Here's what AWS's CAPI provider (CAPA) looks like:
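An illustrative sketch of the CAPA side that the infrastructureRef points at (not the exact snippet originally quoted; field names follow CAPA's AWSMachineTemplate API, values are placeholders):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
metadata:
  name: md-0
spec:
  template:
    spec:
      # One fixed instance type, stamped identically onto every replica
      # created from this template.
      instanceType: m4.xlarge
      sshKeyName: default
      iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
```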

Note especially the InstanceType field above, which in its current design is meant to be a common property replicated equivalently across all AWSMachine resources in a CAPA cluster. I.e., if the value is m4.xlarge, then all of the AWSMachines (which ultimately underlie Kubernetes nodes) created during a scale-out event (an increase to the replicas field of the parent CAPI MachineDeployment resource) will be running on m4.xlarge instances.

So, what I think this means in terms of the optimal integration point: as a first pass, we can probably implement an additional Machine spec for each CAPI provider that wishes to implement a Karpenter provisioner, in the existing provider project (for example, CAPA, CAPZ, CAPG). That new spec would not include a "VM type" as a first-class, source-of-truth, declarative config; it would instead include the necessary configuration inputs (VM types, pricing models, spot configuration, etc.) for Karpenter to create new nodes when the replica count increases. The actual VM type chosen could then be "demoted" to a status field in the resultant new spec (let's call it AWSKarpenterMachine).
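The hypothetical AWSKarpenterMachine might sketch out roughly like this (every field here is invented for illustration; no such API exists):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1   # hypothetical
kind: AWSKarpenterMachine                              # hypothetical
metadata:
  name: karpenter-node-abc12
spec:
  # Inputs to Karpenter's selection logic rather than one fixed VM type.
  instanceTypes: ["m4.xlarge", "m5.xlarge", "m5a.xlarge"]
  capacityTypes: ["spot", "on-demand"]
  maxPricePerHour: "0.25"
status:
  # The concrete choice is "demoted" to status, as described above.
  instanceType: m5a.xlarge
  capacityType: spot
```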

I think something like the above could work to best leverage the existing CAPI ecosystem and minimize the amount of net new effort to create this new solution from scratch.

cc'ing CAPI project maintainers @fabriziopandini @vincepri @sbueringer @CecileRobertMichon to get their thoughts on this, I know it's a mouthful!

CecileRobertMichon commented 1 year ago

> So, what I think this means in terms of the optimal integration point: as a first pass, we can probably implement an additional Machine spec for each CAPI provider that wishes to implement a Karpenter provisioner, in the existing provider project (for example, CAPA, CAPZ, CAPG). That new spec would not include a "VM type" as a first-class, source-of-truth, declarative config; it would instead include the necessary configuration inputs (VM types, pricing models, spot configuration, etc.) for Karpenter to create new nodes when the replica count increases. The actual VM type chosen could then be "demoted" to a status field in the resultant new spec (let's call it AWSKarpenterMachine).

I like where this is going. I wonder if, instead of introducing an additional spec, we could make the VM size optional in the providers (that would be breaking, but maybe v1beta2?) and then define the "necessary configuration inputs" for the node in the CAPI Machine itself, i.e. I need this much mem/cpu, max price, optimized to run this type of workload, etc. (also all optional). Potentially the VM size could still exist as an explicit override that takes precedence.
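A hypothetical sketch of that shape (these fields do not exist in CAPI today; they are invented purely to illustrate the proposal):

```yaml
apiVersion: cluster.x-k8s.io/v1beta2   # hypothetical future version
kind: Machine
metadata:
  name: worker-0
spec:
  clusterName: my-cluster
  # Instead of a provider-specific VM size, declare what the node needs
  # and let a Karpenter-style provisioner pick a matching instance.
  compute:                 # hypothetical block
    cpu: "4"
    memory: 16Gi
    maxPricePerHour: "0.30"
    workloadClass: general-purpose
  # An explicit VM size, if set, could still override the above.
```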

Also tagging @elmiko for thoughts

elmiko commented 1 year ago

i like what @jackfrancis is thinking, i also think there will be some cool interactions between how Karpenter works and the CAPI MachinePool type. i'm not sure how the instance sizes will get communicated but it seems like a natural fit to me.

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

elmiko commented 4 months ago

we are working towards a cluster api enhancement proposal (CAEP) from the capi karpenter feature group, perhaps we should update the doc to point users to the feature group's work?

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/karpenter/issues/747#issuecomment-2038220551):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

elmiko commented 3 months ago

i'm happy to allow this to close or keep it open for tracking, whatever folks prefer.

for anyone who is curious about the karpenter provider cluster-api, please come visit the cluster api karpenter feature group.

vazkarvishal commented 1 month ago

> i'm happy to allow this to close or keep it open for tracking, whatever folks prefer.
>
> for anyone who is curious about the karpenter provider cluster-api, please come visit the cluster api karpenter feature group.

I am super interested to know where this goes, as missing support here would be a blocker for folks who have already adopted Karpenter on AWS (and now also on Azure) and want to switch to CAPA/CAPI for cluster management.

elmiko commented 1 month ago

for now, the best way to follow the progress is to attend our feature group meetings, or review the agenda. i try to keep notes there, and we do record the meeting.