Open · T-Kukawka opened this issue 1 year ago
Next steps: spec for CAPA/Vintage implementation
:wave:
@T-Kukawka this sounds like a really interesting direction. I just wanted to share that we have recently formed a feature group in the cluster-api community to address Karpenter integration. We are planning to have our first meeting after KubeCon next week; more details here: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/community/20231018-karpenter-integration.md
This topic would certainly be welcome if you are interested in having a wider discussion about Karpenter and CAPA.
The main issue we have with the new CRs (NodeClasses, NodePools) is that we can't use LaunchTemplates anymore, as they have been deprecated.
Not being able to reference a LaunchTemplate, as we could in the old Provider CR, means that we need an operator to create and manage a NodeClass (where we set the userData with the values required to join the cluster), which is a bit more involved.
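For illustration, a minimal sketch of the kind of NodeClass such an operator would have to manage under the v1beta1 API. All concrete values here (cluster name, role, discovery tags, bootstrap command) are hypothetical placeholders, not our actual configuration:

```yaml
# Hypothetical EC2NodeClass sketch (karpenter.k8s.aws/v1beta1).
# The role, selector tags and userData below are placeholders.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: example-nodeclass
spec:
  amiFamily: AL2
  role: example-karpenter-node-role        # IAM role the nodes assume
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster
  userData: |
    #!/bin/bash
    # Join logic that previously came from the LaunchTemplate;
    # the bootstrap command is a placeholder.
    /etc/eks/bootstrap.sh example-cluster
```

With LaunchTemplates this userData lived in AWS; with the new API an operator has to render and reconcile it into the NodeClass instead.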
For now I am more inclined to keep using the old releases and see if we can find a solution as a community in the new feature group that @elmiko is leading.
@paurosello one of the top concerns for the Karpenter feature group is ensuring that Cluster API users continue to have the experience of using that (CAPI) API to manage their infrastructure. We are still figuring out what that means, but I think we at least agree that it's a top goal for the group.
Yeah, 100%, and we are committed to working with the community to evolve the Karpenter integration in CAPA. We will need to work with the old API for a while until we get there with the full integration.
Currently the main issue we are facing is that the CAPI taint on the nodes never gets removed, because a Karpenter-provisioned node is not backed by a Machine in the API, and the taint cannot be disabled. More info: https://github.com/kubernetes-sigs/cluster-api/issues/9858
Current Handbook: https://handbook.giantswarm.io/docs/product/managed-apps/karpenter/
IAM/Roles - create the cloud resources Karpenter needs (IAM Role, SQS queue) - basically automatic installation of the prerequisites
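As a rough sketch of how the SQS part is wired up once the queue exists: the upstream Karpenter Helm chart exposes an interruption-queue setting, so the automation would only have to create the queue and pass its name. The queue name below is a placeholder:

```yaml
# Hypothetical Helm values fragment for the upstream karpenter chart.
# settings.interruptionQueue points Karpenter at the SQS queue that
# receives spot-interruption and instance state-change events.
settings:
  interruptionQueue: example-cluster-karpenter  # placeholder queue name
```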
Karpenter-app will have to be installed in the WCs
The Provisioners - right now we have the concept of a node pool (NP), which creates an ASG with the given subnets, instance types, etc. This is replaced in Karpenter with Provisioners that take in labels, taints, instance types, spot configuration, etc.
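A minimal sketch of what a Provisioner expressing the same intent as a node pool could look like on the v1alpha5 API (the one the old releases use). The label key, taint and instance types are placeholders:

```yaml
# Hypothetical v1alpha5 Provisioner sketch mirroring what a node pool
# expresses today: labels, taints, instance types and capacity type.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: example-provisioner
spec:
  labels:
    giantswarm.io/machine-pool: example     # placeholder label
  taints:
    - key: example.io/dedicated             # placeholder taint
      value: workload
      effect: NoSchedule
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.xlarge", "m5.2xlarge"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
```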
Upgrades - currently we do the upgrades via the ASGs on AWS, and Karpenter nodes are outside of that loop. We could set a TTL on the Karpenter nodes, so that the machines are rolled once the TTL expires.
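The TTL idea maps onto an existing knob: on the v1alpha5 API it is `ttlSecondsUntilExpired` on the Provisioner (on v1beta1 NodePools the equivalent is `spec.disruption.expireAfter`). A sketch with a placeholder value:

```yaml
# Hypothetical fragment: expire Karpenter nodes so they get rolled,
# e.g. to pick up a new machine image after an upgrade.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: example-provisioner
spec:
  ttlSecondsUntilExpired: 604800  # placeholder: roll nodes after 7 days
```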
The Networking - right now we create the NP, scale it to 1 node and feed Karpenter with the data from the NP (role, subnets, etc.).
Cluster-autoscaler will have to be tweaked, as in the current state both Karpenter and the autoscaler will act when a pod is pending. For now we have a 5-minute delay in cluster-autoscaler to favor Karpenter for scaling.
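One way to express such a delay, assuming the upstream cluster-autoscaler flag is what is used here: `--new-pod-scale-up-delay` makes the autoscaler ignore pending pods younger than the given age, giving Karpenter the first chance to react. A sketch of the relevant Deployment fragment:

```yaml
# Hypothetical fragment of the cluster-autoscaler Deployment spec.
spec:
  containers:
    - name: cluster-autoscaler
      command:
        - ./cluster-autoscaler
        # Ignore pods pending for less than 5 minutes so Karpenter
        # can schedule them first.
        - --new-pod-scale-up-delay=5m
```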
The Networking and Provisioners points should be properly integrated with the product, e.g. with a custom machine pool (a KarpenterMachinePool exposing the full Karpenter Provisioner implementation) or a label on the MachinePool. What we need here are the Role, LaunchTemplate and Subnet, which are created by either CAPA or Vintage when creating the MachinePool. There are talks upstream about how to integrate this in the CAPI world - https://github.com/aws/karpenter-core/issues/747
[ ] https://github.com/giantswarm/roadmap/issues/2727
[ ] https://github.com/giantswarm/roadmap/issues/3102
[ ] https://github.com/giantswarm/giantswarm/issues/30089
[ ] https://github.com/giantswarm/giantswarm/issues/30231
[ ] https://github.com/giantswarm/roadmap/issues/3382
[ ] https://github.com/giantswarm/roadmap/issues/3345
[ ] https://github.com/giantswarm/giantswarm/issues/28703
[ ] https://github.com/giantswarm/roadmap/issues/3402
[ ] https://github.com/giantswarm/giantswarm/issues/30749
[ ] https://github.com/giantswarm/giantswarm/issues/30985