unity-sds / unity-sps

The Unity SDS Processing Service facilitates large-scale data processing for scientific workflows.
Apache License 2.0

Implement autoscaling of Kubernetes worker nodes #45

Open LucaCinquini opened 6 months ago

LucaCinquini commented 6 months ago

Use Karpenter for autoscaling of nodes

Acceptance Criteria:
o Demonstrated autoscaling of k8s nodes (and pods) when a large number of jobs is submitted, and scaling back down to 0 nodes when all jobs are executed
o CI/CD pipeline for nightly test of autoscaling up and down

LucaCinquini commented 4 months ago

Also needed to support the On Demand task

LucaCinquini commented 4 months ago

From Drew on Slack: "I’ve made progress on integrating Karpenter into SPS for node autoscaling. However, I’ve hit an IAM blocker involving the CreateFleet operation. It appears the CreateFleet operation is not allowed for the mcp-tenantOperator-AMI-APIG permissions boundary which Karpenter uses for its KarpenterController IAM role. I’ve manually tried the CreateFleet operation using the AWS CLI and was able to replicate the permission denied issue while assuming the mcp-tenantOperator role."

After this discussion, Mike filed a ticket with MCP support to update the permissions of the mcp-tenantOperator role.

LucaCinquini commented 4 months ago

MCP ticket: https://jaas.gsfc.nasa.gov/servicedesk/customer/portal/2/GSD-3066

drewm-jpl commented 3 months ago

Update: Still dealing with MCP permissions issues (surprise, surprise!)

LucaCinquini commented 3 months ago

Drew worked with MCP to solve all the IAM permissions problems. The latest version works for autoscaling nodes. Verified following Drew's instructions on Slack:

To test the autoscaling you can do the following:

  1. Scale up a dummy demo deployment named “inflate”:
     kubectl scale deployment inflate --replicas 10
  2. Monitor the logs:
     kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller
     You’ll eventually see a line like this in the logs:
     {"level":"INFO","time":"2024-04-11T20:54:10.949Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"17dd42b","nodeclaim":"default-k64zz","provider-id":"aws:///us-west-2b/i-0941d23862a1d11ac","instance-type":"c5.2xlarge","zone":"us-west-2b","capacity-type":"spot","allocatable":{"cpu":"7910m","ephemeral-storage":"26Gi","memory":"14162Mi","pods":"58","vpc.amazonaws.com/pod-eni":"38"}}
     In the AWS console, go to your EKS cluster and check out the Compute tab. You should see a new node pop up.
  3. Scale down the dummy demo deployment named “inflate”:
     kubectl scale deployment inflate --replicas 0
     You’ll eventually see a line like this in the logs:
     {"level":"INFO","time":"2024-04-11T21:07:03.235Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"17dd42b","nodeclaim":"default-k64zz","node":"ip-10-6-48-231.us-west-2.compute.internal","provider-id":"aws:///us-west-2b/i-0941d23862a1d11ac"}
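
For the nightly autoscaling test in the acceptance criteria, the manual log check above could be automated. The following is only a minimal sketch: the function name `nodeclaim_events` is hypothetical, the sample lines are abbreviated copies of the two log lines quoted above, and it assumes the controller keeps emitting JSON lines with the same `message`, `nodeclaim`, and `instance-type` fields, which may differ between Karpenter versions.

```python
import json

# Messages taken from the two log lines quoted in this issue.
WANTED = {"launched nodeclaim", "deleted nodeclaim"}


def nodeclaim_events(log_lines):
    """Yield (message, nodeclaim, instance_type) for scale-up/scale-down events.

    Lines that are not JSON objects are skipped, since controller output
    may be mixed with other text.
    """
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if record.get("message") in WANTED:
            # "instance-type" is absent from the termination line, so use .get().
            yield (record["message"], record.get("nodeclaim"), record.get("instance-type"))


# Abbreviated versions of the log lines shown above (assumption: field names unchanged).
SAMPLE_LOG = [
    '{"level":"INFO","logger":"controller.nodeclaim.lifecycle",'
    '"message":"launched nodeclaim","nodeclaim":"default-k64zz",'
    '"instance-type":"c5.2xlarge","capacity-type":"spot"}',
    "some non-JSON line",
    '{"level":"INFO","logger":"controller.nodeclaim.termination",'
    '"message":"deleted nodeclaim","nodeclaim":"default-k64zz",'
    '"node":"ip-10-6-48-231.us-west-2.compute.internal"}',
]

events = list(nodeclaim_events(SAMPLE_LOG))
for message, nodeclaim, instance_type in events:
    print(message, nodeclaim, instance_type)
```

A test could feed this function the output of the `kubectl logs` command from step 2 and pass only once both a "launched nodeclaim" and a "deleted nodeclaim" event have been seen for the same nodeclaim.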
LucaCinquini commented 3 months ago

Note the double "//" in this line:

source = "git@github.com:unity-sds/unity-cs-infra.git//terraform-unity-eks_module?ref=u-sps-24.1-beta.01"
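
For context, in a Terraform module source address the double slash separates the repository from the subdirectory inside it that holds the module, and `?ref=` selects the Git revision. The sketch below only illustrates how the string above breaks down; `split_module_source` is a hypothetical helper, not part of Terraform or of this repository.

```python
from urllib.parse import parse_qs


def split_module_source(source):
    """Split a Terraform-style module source into (repo, subdir, ref)."""
    base, _, query = source.partition("?")
    # Find the "//" that marks the subdirectory, skipping the one in "scheme://".
    idx = base.find("//")
    while idx > 0 and base[idx - 1] == ":":
        idx = base.find("//", idx + 2)
    if idx == -1:
        repo, subdir = base, ""
    else:
        repo, subdir = base[:idx], base[idx + 2:]
    ref = parse_qs(query).get("ref", [None])[0]
    return repo, subdir, ref


repo, subdir, ref = split_module_source(
    "git@github.com:unity-sds/unity-cs-infra.git//terraform-unity-eks_module?ref=u-sps-24.1-beta.01"
)
print(repo)    # repository to clone
print(subdir)  # directory within the repository that contains the module
print(ref)     # Git tag or branch to check out
```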