pangeo-data / jupyter-earth

Jupyter meets the Earth: combining research use cases in geosciences with technical developments within the Jupyter and Pangeo ecosystems.
https://jupytearth.org
Creative Commons Zero v1.0 Universal

Scaling to meet worker capacity demands. #82

Closed consideRatio closed 2 years ago

consideRatio commented 2 years ago

Our cluster-autoscaler is requesting far more nodes than AWS is providing.

To investigate, I inspected the status reported by the cluster-autoscaler, which requests more nodes for the k8s cluster when it observes pending pods. The command and its output are below.

Note cloudProviderTarget=8 and cloudProviderTarget=3 for these two node pools, and compare them with ready=2 and ready=0 respectively. Why won't AWS scale up for us?

kubectl get cm -n kube-system cluster-autoscaler-status -o yaml
      Name:        eksctl-jmte-nodegroup-worker-a-16-NodeGroup-AU4DMCK46TW0
      Health:      Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=0 cloudProviderTarget=8 (minSize=0, maxSize=8))
                   LastProbeTime:      2021-09-30 16:13:18.791229527 +0000 UTC m=+10531772.496293969
                   LastTransitionTime: 2021-09-30 15:46:22.863867587 +0000 UTC m=+10530156.568932028
      ScaleUp:     InProgress (ready=2 cloudProviderTarget=8)
                   LastProbeTime:      2021-09-30 16:13:18.791229527 +0000 UTC m=+10531772.496293969
                   LastTransitionTime: 2021-09-30 15:58:40.202629511 +0000 UTC m=+10530893.907693954
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2021-09-30 16:13:18.791229527 +0000 UTC m=+10531772.496293969
                   LastTransitionTime: 2021-09-30 15:26:54.12862169 +0000 UTC m=+10528987.833686150

      Name:        eksctl-jmte-nodegroup-worker-a-64-NodeGroup-LMJQS00OMHB4
      Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=3 (minSize=0, maxSize=8))
                   LastProbeTime:      2021-09-30 16:13:18.791229527 +0000 UTC m=+10531772.496293969
                   LastTransitionTime: 2021-09-29 22:40:21.725302077 +0000 UTC m=+10468595.430366524
      ScaleUp:     InProgress (ready=0 cloudProviderTarget=3)
                   LastProbeTime:      2021-09-30 16:13:18.791229527 +0000 UTC m=+10531772.496293969
                   LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2021-09-30 16:13:18.791229527 +0000 UTC m=+10531772.496293969
                   LastTransitionTime: 2021-09-29 20:32:46.571545035 +0000 UTC m=+10460940.276609483
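The comparison between ready and cloudProviderTarget can also be done programmatically. The following is a minimal sketch, not something used in this deployment: the function name `find_lagging_node_groups` and the regular expression are my own, and it assumes the status text has the same layout as the output above (a Name line immediately followed by a Health line).

```python
import re

# Matches a "Name:" line followed by its "Health:" line, capturing the
# node group name, the ready count, and the cloudProviderTarget.
# Assumption: the layout matches the status output shown in this issue.
NODE_GROUP_RE = re.compile(
    r"Name:\s+(?P<name>\S+)\s+"
    r"Health:\s+\S+\s+\(ready=(?P<ready>\d+).*?"
    r"cloudProviderTarget=(?P<target>\d+)"
)


def find_lagging_node_groups(status_text):
    """Return (name, ready, target) for node groups where ready < target."""
    lagging = []
    for match in NODE_GROUP_RE.finditer(status_text):
        ready = int(match["ready"])
        target = int(match["target"])
        if ready < target:
            lagging.append((match["name"], ready, target))
    return lagging


# Trimmed copy of the Name/Health lines from the output above.
SAMPLE_STATUS = """
      Name:        eksctl-jmte-nodegroup-worker-a-16-NodeGroup-AU4DMCK46TW0
      Health:      Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=0 cloudProviderTarget=8 (minSize=0, maxSize=8))
      Name:        eksctl-jmte-nodegroup-worker-a-64-NodeGroup-LMJQS00OMHB4
      Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=3 (minSize=0, maxSize=8))
"""

for name, ready, target in find_lagging_node_groups(SAMPLE_STATUS):
    print(f"{name}: ready={ready}, cloudProviderTarget={target}")
```

For the two node pools above, this reports both as lagging (2 of 8 and 0 of 3). The status text could be fed in from the kubectl command above, for example via `subprocess.run`.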

The AWS console, from https://us-west-2.console.aws.amazon.com/ec2autoscaling/home?region=us-west-2#/details/eksctl-jmte-nodegroup-worker-a-16-NodeGroup-AU4DMCK46TW0?view=activity, reported that it failed to scale up properly.
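The same activity history should also be reachable without the console. Below is a rough sketch using the `describe_scaling_activities` call of the boto3 `autoscaling` client; the helper name `failed_scaling_activities` is hypothetical, and I have not run this against the cluster. The helper takes the client as an argument so it can be tried without AWS credentials.

```python
def failed_scaling_activities(client, asg_name, max_records=20):
    """Return (start_time, status_code, status_message) for each recent
    scaling activity of the Auto Scaling group that did not succeed.

    `client` is expected to behave like boto3.client("autoscaling").
    """
    response = client.describe_scaling_activities(
        AutoScalingGroupName=asg_name,
        MaxRecords=max_records,
    )
    return [
        (a.get("StartTime"), a["StatusCode"], a.get("StatusMessage", ""))
        for a in response["Activities"]
        if a["StatusCode"] in ("Failed", "Cancelled")
    ]


# Usage sketch (requires boto3 and AWS credentials, so left commented out):
#
#   import boto3
#   client = boto3.client("autoscaling", region_name="us-west-2")
#   asg = "eksctl-jmte-nodegroup-worker-a-16-NodeGroup-AU4DMCK46TW0"
#   for start, code, message in failed_scaling_activities(client, asg):
#       print(start, code, message)
```

The StatusMessage of a failed activity usually states why AWS could not launch the instance, which is the information the console's activity view shows.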


Action point

consideRatio commented 2 years ago

This was resolved!