vexxhost / magnum-cluster-api

Cluster API driver for OpenStack Magnum
Apache License 2.0

Cluster Autoscaling only adds 1 additional worker node #309

Closed jessica-hofmeister closed 4 months ago

jessica-hofmeister commented 4 months ago

On a Magnum cluster with autoscaling enabled and a max node count of 5, I ran 3 busybox containers with high resource request values to trigger autoscaling. This worked well for the first 2 containers: the first ran on the initial worker, and the second triggered another worker to spawn. The third container, however, just sits in the Pending state:

NAME                       READY   STATUS    RESTARTS   AGE   IP               NODE                                          NOMINATED NODE   READINESS GATES
busybox-66f7cb5b8b-gspns   1/1     Running   0          92m   10.100.28.1      kube-c9d9y-default-worker-infra-nmjcn-cm8pj   <none>           <none>
busybox-66f7cb5b8b-qxb5f   1/1     Running   0          92m   10.100.113.131   kube-c9d9y-default-worker-infra-nmjcn-mz5f8   <none>           <none>
busybox-66f7cb5b8b-sb42l   0/1     Pending   0          92m   <none>           <none>                                        <none>           <none>

and a describe on the Pending pod shows the following events:

Normal   TriggeredScaleUp   2m2s   cluster-autoscaler  pod triggered scale-up: [{MachineDeployment/magnum-system/kube-c9d9y-default-worker-wkqwc 1->2 (max: 2)}]
Normal   NotTriggerScaleUp  110s   cluster-autoscaler  pod didn't trigger scale-up: 1 max node group size reached

The events on the kube-system/cluster-autoscaler-status configmap are:

Events:
  Type    Reason         Age   From                Message
  ----    ------         ----  ----                -------
  Normal  ScaledUpGroup  58m   cluster-autoscaler  Scale-up: setting group MachineDeployment/magnum-system/kube-c9d9y-default-worker-wkqwc size to 2 instead of 1 (max: 2)
  Normal  ScaledUpGroup  58m   cluster-autoscaler  Scale-up: group MachineDeployment/magnum-system/kube-c9d9y-default-worker-wkqwc size set to 2 instead of 1 (max: 2)

So it seems like something is wrong with the cluster autoscaler, since it should be able to spawn 3 more nodes before hitting the maximum.

Here is the command I used to create the cluster (I also tried one with 7 worker nodes, just for kicks):

openstack --os-cloud smoke-test-project coe cluster create \
 --cluster-template smoke-test-v1.27.4-ubuntu-2204-calico \
 --master-count 3 \
 --node-count 5 \
 --fixed-network smoke-test-network \
 --keypair svc_mgd_admin \
 --floating-ip-disabled \
 --labels manila_csi_share_network_id=$manila_id,auto_healing_enabled=True,auto_scaling_enabled=True \
 smoke-test

okozachenko1203 commented 4 months ago

Hi @jessica-hofmeister, this is the expected result. When autoscaling is enabled, you need to set the maximum and minimum node counts with the max_node_count and min_node_count labels. There is no need to specify --node-count, which is ignored. If those labels are not set, the defaults are min=1 and max=2.
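
For reference, a sketch of the create command from the report with those two labels added. The bounds 1 and 5 are only example values matching the intent described in this issue; the label names are the ones mentioned above. --node-count is left out because it is reported as ignored when autoscaling is enabled. The MIN_NODES, MAX_NODES, and AUTOSCALE_LABELS shell variables are helpers made up for this example, and the command is echoed rather than executed so it can be reviewed first.

```shell
# Autoscaling bounds; 1 and 5 are example values, adjust as needed.
MIN_NODES=1
MAX_NODES=5

# Labels from the original command plus the two bounds (helper variable, not part of any CLI).
AUTOSCALE_LABELS="auto_healing_enabled=True,auto_scaling_enabled=True,min_node_count=${MIN_NODES},max_node_count=${MAX_NODES}"

# Print the command for review; remove the leading "echo" to actually create the cluster.
# $manila_id is expected to be set in the environment, as in the original command.
echo openstack --os-cloud smoke-test-project coe cluster create \
  --cluster-template smoke-test-v1.27.4-ubuntu-2204-calico \
  --master-count 3 \
  --fixed-network smoke-test-network \
  --keypair svc_mgd_admin \
  --floating-ip-disabled \
  --labels "manila_csi_share_network_id=${manila_id},${AUTOSCALE_LABELS}" \
  smoke-test
```

With these bounds, the autoscaler events should report (max: 5) for the worker MachineDeployment instead of (max: 2).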