oracle-terraform-modules / terraform-oci-oke

The Terraform OKE Module Installer for Oracle Cloud Infrastructure provides a Terraform module that provisions the necessary resources for Oracle Container Engine.
https://oracle-terraform-modules.github.io/terraform-oci-oke/
Universal Permissive License v1.0

5.x: Clarify documentation for autoscaler #809

Open rodrigc opened 1 year ago

rodrigc commented 1 year ago

The existing docs for the autoscaler at https://oracle-terraform-modules.github.io/terraform-oci-oke/guide/workers_scaling.html could use some clarification.

In a private discussion, @hyder mentioned:

Let me explain the cluster autoscaler mechanics here

Cluster Autoscaler (CA) has 2 main requirements:
1. The CA pod must run on a worker node from a node pool that it is not itself managing, i.e. one that is not autoscaled.
2. The CA pod must have the necessary permissions to perform cluster autoscaling actions.

For (1), we use a dedicated node pool. By default, to keep costs down, we set its size to 1. From the perspective of the cluster autoscaler, this node pool is "unmanaged". Please do not confuse this with self-managed nodes.

For (2), we do a few things:
a. Ensure that the cluster autoscaler pod always lands on this worker node by using taints and tolerations.
b. Assign Kubernetes labels that facilitate (a).
c. Create a dynamic group that uses defined tags as rules to determine membership.
d. Apply defined tags to the unmanaged worker nodes from the node pool in (1), which makes worker nodes from this pool members of the dynamic group in (c).
e. Create policies that give the dynamic group in (c) the ability to perform autoscaling actions.

When creating worker pools, you then need to let the CA know which worker pools you want to be autoscaled.

For example, you may not want every pool in your cluster to autoscale, especially if their shapes are on the high end and cost more. This matters because OKE allows you to run worker pools with mixed performance profiles in the same cluster.

So the autoscale parameter on each worker pool tells the autoscaler whether it should manage that pool. I think it defaults to false, so you have to enable it explicitly.

@robo-cap

rodrigc commented 1 year ago

@robo-cap in the meeting we had today, I mentioned I came up with this query:

kubectl get nodes --output=custom-columns='NAME:.metadata.name,POOL:.metadata.labels.oke\.oraclecloud\.com/pool\.name,CLUSTER_AUTOSCALER:.metadata.labels.oke\.oraclecloud\.com/cluster_autoscaler'
NAME     POOL    CLUSTER_AUTOSCALER
node01   pool1   managed
node02   pool1   managed
node03   pool2   managed
node04   pool1   managed
node05   pool5   managed
node06   pool3   managed
node07   pool2   managed
node08   pool4   managed
node09   pool1   managed
node10   pool5   managed
node11   pool6   allowed
node12   pool1   managed
node13   pool1   managed
node14   pool4   managed
node15   pool3   managed
node16   pool5   managed
node17   pool5   managed
node18   pool4   managed
node19   pool2   managed
node20   pool6   allowed
node21   pool5   managed
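
With many nodes, the listing above is easier to read when collapsed to one line per pool. The following is a minimal sketch, assuming the three-column output (NAME, POOL, CLUSTER_AUTOSCALER) produced by the query above; `summarize_pools` is a hypothetical helper name, not part of kubectl or this module.

```shell
# Hypothetical helper: reads the custom-columns output from stdin and prints
# one line per (pool, label value) pair together with the number of nodes.
# Usage (pipe the query above into it):
#   kubectl get nodes --output=custom-columns='...' | summarize_pools
summarize_pools() {
  # Skip the header row (NR > 1), ignore short or blank lines, and count
  # nodes keyed on "POOL CLUSTER_AUTOSCALER".
  awk 'NR > 1 && NF >= 3 { count[$2 " " $3]++ }
       END { for (key in count) print key, count[key] }' | sort
}
```

For the 21 nodes listed above, this would report pool1 through pool5 as managed (6, 3, 2, 3, and 5 nodes respectively) and pool6 as allowed (2 nodes).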

As part of clarifying the docs at https://oracle-terraform-modules.github.io/terraform-oci-oke/guide/workers_scaling.html, it would be good to explain what each value of the label oke.oraclecloud.com/cluster_autoscaler (managed, allowed, disabled) actually does.

It wasn't clear to me when reading the docs. Ali clarified in private conversations, but it would be good to have this in the docs.

rodrigc commented 1 year ago

In a meeting I had on August 30, @robo-cap suggested that the autoscaler docs for this module might benefit from a link to the OCI cluster autoscaler documentation at: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/oci#cluster-autoscaler-for-oracle-cloud-infrastructure-oci

I think this is a good idea, since it will help users of this Terraform module who need to set up the autoscaler or debug autoscaler issues.