sassoftware / viya4-deployment

This project contains Ansible code that creates a baseline in an existing Kubernetes environment for use with the SAS Viya Platform, generates the manifest for an order, and can then also deploy that order into the specified Kubernetes environment.
Apache License 2.0

feat: (IAC-903) Update Cluster Autoscaler for EKS 1.25 Support #399

Closed: jarpat closed this pull request 1 year ago.

jarpat commented 1 year ago

Changes

Since the 2023.03 cadence adds support for K8s 1.25, we want to support that version across all the IAC projects. The usual process is to bump the kubernetes_version and kubectl values so that they stay within the +1/-1 range of all the Kubernetes versions that the latest cadence supports.
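The "+1/-1 range" refers to kubectl's version skew policy: a kubectl client is supported against API servers within one minor version of itself. The following is a minimal Python sketch of that check, assuming version strings of the form `1.24.10` or `v1.25`. The function names and the example list of supported versions are hypothetical and not taken from the IAC projects.

```python
def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version such as '1.24.10' or 'v1.25'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def kubectl_within_skew(kubectl_version: str, cluster_versions: list[str]) -> bool:
    """True if kubectl is within one minor version (+1/-1) of every cluster version."""
    k_major, k_minor = parse_major_minor(kubectl_version)
    for version in cluster_versions:
        major, minor = parse_major_minor(version)
        # Different major versions, or more than one minor apart, is out of range.
        if major != k_major or abs(k_minor - minor) > 1:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical set of Kubernetes versions a cadence might support.
    supported = ["1.23", "1.24", "1.25"]
    print(kubectl_within_skew("1.24.10", supported))  # True
    print(kubectl_within_skew("1.22.0", supported))   # False: two minors behind 1.24
```

Checking against every supported version, rather than only the newest one, is what lets a single pinned kubectl value serve all of a cadence's supported clusters.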

When we used viya4-deployment to baseline a 1.25 EKS cluster, we ran into an issue installing the cluster-autoscaler. We needed to update it to a chart version that supports the PodDisruptionBudget policy/v1 API, because policy/v1beta1 was removed in Kubernetes 1.25. The new default CLUSTER_AUTOSCALER_CHART_VERSION for 1.25+ clusters is 9.25.0. The new autoscaler version also required changes on the viya4-iac-aws side: the cluster-autoscaler Role policy was updated to match the recommendations in the kubernetes/autoscaler documentation so that the autoscaler can function properly.
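To illustrate why an older chart can break on 1.25, here is a minimal Python sketch that flags PodDisruptionBudget manifests using an API version the target cluster no longer serves. It assumes the rendered chart manifests have already been parsed into dictionaries, and it only models the 1.21 to 1.25 transition (policy/v1 became GA in 1.21; policy/v1beta1 was removed in 1.25). The function names and the example manifest are hypothetical.

```python
def served_pdb_api_versions(k8s_minor: int) -> set[str]:
    """API versions that serve PodDisruptionBudget on Kubernetes 1.<k8s_minor>.

    Only models the 1.21-1.25 transition; very old releases are not considered.
    """
    served = set()
    if k8s_minor >= 21:
        served.add("policy/v1")       # policy/v1 PodDisruptionBudget is GA since 1.21
    if k8s_minor <= 24:
        served.add("policy/v1beta1")  # removed in 1.25
    return served


def unsupported_pdbs(manifests: list[dict], k8s_minor: int) -> list[str]:
    """Names of PodDisruptionBudget manifests whose apiVersion the cluster does not serve."""
    served = served_pdb_api_versions(k8s_minor)
    return [
        m["metadata"]["name"]
        for m in manifests
        if m.get("kind") == "PodDisruptionBudget" and m.get("apiVersion") not in served
    ]


if __name__ == "__main__":
    # Hypothetical manifest as rendered by an older chart.
    old_pdb = {
        "apiVersion": "policy/v1beta1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": "example-autoscaler-pdb"},
    }
    print(unsupported_pdbs([old_pdb], 24))  # []
    print(unsupported_pdbs([old_pdb], 25))  # ['example-autoscaler-pdb']
```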

You can see the policy update in this PR: https://github.com/sassoftware/viya4-iac-aws/pull/189. The changes will be officially released as part of viya4-iac-aws:5.6.0.

Note: for EKS clusters older than 1.25, we still use the default CLUSTER_AUTOSCALER_CHART_VERSION value of 9.9.2.
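The default selection described above can be summarized as a small rule: 9.25.0 for 1.25+ clusters, 9.9.2 otherwise, with an explicitly set CLUSTER_AUTOSCALER_CHART_VERSION taking precedence. The sketch below is a Python illustration of that rule only, not the project's actual Ansible logic; the function name and the `override` parameter are hypothetical.

```python
from typing import Optional

# Default chart versions as described in this PR.
CHART_VERSION_PRE_1_25 = "9.9.2"
CHART_VERSION_1_25_PLUS = "9.25.0"


def resolve_chart_version(kubernetes_version: str, override: Optional[str] = None) -> str:
    """Pick a cluster-autoscaler chart version for a Kubernetes version.

    `override` stands in for a user-set CLUSTER_AUTOSCALER_CHART_VERSION and wins
    when present; otherwise the default depends on whether the cluster is 1.25+.
    """
    if override:
        return override
    # Accepts '1.25', 'v1.25', or EKS-style strings such as '1.25.6-eks-abc'.
    major, minor = (int(part) for part in kubernetes_version.lstrip("v").split(".")[:2])
    if (major, minor) >= (1, 25):
        return CHART_VERSION_1_25_PLUS
    return CHART_VERSION_PRE_1_25


if __name__ == "__main__":
    print(resolve_chart_version("1.24"))  # 9.9.2
    print(resolve_chart_version("1.25"))  # 9.25.0
```

Comparing `(major, minor)` tuples keeps the boundary check correct for any future minor version (for example, 1.26 also resolves to 9.25.0).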

If a user creates their K8s 1.25 infrastructure in AWS with viya4-iac-aws 5.5.0 or earlier (which does not include the Role policy updates), the troubleshooting documentation describes two ways to remediate this: use the latest version of viya4-iac-aws to update the Role policy, or follow manual steps in the AWS IAM Console to achieve the same result.
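Before applying either remediation, it can help to see which autoscaler permissions a Role policy is missing. The following is a rough Python sketch that compares the `Allow` actions in an IAM policy document (parsed from JSON) against a required set, honoring `*` wildcards. The `REQUIRED_ACTIONS` set is an assumed subset for illustration only, not the exact policy from viya4-iac-aws PR #189; check the kubernetes/autoscaler AWS documentation for the authoritative list. The sketch ignores `Deny`, `NotAction`, resources, and conditions, and the function names are hypothetical.

```python
import json
from fnmatch import fnmatchcase

# Assumed subset of cluster-autoscaler actions; verify against the
# kubernetes/autoscaler AWS documentation before relying on it.
REQUIRED_ACTIONS = {
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:DescribeAutoScalingInstances",
    "autoscaling:SetDesiredCapacity",
    "autoscaling:TerminateInstanceInAutoScalingGroup",
    "ec2:DescribeInstanceTypes",
    "ec2:DescribeLaunchTemplateVersions",
    "eks:DescribeNodegroup",
}


def allowed_actions(policy: dict) -> list[str]:
    """Lower-cased action patterns from every Allow statement in the policy."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    actions = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        acts = stmt.get("Action", [])
        if isinstance(acts, str):
            acts = [acts]
        actions.extend(a.lower() for a in acts)
    return actions


def missing_actions(policy: dict, required: set[str] = REQUIRED_ACTIONS) -> list[str]:
    """Required actions not matched by any Allow pattern (IAM actions are case-insensitive)."""
    granted = allowed_actions(policy)
    return sorted(r for r in required if not any(fnmatchcase(r.lower(), g) for g in granted))


if __name__ == "__main__":
    # Hypothetical policy document, e.g. copied as JSON from the IAM console.
    raw = """{"Version": "2012-10-17", "Statement": [{"Effect": "Allow",
      "Action": ["autoscaling:Describe*", "ec2:DescribeLaunchTemplateVersions"],
      "Resource": "*"}]}"""
    print(missing_actions(json.loads(raw)))
```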

Tests

Tested the following scenarios; more details are in the internal ticket.

Note: for all of these scenarios, I set V4_CFG_CAS_WORKER_COUNT: 3 in my Ansible vars to ensure that the autoscaler functioned and provisioned the additional CAS nodes. I also verified that, upon uninstall, the unused nodes were automatically removed.

| Scenario | Task | Provider | Initial viya4-iac-aws version | kubernetes_version | Order | Cadence | Orchestration | Deployment Method | V4_CFG_CAS_WORKER_COUNT | CLUSTER_AUTOSCALER_CHART_VERSION | kubectl Version | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | OOTB | AWS | IAC-903 | 1.22 | ***** | lts:2022.09 | Deployment Operator | Docker | 3 | 9.9.2 (default) | 1.24.10 | |
| 2 | OOTB | AWS | IAC-903 | 1.24 | ***** | stable:2023.02 | Deployment Operator | Docker | 3 | 9.9.2 (default) | 1.24.10 | |
| 3 | OOTB | AWS | IAC-903 | 1.25 | ***** | fast:2020 | Deployment Operator | Docker | 3 | 9.25.0 (default) | 1.24.10 | |
| 4 | OOTB | AWS | 5.5.0 | 1.25 | ***** | fast:2020 | Deployment Operator | Docker | 3 | 9.25.0 (default) | 1.24.10 | expected initial autoscaler install failure, verified that troubleshooting option 1 resolved issue |
| 4 | OOTB | AWS | 5.5.0 | 1.25 | ***** | fast:2020 | Deployment Operator | Docker | 3 | 9.25.0 (default) | 1.24.10 | expected initial autoscaler install failure, verified that troubleshooting option 2 resolved issue |