sassoftware / viya4-deployment

This project contains Ansible code that creates a baseline in an existing Kubernetes environment for use with the SAS Viya Platform, generates the manifest for an order, and then can also deploy that order into the Kubernetes environment specified.
Apache License 2.0

sas-deployment-operator-reconcile failure loop #471

Closed lancehampton closed 1 year ago

lancehampton commented 1 year ago

Viya4 Deployment Version Details

6.10.0

Ansible Variable File Details

## Cluster
NAMESPACE: dev-namespace

## MISC
DEPLOY: true # Set to false to stop at generating the manifest
LOADBALANCER_SOURCE_RANGES: ['0.0.0.0/0']

## Storage 
V4_CFG_MANAGE_STORAGE: true

## SAS Software Order
V4_CFG_ORDER_NUMBER:  <my-order-number>
# V4_CFG_DEPLOYMENT_ASSETS: ''
# V4_CFG_LICENSE: ''
# V4_CFG_CERTS: ''

## SAS API Access
V4_CFG_SAS_API_KEY: '<my-api-key>'
V4_CFG_SAS_API_SECRET: '<my-api-secret>'

## CR Access
V4_CFG_CR_USER: null
V4_CFG_CR_PASSWORD: null

## Ingress
V4_CFG_INGRESS_TYPE: ingress
V4_CFG_INGRESS_FQDN: dev-viya4.example.com
V4_CFG_TLS_MODE: "full-stack" # [full-stack|front-door|ingress-only|disabled]

## Postgres
V4_CFG_POSTGRES_SERVERS:
  default:
    internal: true
    postgres_pvc_storage_size: 10Gi
    postgres_pvc_access_mode: ReadWriteOnce
    postgres_storage_class: pg-storage
    backrest_storage_class: pg-storage
  cds-postgres:
    internal: true
    postgres_pvc_storage_size: 10Gi
    postgres_pvc_access_mode: ReadWriteOnce
    postgres_storage_class: pg-storage
    backrest_storage_class: pg-storage

## LDAP
V4_CFG_EMBEDDED_LDAP_ENABLE: true

## Consul UI
V4_CFG_CONSUL_ENABLE_LOADBALANCER: false

## SAS/CONNECT
V4_CFG_CONNECT_ENABLE_LOADBALANCER: false

## Monitoring and Logging
## uncomment and update the below values when deploying the viya4-monitoring-kubernetes stack
#V4M_BASE_DOMAIN: <base_domain>

## Viya Start and Stop Schedule
## uncomment and update the values below with CronJob schedule expressions if you would
## like to start and stop your Viya Deployment on a schedule
#V4_CFG_VIYA_START_SCHEDULE: "0 7 * * 1-5"
#V4_CFG_VIYA_STOP_SCHEDULE: "0 19 * * 1-5"

Steps to Reproduce

  1. Run viya4-iac-azure v8.0.0 release using local Terraform and specifying Kubernetes v1.25 (succeeds)
  2. Run the viya4-deployment v6.10.0 release using Docker with --tags "baseline,install", followed by --tags "viya,install". Both runs succeed (Ansible reports no failed plays). There are no customizations other than those shown in the Ansible variables above.

Expected Behavior

Remaining Viya4 objects (pods, etc.) should be provisioned.

Actual Behavior

Additional Context

Log output from sasdeployments CRD:

"messages": [ "While processing 'default', key 'SAS_SPRE_VAR_PATH_RUN' in ConfigMap 'sas-programming-environment-path-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_SPRE_VAR_PATH_SPOOL' in ConfigMap 'sas-programming-environment-path-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_SPRE_VAR_PATH_TMP' in ConfigMap 'sas-programming-environment-path-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_SPRE_APP_BATCH' in ConfigMap 'sas-programming-environment-path-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_SPRE_APP_COMPUTE' in ConfigMap 'sas-programming-environment-path-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_SPRE_APP_CONNECT' in ConfigMap 'sas-programming-environment-path-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_SPRE_VAR_PATH' in ConfigMap 'sas-programming-environment-path-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_SPRE_VAR_PATH_LOG' in ConfigMap 'sas-programming-environment-path-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 
'SAS_INIT_JRE_POLICY_FILE' in ConfigMap 'sas-programming-environment-java-policy-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_INIT_JRE_POLICY_PROPERTY' in ConfigMap 'sas-programming-environment-java-policy-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_INIT_JRE_POLICY_SOCKET' in ConfigMap 'sas-programming-environment-java-policy-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value\nWhile processing 'default', key 'SAS_INIT_JRE_POLICY_RUNTIME' in ConfigMap 'sas-programming-environment-java-policy-config' was defined by multiple components 'sas-batch, sas-compute, sas-connect, sas-job-flow-scheduling, sas-launcher' with the same value", "> kubectl apply --namespace dev-namespace --timeout 7200s -f /work/permissions/manifest.yaml\nrole.rbac.authorization.k8s.io/sas-deployment-operator-dev-namespace-sas-viya created\nclusterrole.rbac.authorization.k8s.io/sas-deployment-operator-dev-namespace-sas-viya created\nrolebinding.rbac.authorization.k8s.io/sas-deployment-operator-dev-namespace-sas-viya created\nclusterrolebinding.rbac.authorization.k8s.io/sas-deployment-operator-dev-namespace-sas-viya created\n\n> run deploy --namespace dev-namespace --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions --timeout 7200s --clusterApiNamespace --clusterApiManifest /work/cluster-api/clusterAPIManifest.yaml\n\n\n\n> start_leading dev-namespace\n\n\n> kubectl delete --namespace dev-namespace --wait --timeout 7200s --ignore-not-found configmap sas-deploy-lifecycle-operation-variables\n\n\n> kubectl create --namespace dev-namespace configmap 
sas-deploy-lifecycle-operation-variables\nconfigmap/sas-deploy-lifecycle-operation-variables created\n\n\n> run deploy-assess --namespace dev-namespace --deploymentDir /work/deploy/resources/generation --timeout 7200s --serviceAccountName sas-deployment-operator-reconcile-permissions --manifest /work/deploy/manifest.yaml\n\n\n> run assess --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions\n\n\n> run assess-cas --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml\n\n\n> set_variable cas.sas.com/operator_manifest_version 3.23.2-20221104.1667569683194\n\n\n> set_variable cas.sas.com/server_manifest_version 1.35.47-20230516.1684277559890\n\n\n\n> run assess-crunchy --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml\n\n\n\n\n\n> run deploy-assess-sitedefault --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml\nsitedefault.yaml passed the yaml validation\n\n\n> run kubernetes-kubectl-server-version-alignment-check --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml\nversion difference between client (1.23.11) and server (1.25.11) exceeds the supported minor version skew of +/-1\n\n\nStep failed: kubectl and server version combination exceed the supported skew window\n\nOperation 'kubernetes-kubectl-server-version-alignment-check' failed\n\n\n> run version-check --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml\n\nOperation 'assess' failed\n\n\n> run deploy-assess-cachelocator --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions\n\n\n\n> run deploy-assess-cacheserver --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml 
--deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions\n\n\n\n> run deploy-assess-dataserver --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions\n\n\n\n\n> run deploy-assess-pyconfig-cronjob --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions\n\n> run deploy-assess-pyconfig-execute --namespace dev-namespace\n\n\n> kubectl annotate --namespace dev-namespace configmap sas-deploy-lifecycle-operation-variables sas.com/sas-pyconfig-update=true --overwrite\nconfigmap/sas-deploy-lifecycle-operation-variables annotated\n\n\n\n\n> run deploy-assess-pyconfig-job --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions\n\n\n\n> run deploy-assess-commonfiles --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions\n\n\n\n> run deploy-assess-consul --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions\n\n\n\n> run deploy-assess-elasticsearch --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName sas-deployment-operator-reconcile-permissions\n\n\n\n> run deploy-assess-rabbitmq --namespace dev-namespace --timeout 7200s --manifest /work/deploy/manifest.yaml --deploymentDir /work/deploy/resources/generation --serviceAccountName 
sas-deployment-operator-reconcile-permissions\n\n\nOperation 'deploy-assess' failed\n\n\n> kubectl delete --namespace dev-namespace --wait --timeout 7200s --ignore-not-found configmap sas-deploy-lifecycle-operation-variables\nconfigmap \"sas-deploy-lifecycle-operation-variables\" deleted\n\n\n> stop_leading dev-namespace\n\nOperation 'deploy' failed\n\n> kubectl delete --namespace dev-namespace --timeout 7200s -f /work/permissions/manifest.yaml\nrole.rbac.authorization.k8s.io \"sas-deployment-operator-dev-namespace-sas-viya\" deleted\nwarning: deleting cluster-scoped resources, not scoped to the provided namespace\nclusterrole.rbac.authorization.k8s.io \"sas-deployment-operator-dev-namespace-sas-viya\" deleted\nrolebinding.rbac.authorization.k8s.io \"sas-deployment-operator-dev-namespace-sas-viya\" deleted\nclusterrolebinding.rbac.authorization.k8s.io \"sas-deployment-operator-dev-namespace-sas-viya\" deleted", "Operation 'reconcile-once.deploy' failed" ]

lancehampton commented 1 year ago

I found a workaround for this, though it isn't a good long-term fix for me:

  1. Set kubernetes_version = "1.24" in the viya4-iac-azure variables and redeploy the infrastructure.
  2. Re-run the viya4-deployment using Docker.

Those two steps resolved the failed kubernetes-kubectl-server-version-alignment-check in the sas-deployment-operator-reconcile pod. However, our cluster is now running K8s 1.24.x instead of a version with longer support on AKS. This means I need to plan a K8s version upgrade sooner than if I could have deployed with the desired version from the start.

lancehampton commented 1 year ago

Victory was short-lived. On 6 August 2023 Microsoft released a new version of AKS that dropped K8s 1.24 support from several regions. As a result we have had to move our test deployments to regions that have not received the new K8s version yet. At this time centralus and southcentralus are still offering 1.24, so that's all we can do until the version of kubectl inside the sas-operator-deploy-reconcile image is updated.

jarpat commented 1 year ago

Hey @lancehampton,

In the initial comment, from your error message I see the following:

version difference between client (1.23.11) and server (1.25.11) exceeds the supported minor version skew of +/-1
Step failed: kubectl and server version combination exceed the supported skew window
Operation 'kubernetes-kubectl-server-version-alignment-check' failed

That means the Viya cadence version you are deploying is not compatible with K8s 1.25.
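
The failing check compares the minor versions of the kubectl client and the Kubernetes API server and rejects any gap larger than one. Below is a minimal sketch of that comparison, assuming both versions share the same major version. The function names `minor_version` and `within_skew` are hypothetical and are not taken from the SAS deployment operator, and "1.24.10" is only an illustrative server version.

```python
import re


def minor_version(version: str) -> int:
    """Return the minor component of a version such as '1.25.11' or 'v1.25.11'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(2))


def within_skew(client: str, server: str, max_skew: int = 1) -> bool:
    """True when client and server minor versions differ by at most max_skew."""
    return abs(minor_version(client) - minor_version(server)) <= max_skew


# Versions from the log above: client 1.23.11 against server 1.25.11.
print(within_skew("1.23.11", "1.25.11"))  # False
# Illustrative 1.24.x server, matching the workaround described earlier.
print(within_skew("1.23.11", "1.24.10"))  # True
```

With the versions reported in the log, the gap is two minor versions, which is why the check fails on 1.25 and passes after the downgrade to 1.24.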

From your Ansible Variable File Details I see that V4_CFG_CADENCE_NAME and V4_CFG_CADENCE_VERSION were not specified, so the default values for those were used.

V4_CFG_CADENCE_NAME: "lts"
V4_CFG_CADENCE_VERSION: "2022.09"
# see defaults https://github.com/sassoftware/viya4-deployment/blob/main/docs/CONFIG-VARS.md#sas-software-order

According to the SAS documentation for lts 2022.09, that cadence is only compatible with K8s 1.21-1.24. Doc: https://documentation.sas.com/?cdcId=itopscdc&cdcVersion=v_032&docsetId=itopssr&docsetTarget=n1ika6zxghgsoqn1mq4bck9dx695.htm#n14svbt21hwb8jn17tg2u6pkaplw

I would recommend choosing a newer lts or stable release and setting it with those two variables in your Ansible variable file. For example, lts 2023.03 supports K8s 1.23-1.25, or you could choose stable cadence 2023.03 or newer, which also supports K8s 1.25.

See this table in the SAS Viya Platform Operations documentation for a quick reference of SAS Viya Platform Version and K8s version compatibility. https://documentation.sas.com/?cdcId=itopscdc&cdcVersion=v_042&docsetId=itopssr&docsetTarget=n1ika6zxghgsoqn1mq4bck9dx695.htm#p0nir72r7wvm6sn1wsxpkup0zso7
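
The ranges quoted in this comment can be turned into a quick pre-deployment check. The sketch below encodes only the two lts entries mentioned in this thread; the table name `SUPPORTED_K8S` and the helper functions are hypothetical, and the linked SAS documentation remains the authoritative source for any other cadence.

```python
# Supported Kubernetes (major, minor) ranges, inclusive, as quoted in this thread.
SUPPORTED_K8S = {
    ("lts", "2022.09"): ((1, 21), (1, 24)),
    ("lts", "2023.03"): ((1, 23), (1, 25)),
}


def parse_major_minor(version: str) -> tuple:
    """Turn '1.25.11' or 'v1.25.11' into (1, 25)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def is_supported(cadence_name: str, cadence_version: str, k8s_version: str) -> bool:
    """Check a cluster version against the range recorded for a cadence."""
    low, high = SUPPORTED_K8S[(cadence_name, cadence_version)]
    return low <= parse_major_minor(k8s_version) <= high


print(is_supported("lts", "2022.09", "1.25.11"))  # False
print(is_supported("lts", "2023.03", "1.25.11"))  # True
```

A cadence that is not in the table raises `KeyError`, which is deliberate: an unknown cadence should be looked up in the documentation rather than assumed compatible.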

lancehampton commented 1 year ago

Thank you very much @jarpat. Your solution resolved the issue and let us deploy to K8s v1.26. Your references helped us better understand the relevance of CADENCE versions.

We are using viya4-iac-azure and the associated ansible-iac-vars.yaml, and we didn't see V4_CFG_CADENCE_NAME or V4_CFG_CADENCE_VERSION in that variable file. We should have looked elsewhere for the defaults that apply when they are omitted.
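
One way to catch this earlier is to scan the variable file for the two cadence keys before running the deployment. The following is a minimal sketch using only the standard library: it treats any uncommented, top-level `KEY:` line as set and does not parse YAML fully. The function name `missing_cadence_vars` and the sample text are hypothetical.

```python
import re

REQUIRED = ("V4_CFG_CADENCE_NAME", "V4_CFG_CADENCE_VERSION")


def missing_cadence_vars(vars_text: str) -> list:
    """Return the required keys that are not set at the top level of the file."""
    set_keys = set()
    for line in vars_text.splitlines():
        # Commented-out or indented lines do not match, so they count as unset.
        match = re.match(r"^([A-Za-z0-9_]+)\s*:", line)
        if match:
            set_keys.add(match.group(1))
    return [key for key in REQUIRED if key not in set_keys]


sample = """\
NAMESPACE: dev-namespace
V4_CFG_ORDER_NUMBER: <my-order-number>
# V4_CFG_CADENCE_NAME: lts
"""

print(missing_cadence_vars(sample))
```

Any key this reports will fall back to the project defaults, so the result is a prompt to check those defaults against the cluster's Kubernetes version.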