sassoftware / viya4-deployment

This project contains Ansible code that creates a baseline in an existing Kubernetes environment for use with the SAS Viya Platform, generates the manifest for an order, and then can also deploy that order into the Kubernetes environment specified.
Apache License 2.0

sas-cas-server in Init Status after Install #235

Closed lorenzk1213 closed 10 months ago

lorenzk1213 commented 2 years ago

After running the deployment to install Viya 4, all pods seem to be running fine

[screenshot]

except for sas-cas-server-default, which is stuck in Init status: [screenshot]

This is seen in the describe output

[screenshot]

Appreciate any suggestions.

Thanks,

thpang commented 2 years ago

It's telling you it cannot find the nfs-homes or nfs-data volumes. Do you have an NFS server set up in your cluster? If so, did the IAC code base set it up, or did you bring your own? If you have no NFS server, what is your storage type, and how did you map that information in your Ansible vars file?

lorenzk1213 commented 2 years ago

Hi @thpang

We are using EFS in this environment, and it was provisioned separately, not with the IAC code. We map it by setting `V4_CFG_MANAGE_STORAGE: false` in our Ansible vars YAML file.
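For reference, a minimal sketch of the relevant ansible-vars.yaml entries when bringing your own NFS/EFS storage (variable names are from the viya4-deployment docs; the endpoint and path values below are placeholders, not our actual values):

```yaml
# Sketch of an ansible-vars.yaml storage section for bring-your-own NFS/EFS.
# The endpoint and path are placeholders -- substitute your own.
V4_CFG_MANAGE_STORAGE: false
V4_CFG_RWX_FILESTORE_ENDPOINT: fs-0123456789abcdef0.efs.us-east-1.amazonaws.com
V4_CFG_RWX_FILESTORE_PATH: /viya-share
```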

lorenzk1213 commented 2 years ago

To add: one thing we are wondering about is where this path could be defined in the scripts.

Our storage path is `/viya-share/non-prod-dev-viya-ns`, but the deployment seems to be reading it as `//non-prod-non-prod-dev-viya-ns/viya-share/non-prod-dev-viya-ns`.

[screenshot]

thpang commented 2 years ago

If you look in the Terraform output, you'll see these entries for your filestore: `rwx_filestore_endpoint` and `rwx_filestore_path`. These are the values the pods use to mount your storage.
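Those two values can be read back with `terraform output`; a minimal sketch (the endpoint and path values here are stand-ins for whatever your state actually contains):

```shell
# Stand-in values; in a real run you would capture them from terraform, e.g.:
#   endpoint=$(terraform output -raw rwx_filestore_endpoint)
#   path=$(terraform output -raw rwx_filestore_path)
endpoint="10.0.1.5"
path="/export"

# The pods mount the share as <endpoint>:<path>
echo "${endpoint}:${path}"   # prints 10.0.1.5:/export
```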

lorenzk1213 commented 2 years ago

@thpang

These are the values from our Terraform output for the two entries.

[screenshot]

It seems `rwx_filestore_path` is configured as the `/` directory, but that is still different from the path the deployment reads, which is `//non-prod-non-prod-dev-viya-ns`.

Is it possible we have missed something?

Thanks,

ceciivanov commented 2 years ago

Hello, I have the exact same problem, but I haven't changed anything in the Terraform vars, nor did I bring up a custom NFS storage server. The Terraform output for `rwx_filestore_endpoint` is an IP inside the subnet of our VNet, and `rwx_filestore_path` is `/export`. What could the error be here?

lorenzk1213 commented 2 years ago

@cceeci99 May I see the output of these commands:

`kubectl get pods -n <namespace>` and `kubectl describe pod sas-cas-server-default-controller -n <namespace>`

ceciivanov commented 2 years ago

Yeah, with those commands I got the exact same description. I think the problem is that the pods can't find the NFS storage to mount the volumes.

thpang commented 2 years ago

The `//` comes from a variable not being set that would otherwise include `/viya-share`, so I'm not quite sure what's amiss here. We never create paths with a `//` prefix.
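As a tiny illustration of how an unset variable can produce that `//` prefix (the variable names here are hypothetical, not the actual ones in the playbooks):

```shell
# Hypothetical reconstruction: if the path is built as /<share>/<namespace>
# and the share variable is empty, the result starts with //.
viya_share=""                       # empty -- should have held "viya-share"
namespace="non-prod-dev-viya-ns"
echo "/${viya_share}/${namespace}"  # prints //non-prod-dev-viya-ns
```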

ceciivanov commented 2 years ago

I found the problem: if you configure Terraform to create its own network security group, you must specify `vm_public_access_cidrs` so it can open an SSH port to the jump user. Then, when running the deployment with Ansible, it can connect to the jump user and create the directories that are needed. If it can't connect, Ansible somehow skips that step without erroring but still creates all the deployments, and that's why the sas-cas-controller could not mount the volumes.
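For anyone hitting the same thing, a hedged sketch of the IAC .tfvars entry in question (the CIDR below is a documentation placeholder; restrict it to the address range your Ansible host actually uses):

```terraform
# Placeholder CIDR -- narrow this to the range your Ansible/jump host uses.
vm_public_access_cidrs = ["203.0.113.0/24"]
```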

thpang commented 2 years ago

That is correct. If Ansible has no access to the VMs, it will simply skip those steps without erroring; it assumes you've handled those directories outside of the DAC code base. Did you set the `default_public_access_cidr` value, or did you leave it empty?
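If Ansible can't reach the jump VM, one workaround is to create the directories by hand on the mounted export. A sketch under assumptions: the subdirectory names are inferred from the `nfs-homes`/`nfs-data` volume names in the describe output, and the root path stands in for your mounted NFS export; verify the exact layout your release expects against the viya4-deployment docs.

```shell
# Sketch only: export_root stands in for your mounted NFS export, and the
# subdirectory names are assumptions inferred from the nfs-homes / nfs-data
# volume names -- verify against your deployment before relying on this.
export_root="/tmp/export-demo"
namespace="non-prod-dev-viya-ns"
mkdir -p "${export_root}/${namespace}"/{homes,data,astores,bin}
ls "${export_root}/${namespace}"
```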

lorenzk1213 commented 2 years ago

@cceeci99 @thpang

The `default_public_access_cidr` in our IAC .tfvars file is empty. Is it required to have a value?

[screenshot]

Also we did not configure the IAC to create a jumpbox as this was already provided by customer separately.

ceciivanov commented 2 years ago

@lorenzk1213 I have this at its default, meaning empty, because I do not want my resources to be accessible from any other IPs; I have a VPN in my resource group, and everything in the VNet will have access to them.

dhoucgitter commented 10 months ago

Marking as stale/inactive. If there are further questions please open a new GitHub issue.