FYI, my CAS cluster would not start because of this ...
jumpuser@mTES-TT-jump-vm:~/viya4-deployment$ kk describe pod sas-cas-server-default-controller
<<<>>>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 3m40s (x156 over 109m) kubelet, aks-cas-20262631-vmss000000 (combined from similar events): MountVolume.SetUp failed for volume "nfs-homes" : mount failed: exit status 32
Did you verify that the directories and mount points were created correctly on the NFS/Jump servers? You can create the locations yourself; however, you need to ensure that the permissions are correct. On the Jump server you'll find the /viya-share directory, not the /export directory; the /export directory is created on the NFS server.
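(For anyone checking this by hand, a minimal verification from the jump VM might look like the following; the server address and export path are placeholders, and it assumes the nfs-common package is installed.)
# ask the NFS server what it is exporting
showmount -e <nfs-server-ip>
# reproduce the kubelet's mount attempt manually; exit status 32 here means the same failure
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs <nfs-server-ip>:<export-path> /mnt/nfs-test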
It seems that Ansible on the jump VM may not have been able to SSH back to itself to create the directories for the mounts ... still investigating.
Since I am using the Docker command for the deployment instead of Ansible directly, I am struggling to figure out how to configure Ansible, which isn't installed locally but runs inside the deployment container, to SSH back to the jump host. Maybe sleep and coffee will help.
I modified the Ansible configuration file at ~/viya4-deployment/ansible.cfg and uncommented the line so that the Ansible logs would be written out to ./ansible.log, but that did not work, so I suspect that the Docker container has its own ansible.cfg which still has logging commented out. Unless someone has a better idea, I'll retry with a pipe of the log into a file.
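One idea that may work (I have not verified it with this container): Ansible honors the ANSIBLE_LOG_PATH environment variable, and environment variables take precedence over ansible.cfg, so passing it through Docker should write the log to the mounted /data volume without touching the container's config:
docker run --rm \
  --env ANSIBLE_LOG_PATH=/data/ansible.log \
  ... # remaining volumes and arguments as in the full command below
Since /data maps to $HOME/viya4-deployment on the host, the log would land at ~/viya4-deployment/ansible.log.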
The 4 CAS pods are still not starting up ... I am going to try a clean / fresh deployment with different ansible-vars.yaml values for the jump-host.
Yesterday, I re-ran the deployment from a clean / fresh IAC with the following command to redirect the output to a file; note that I also pre-created the directories in question:
########
#deploy everything with Viya
########
# prereqs
sudo mkdir -p /viya-share/pvs/mtes-tt/astores
sudo mkdir -p /viya-share/pvs/mtes-tt/bin
sudo mkdir -p /viya-share/pvs/mtes-tt/data
sudo mkdir -p /viya-share/pvs/mtes-tt/homes
sudo chmod -R 777 /viya-share/pvs/mtes-tt
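# nobody:nogroup is what NFS root_squash maps root to, hence this ownership on the share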
sudo chown -R nobody:nogroup /viya-share/pvs/mtes-tt
# deploy it
docker run --rm \
--group-add root \
--user $(id -u):$(id -g) \
--volume $HOME/viya4-deployment:/data \
--volume $HOME/viya4-deployment/deployments/MTES-TT-cluster/ansible-vars.yaml:/config/config \
--volume $HOME/viya4-deployment/deployments/MTES-TT-cluster/IAC_files/terraform.tfstate:/config/tfstate \
--volume $HOME/viya4-deployment/deployments/MTES-TT-cluster/MTES-TT-ns/site-config/sitedefault.yaml:/config/v4_cfg_sitedefault \
viya4-deployment --tags "baseline,viya,cluster-logging,cluster-monitoring,install" > docker.ansible.log
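Note that a plain > only captures stdout; to also capture anything Ansible writes to stderr, the redirect would be:
viya4-deployment --tags "baseline,viya,cluster-logging,cluster-monitoring,install" > docker.ansible.log 2>&1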
The expected tasks for the jump-host in the logs are:
jumpuser@mTES-TT-jump-vm:~/viya4-deployment$ grep -ir "jump-server" .
<truncated>
./roles/jump-server/tasks/main.yml:- name: jump-server - add host
./roles/jump-server/tasks/main.yml:- name: jump-server - lookup groups
./roles/jump-server/tasks/main.yml:- name: jump-server - create folders
<truncated>
All 3 tasks are missing from the log:
jumpuser@mTES-TT-jump-vm:~/viya4-deployment$ grep -i "jump-server" docker.ansible.log
jumpuser@mTES-TT-jump-vm:~/viya4-deployment$
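For what it's worth, listing every task that did run can show where the play stopped or which roles were skipped; these markers are standard ansible-playbook output:
grep -n "TASK \[" docker.ansible.log
grep -n -A 5 "PLAY RECAP" docker.ansible.log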
I have the following configuration in ansible-vars.yaml. I suppose commenting these out could be breaking something, but these values should be parsed from the IAC tfstate, so leaving them commented out seems like a valid configuration:
<truncated>
## Jump Server: https://github.com/sassoftware/viya4-deployment/blob/main/docs/CONFIG-VARS.md#jump-server
# JUMP_SVR_HOST: # automatically parsed and pulled from the IAC tfstate
## NFS
#JUMP_SVR_RWX_FILESTORE_PATH # not used since the IAC created `/viya-share/pvs`
<truncated>
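For reference, the explicit (non-parsed) form would presumably look something like this, using the variable names from the CONFIG-VARS.md page linked above (the values are hypothetical placeholders):
JUMP_SVR_HOST: "<jump-vm-public-ip>"
JUMP_SVR_RWX_FILESTORE_PATH: /viya-share/pvs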
I believe running the deployment from the jump-host provisioned by the IAC should be supported, and I will keep reviewing how this can be done.
Did you pass in the ssh key information for your jump box when you ran docker? I've seen this behavior when that information was missing. e.g. --volume $HOME/.ssh/id_rsa:/config/jump_svr_private_key \
Thanks @hahewlet, that makes sense and I will try that next (I was about to build it again).
I am used to telling Ansible which private key to use, but could not find documentation in this project on how to pass the key information.
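For anyone else landing here: combining the command I posted above with the volume mount @hahewlet suggested gives something like this (it assumes the private key has already been copied to the machine running Docker):
docker run --rm \
  --group-add root \
  --user $(id -u):$(id -g) \
  --volume $HOME/viya4-deployment:/data \
  --volume $HOME/viya4-deployment/deployments/MTES-TT-cluster/ansible-vars.yaml:/config/config \
  --volume $HOME/viya4-deployment/deployments/MTES-TT-cluster/IAC_files/terraform.tfstate:/config/tfstate \
  --volume $HOME/viya4-deployment/deployments/MTES-TT-cluster/MTES-TT-ns/site-config/sitedefault.yaml:/config/v4_cfg_sitedefault \
  --volume $HOME/.ssh/id_rsa:/config/jump_svr_private_key \
  viya4-deployment --tags "baseline,viya,cluster-logging,cluster-monitoring,install" > docker.ansible.log 2>&1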
This looks promising; it's running some of the jump-server tasks:
jumpuser@mTES-TT-jump-vm:~/viya4-deployment$ tail -f docker.ansible.log
Wednesday 05 May 2021 18:03:58 +0000 (0:00:00.038) 0:00:02.874 *********
TASK [common : Set DEPLOY_DIR] *************************************************
ok: [localhost]
Wednesday 05 May 2021 18:03:58 +0000 (0:00:00.041) 0:00:02.916 *********
Wednesday 05 May 2021 18:03:58 +0000 (0:00:00.071) 0:00:02.987 *********
TASK [jump-server : jump-server - add host] ************************************
changed: [localhost]
Wednesday 05 May 2021 18:03:58 +0000 (0:00:00.062) 0:00:03.049 *********
One thing you have here is a chicken-and-egg scenario. The Jump server does not exist until after the IAC code base has completed. If you're using the Jump server as your box for deployment, why would you not use the box used to create the IAC in the first place? And if you're using a third VM you've stood up in the cloud provider, that VM's address would be the CIDR value you'd add to the cidr block in your tfvars file. Asking for clarity.
I believe this may be resolved, but it should stay open a tad longer while I verify.
In the event that another admin does not have all prerequisites (e.g. Docker) on the host that they run the IAC from, and that admin performs the remaining deployment steps from the newly created jump-server, there are two things you need to account for:
1) you will need to add a network security rule allowing the jump-server public IP to SSH back to itself (you can do this programmatically with the Az CLI or Terraform, or manually in the portal; a rough Az CLI sketch follows this list)
2) you need to copy (scp / SFTP / etc.) the private key over to the jump-server and then add the line @hahewlet noted above into the Docker command for the deployment.
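For step 1, a rough Az CLI sketch (the resource group, NSG name, and IP are placeholders for illustration):
az network nsg rule create \
  --resource-group <rg-name> \
  --nsg-name <jump-vm-nsg> \
  --name AllowJumpSelfSSH \
  --priority 200 \
  --direction Inbound --access Allow --protocol Tcp \
  --source-address-prefixes <jump-public-ip>/32 \
  --destination-port-ranges 22
# and for step 2, copy the key over and lock its permissions down
scp ~/.ssh/id_rsa jumpuser@<jump-public-ip>:~/.ssh/id_rsa
ssh jumpuser@<jump-public-ip> 'chmod 600 ~/.ssh/id_rsa'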
With these two things in place, I seem to have the correct structure that Viya needs to start up correctly:
# for example:
jumpuser@mTES-TT-jump-vm:/viya-share/mtes-tt$ ll
total 24
drwxrwxrwx 6 nobody nogroup 4096 May 5 19:57 ./
drwxrwxrwx 5 nobody nogroup 4096 May 5 19:57 ../
drwxrwxrwx 2 nobody nogroup 4096 May 5 19:57 astores/
drwxrwxrwx 2 nobody nogroup 4096 May 5 19:57 bin/
drwxrwxrwx 2 nobody nogroup 4096 May 5 19:57 data/
drwxrwxrwx 2 nobody nogroup 4096 May 5 19:57 homes/
Also, confirmation from the Docker / Ansible log:
jumpuser@mTES-TT-jump-vm:~/viya4-deployment$ grep jump-server docker.ansible.log
TASK [jump-server : jump-server - add host] ************************************
TASK [jump-server : jump-server - lookup groups] *******************************
TASK [jump-server : jumps-server - group nogroup] ******************************
TASK [jump-server : jump-server - create folders] ******************************
jump-server : jump-server - lookup groups ------------------------------- 0.97s
jump-server : jump-server - create folders ------------------------------ 0.83s
jumpuser@mTES-TT-jump-vm:~/viya4-deployment$
The CAS pods have started and are healthy. Closing this issue for today; I can revisit tomorrow if something else appears awry:
jumpuser@mTES-TT-jump-vm:~/viya4-deployment$ kk get pods | grep cas-s
sas-cas-server-default-controller 3/3 Running 0 35m
sas-cas-server-default-worker-0 3/3 Running 0 35m
sas-cas-server-default-worker-1 3/3 Running 0 35m
sas-cas-server-default-worker-2 3/3 Running 0 35m
jumpuser@mTES-TT-jump-vm:~/viya4-deployment$
I just ran the IAC (Azure / cloned a few weeks ago) and the deployment project (cloned today).
The /export/../.. was not created:
Following the information from another issue, I manually created the export as a test:
I am not deploying LDAP since I will use SCIM instead, so the sitedefault.yaml only contains the sasboot information and the config to disable LDAP.
These are the relevant config lines in the ansible-vars.yaml: