Closed: SASCloudLearner closed this issue 1 month ago.
Hey @SASCloudLearner,
To give us more context for diagnosing your issue, could you provide the ansible-vars.yaml that was generated using this project (viya4-iac-k8s)? And if you are using the viya4-deployment project to perform your Viya deployment, could you also provide the ansible-vars.yaml you used there?
Below is from the IaC ansible-vars file. The SAS ansible-vars file is what I reported at https://github.com/sassoftware/viya4-deployment/issues/548. Please let me know if this information provides sufficient detail.
ansible_user     = "ansible"
ansible_password = "xxxxxxxxxxx"
prefix           = "ILT"        # Infra prefix
gateway          = "10.0.x.xxx" # Gateway for servers
netmask          = 21           # Network interface netmask

vsphere_server        = "abc.X0.kk.lan"         # Name of the vSphere server
vsphere_datacenter    = "Datacenter"            # Name of the vSphere data center
vsphere_datastore     = "XX_XX_XX"              # Name of the vSphere data store to use for the VMs
vsphere_resource_pool = "K8S-SAS-XX"            # Name of the vSphere resource pool to use for the VMs
vsphere_folder        = "XX/XXX/K8S-SAS/Dev"    # Name of the vSphere folder to store the VMs
vsphere_template      = "Template Ubuntu 22.04" # Name of the VM template to clone to create VMs for the cluster
vsphere_network       = "XXXXXXX_3000"          # Name of the network to use for the VMs

system_ssh_keys_dir = "~/.ssh/XXX" # Directory holding public keys to be used on each system

cluster_version        = "1.27.11"         # Kubernetes Version
cluster_cni            = "calico"          # Kubernetes Container Network Interface (CNI)
cluster_cni_version    = "3.27.2"          # Kubernetes CNI Version
cluster_cri            = "containerd"      # Kubernetes Container Runtime Interface (CRI)
cluster_cri_version    = "1.6.28"          # Kubernetes CRI Version
cluster_service_subnet = "10.255.0.0/17"   # Kubernetes Service Subnet
cluster_pod_subnet     = "10.255.128.0/17" # Kubernetes Pod Subnet
cluster_domain         = "X9.XXXX.lan"     # Cluster domain suffix for DNS

cluster_vip_version = "0.7.1"
cluster_vip_ip      = "10.0.X.XXX"
cluster_vip_fqdn    = "sas-kube-vip-dev.X9.XXXX.lan"

cluster_lb_type = "metallb" # Load Balancer accepted values [kube_vip,metallb]
# Load balancer addresses, by load balancer type:
#   metallb  - an address range
#   kube_vip - that IP must fall within the network
#
# cluster_lb_addresses = ["10.0.X.124-10.0.X.129"]
control_plane_ssh_key_name = "XX_ssh"
# Node pools
node_pools = {
  control_plane = {
    count        = 3
    cpus         = 2
    memory       = 4096
    os_disk      = 100
    ip_addresses = [
      "10.0.X.114", "10.0.X.125", "10.0.X.126",
    ]
    node_taints = []
    node_labels = {}
  },
  system = {
    count        = 1
    cpus         = 8
    memory       = 65536
    os_disk      = 100
    ip_addresses = [
      "10.0.X.127",
    ]
    node_taints = []
    node_labels = {
      "kubernetes.azure.com/mode" = "system" # REQUIRED LABEL - DO NOT REMOVE
    }
  },
  cas = {
    count        = 3
    cpus         = 8
    memory       = 196608
    os_disk      = 100
    misc_disks   = [300]
    ip_addresses = [
      "10.0.x.120", "10.0.x.121", "10.0.x.122",
    ]
    node_taints = ["workload.sas.com/class=cas:NoSchedule"]
    node_labels = {
      "workload.sas.com/class" = "cas"
    }
  },
  compute = {
    cpus         = 8
    memory       = 65536
    os_disk      = 100
    ip_addresses = [
      "10.0.3.130",
    ]
    node_taints = ["workload.sas.com/class=compute:NoSchedule"]
    node_labels = {
      "workload.sas.com/class"          = "compute"
      "launcher.sas.com/prepullImage"   = "sas-programming-environment"
    }
  },
  stateful = {
    cpus         = 8
    memory       = 65536
    os_disk      = 100
    ip_addresses = [
      "10.0.x.101",
    ]
    node_taints = ["workload.sas.com/class=stateful:NoSchedule"]
    node_labels = {
      "workload.sas.com/class" = "stateful"
    }
  },
  stateless = {
    cpus         = 16
    memory       = 131072
    os_disk      = 100
    misc_disks   = [150]
    ip_addresses = [
      "10.0.X.102",
    ]
    node_taints = ["workload.sas.com/class=stateless:NoSchedule"]
    node_labels = {
      "workload.sas.com/class" = "stateless"
    }
  }
}
create_jump    = false        # Creation flag
jump_num_cpu   = 4            # 4 CPUs
jump_memory    = 8092         # 8 GB
jump_disk_size = 100          # 100 GB
jump_ip        = "10.0.x.111" # Assigned values for static IPs

create_nfs    = true          # Creation flag
nfs_num_cpu   = 8             # 8 CPUs
nfs_memory    = 16384         # 16 GB
nfs_disk_size = 2000          # 2 TB
nfs_ip        = "10.0.x.112"  # Assigned values for static IPs

postgres_servers = {
  default = {
    server_num_cpu         = 8            # 8 CPUs
    server_memory          = 16384        # 16 GB
    server_disk_size       = 250          # 250 GB
    server_ip              = "10.0.x.113" # Assigned values for static IPs
    server_version         = 13           # PostgreSQL version
    server_ssl             = "off"        # SSL flag
    administrator_login    = "postgres"   # PostgreSQL admin user - CANNOT BE CHANGED
    administrator_password = "xxxxxxxxx"  # PostgreSQL admin user password
  }
}
Going off your ansible-vars.yaml from the viya4-deployment issue, I believe the problem is with your INGRESS_NGINX_CONFIG, which is commented out. Depending on the load balancer type you chose, kube_vip vs metallb, you will need to adjust externalTrafficPolicy accordingly. We have this documented here: https://github.com/sassoftware/viya4-iac-k8s/blob/main/docs/REQUIREMENTS.md#deployment
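For reference, in viya4-deployment the values under INGRESS_NGINX_CONFIG are passed through to the ingress-nginx Helm chart, which exposes the setting as controller.service.externalTrafficPolicy. A sketch of where it lives (the value shown is a placeholder; which policy matches which load balancer type is covered by the doc linked above):

```yaml
# viya4-deployment ansible-vars.yaml (sketch)
INGRESS_NGINX_CONFIG:
  controller:
    service:
      externalTrafficPolicy: Local # placeholder; use the value REQUIREMENTS.md specifies for your LB type
```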
Side note: you can surround text with triple backticks in comments so it's formatted as a code block and the YAML you paste gets rendered correctly; at the moment it's hard to read. For example:
foo:
bar: value
bar2: value2
Hi @jarpat, today our infra team had to redeploy the cluster with MetalLB instead of kube-vip, so I haven't changed it back, but usually we set this value accordingly. The issue still occurs either way. Our infra team also tried deploying only the ingress on a different cluster and ran into a similar issue: they were not able to resolve the service name to an IP. Also, when I checked, I found that a reverse nslookup works, i.e. the IP resolves to the service DNS name. Any idea what could be going wrong? Do you see anything wrong in the IaC ansible-vars details I shared?
This is a sample command of what I meant:

kubectl run --rm -it --image curlimages/curl dns-test --restart=Never -- nslookup cert-manager-webhook.cert-manager.svc
Server:  10.255.0.zz
Address: 10.255.0.zz:53

** server can't find cert-manager-webhook.cert-manager.svc: NXDOMAIN
@jarpat I figured out the issue but not the solution, so I want to check whether you are aware of it. When I run the command kubectl run --rm -it --image curlimages/curl dns-test --restart=Never -- nslookup cert-manager-webhook.cert-manager.svc.cluster.local, i.e. including the default domain cluster.local, it does resolve to an IP address. So I understand that CoreDNS is not able to resolve the short names. Any advice on this issue?
Or should we instead be able to update the cert-manager webhook service to register with the domain name?
Or could something else be going wrong?
Thanks in advance! Raghu
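For context on the short-name behavior: whether a name like cert-manager-webhook.cert-manager.svc resolves is decided by the pod's /etc/resolv.conf search list and ndots option, not by CoreDNS registering extra records. A minimal sketch of the expansion a resolver performs, using an assumed, typical pod search list (not values taken from this cluster):

```python
# Sketch of how a glibc/musl-style resolver expands a short name using
# the search list and the ndots option. Kubernetes normally writes
# "options ndots:5" plus three search suffixes into pod resolv.conf;
# the list below assumes a pod in the "default" namespace.
NDOTS = 5
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]

def candidate_queries(name: str) -> list[str]:
    """Return the absolute names the resolver will try, in order."""
    if name.endswith("."):            # trailing dot: already fully qualified
        return [name.rstrip(".")]
    candidates = []
    if name.count(".") >= NDOTS:      # "enough" dots: try as-is first
        candidates.append(name)
    candidates += [f"{name}.{d}" for d in SEARCH]
    if name.count(".") < NDOTS:       # otherwise try as-is last
        candidates.append(name)
    return candidates

# "cert-manager-webhook.cert-manager.svc" has only 2 dots (< ndots:5),
# so it only resolves if appending a search suffix yields a real record.
for q in candidate_queries("cert-manager-webhook.cert-manager.svc"):
    print(q)
```

With the standard search list, the cluster.local suffix produces exactly the FQDN that did resolve in the test above, so a pod whose resolv.conf is missing the cluster.local search entries (or pointing at an upstream DNS server) would show exactly this NXDOMAIN-on-short-name behavior. Checking kubectl exec dns-test -- cat /etc/resolv.conf from an affected pod would confirm or rule this out.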
I tried deploying with infrastructure similar to yours and had no luck recreating the issue. Going off the configuration you posted, could it be that your control_plane IPs and cluster_lb_addresses are conflicting?
From above:
# ...truncated
cluster_lb_addresses = ["10.0.X.124-10.0.X.129"]
control_plane = {
count = 3
cpus = 2
memory = 4096
os_disk = 100
ip_addresses = [
"10.0.X.114","10.0.X.125","10.0.X.126",
]
node_taints = []
node_labels = {}
}
# ...truncated
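The overlap is easy to check mechanically. A small sketch that parses MetalLB's "first-last" address-range notation; the concrete addresses are hypothetical stand-ins for the masked "10.0.X.*" values above (X replaced with 3 purely for illustration):

```python
import ipaddress

def ips_in_range(ips: list[str], lb_range: str) -> list[str]:
    """Return which of the given node IPs fall inside a
    MetalLB-style 'first-last' address range."""
    first, last = (ipaddress.ip_address(a) for a in lb_range.split("-"))
    return [ip for ip in ips if first <= ipaddress.ip_address(ip) <= last]

# Hypothetical stand-ins for the masked addresses in the config above.
control_plane_ips = ["10.0.3.114", "10.0.3.125", "10.0.3.126"]
lb_range = "10.0.3.124-10.0.3.129"

print(ips_in_range(control_plane_ips, lb_range))
# -> ['10.0.3.125', '10.0.3.126']
```

If the masked values follow the same pattern, two of the three control-plane IPs sit inside the MetalLB pool, meaning MetalLB could assign a Service an address already held by a control-plane node.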
It's been 30 days since the last response, marking as stale.
Terraform Version Details
No response
Terraform Variable File Details
No response
Ansible Variable File Details
No response
Steps to Reproduce
Deploy an environment using this project on vSphere. Used kube_vip and also redeployed using metallb, with the same issue. Deployed SAS or NGINX and saw the same issue.
1 dispatcher.go:217] Failed calling webhook, failing closed validate.nginx.ingress.kubernetes.io: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=10s": Unknown Host
Expected Behavior
works as expected
Actual Behavior
1 dispatcher.go:217] Failed calling webhook, failing closed validate.nginx.ingress.kubernetes.io: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=10s": Unknown Host
Additional Context
No response
References
No response
Code of Conduct