sassoftware / viya4-iac-k8s

This project contains Terraform scripts to provision cloud infrastructure resources when using vSphere, and Ansible to apply the needed elements of a Kubernetes cluster that are required to deploy SAS Viya platform product offerings.
Apache License 2.0

DNS resolution fails. Tried both kube-vip and MetalLB #122

Closed: SASCloudLearner closed this issue 1 month ago

SASCloudLearner commented 2 months ago

Terraform Version Details

No response

Terraform Variable File Details

No response

Ansible Variable File Details

No response

Steps to Reproduce

Deployed the environment using this project on vSphere. Used kube-vip, and also redeployed using MetalLB, but hit the same issue. Deployed SAS Viya or only NGINX and see the same issue.

1 dispatcher.go:217] Failed calling webhook, failing closed validate.nginx.ingress.kubernetes.io: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=10s": Unknown Host
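For reference, a quick hedged check (not part of the original report) to confirm whether the webhook hostname from that error resolves at all inside the cluster; the pod name dns-check is arbitrary:

kubectl run --rm -it --image curlimages/curl dns-check --restart=Never -- \
  nslookup ingress-nginx-controller-admission.ingress-nginx.svc
# If this returns NXDOMAIN, the "Unknown Host" in the webhook call points to a
# cluster DNS problem rather than an ingress-nginx problem.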

Expected Behavior

works as expected

Actual Behavior

1 dispatcher.go:217] Failed calling webhook, failing closed validate.nginx.ingress.kubernetes.io: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=10s": Unknown Host

Additional Context

No response

References

No response


jarpat commented 2 months ago

Hey @SASCloudLearner,

To help get some more context to diagnose your issue, could you provide the ansible-vars.yaml that was generated using this project (viya4-iac-k8s)? And if you are using the viya4-deployment project to perform your Viya deployment, could you also provide the ansible-vars.yaml you used there?

SASCloudLearner commented 2 months ago

Below is from the IAC ansible-vars file. The SAS Ansible vars file is what I reported at https://github.com/sassoftware/viya4-deployment/issues/548. Please let me know if this information provides sufficient detail.

General items

ansible_user     = "ansible"
ansible_password = "xxxxxxxxxxx"
prefix           = "ILT"         # Infra prefix
gateway          = "10.0.x.xxx"  # Gateway for servers
netmask          = 21            # Network interface netmask

vSphere

vsphere_server        = "abc.X0.kk.lan"          # Name of the vSphere server
vsphere_datacenter    = "Datacenter"             # Name of the vSphere data center
vsphere_datastore     = "XX_XX_XX"               # Name of the vSphere data store to use for the VMs
vsphere_resource_pool = "K8S-SAS-XX"             # Name of the vSphere resource pool to use for the VMs
vsphere_folder        = "XX/XXX/K8S-SAS/Dev"     # Name of the vSphere folder to store the VMs
vsphere_template      = "Template Ubuntu 22.04"  # Name of the VM template to clone to create VMs for the cluster
vsphere_network       = "XXXXXXX_3000"           # Name of the network to use for the VMs

Systems

system_ssh_keys_dir = "~/.ssh/XXX" # Directory holding public keys to be used on each system

Kubernetes - Cluster

cluster_version        = "1.27.11"          # Kubernetes Version
cluster_cni            = "calico"           # Kubernetes Container Network Interface (CNI)
cluster_cni_version    = "3.27.2"           # Kubernetes Container Network Interface (CNI) Version
cluster_cri            = "containerd"       # Kubernetes Container Runtime Interface (CRI)
cluster_cri_version    = "1.6.28"           # Kubernetes Container Runtime Interface (CRI) Version
cluster_service_subnet = "10.255.0.0/17"    # Kubernetes Service Subnet
cluster_pod_subnet     = "10.255.128.0/17"  # Kubernetes Pod Subnet
cluster_domain         = "X9.XXXX.lan"      # Cluster domain suffix for DNS

Kubernetes - Cluster VIP

cluster_vip_version = "0.7.1"
cluster_vip_ip      = "10.0.X.XXX"
cluster_vip_fqdn    = "sas-kube-vip-dev.X9.XXXX.lan"

Kubernetes - Load Balancer

Load Balancer Type

cluster_lb_type = "metallb" # Load Balancer accepted values [kube_vip,metallb]

Load Balancer Addresses

#
#   Examples for each load balancer type can be found here:
#
#   kube-vip address format : https://kube-vip.io/docs/usage/cloud-provider/#the-kube-vip-cloud-provider-configmap
#   MetalLB address format  : https://metallb.universe.tf/configuration/#layer-2-configuration
#
#   kube-vip sample:
#
#   cluster_lb_addresses = [
#     "cidr-default: 192.168.0.200/29",                  # CIDR-based IP range for use in the default Namespace
#     "range-development: 192.168.0.210-192.168.0.219",  # Range-based IP range for use in the development Namespace
#     "cidr-finance: 192.168.0.220/29,192.168.0.230/29", # Multiple CIDR-based ranges for use in the finance Namespace
#     "cidr-global: 192.168.0.240/29"                    # CIDR-based range which can be used in any Namespace
#   ]
#
#   MetalLB sample:
#
#   cluster_lb_addresses = [
#     "192.168.10.0/24",
#     "192.168.9.1-192.168.9.5"
#   ]
#
#   NOTE: If you are assigning a static IP using the loadBalancerIP value for your
#         load balancer controller service when using metallb, that IP must fall
#         within the address range you provide below. If you are using kube_vip,
#         you do not have this limitation.
#
cluster_lb_addresses = ["10.0.X.124-10.0.X.129"]

Control plane node shared ssh key name

control_plane_ssh_key_name = "XX_ssh"

Cluster Node Pools config

#
#   Your node pools must contain at least 3 or more nodes.
#   The required node types are:
#
#   * control_plane - Having an odd number 3/5/7... ensures
#                     HA while using kube-vip
#   * system        - System node pool to run misc pods, etc.
#   * cas           - CAS Nodes
#
#   * <node type>   - Any number of node types with unique names.
#                     These are typically: compute, stateful, and
#                     stateless.
#
node_pools = {
  # REQUIRED NODE TYPE - DO NOT REMOVE and DO NOT CHANGE THE NAME
  # Other variables may be altered
  control_plane = {
    count        = 3
    cpus         = 2
    memory       = 4096
    os_disk      = 100
    ip_addresses = ["10.0.X.114", "10.0.X.125", "10.0.X.126"]
    node_taints  = []
    node_labels  = {}
  },
  # REQUIRED NODE TYPE - DO NOT REMOVE and DO NOT CHANGE THE NAME
  # Other variables may be altered
  system = {
    count        = 1
    cpus         = 8
    memory       = 65536
    os_disk      = 100
    ip_addresses = ["10.0.X.127"]
    node_taints  = []
    node_labels  = {
      "kubernetes.azure.com/mode" = "system" # REQUIRED LABEL - DO NOT REMOVE
    }
  },
  cas = {
    count        = 3
    cpus         = 8
    memory       = 196608
    os_disk      = 100
    misc_disks   = [300]
    ip_addresses = ["10.0.x.120", "10.0.x.121", "10.0.x.122"]
    node_taints  = ["workload.sas.com/class=cas:NoSchedule"]
    node_labels  = {
      "workload.sas.com/class" = "cas"
    }
  },
  compute = {
    cpus         = 8
    memory       = 65536
    os_disk      = 100
    ip_addresses = ["10.0.3.130"]
    node_taints  = ["workload.sas.com/class=compute:NoSchedule"]
    node_labels  = {
      "workload.sas.com/class"        = "compute"
      "launcher.sas.com/prepullImage" = "sas-programming-environment"
    }
  },
  stateful = {
    cpus         = 8
    memory       = 65536
    os_disk      = 100
    ip_addresses = ["10.0.x.101"]
    node_taints  = ["workload.sas.com/class=stateful:NoSchedule"]
    node_labels  = {
      "workload.sas.com/class" = "stateful"
    }
  },
  stateless = {
    cpus         = 16
    memory       = 131072
    os_disk      = 100
    misc_disks   = [150]
    ip_addresses = ["10.0.X.102"]
    node_taints  = ["workload.sas.com/class=stateless:NoSchedule"]
    node_labels  = {
      "workload.sas.com/class" = "stateless"
    }
  }
}

Jump server

create_jump    = false         # Creation flag
jump_num_cpu   = 4             # 4 CPUs
jump_memory    = 8092          # 8 GB
jump_disk_size = 100           # 100 GB
jump_ip        = "10.0.x.111"  # Assigned values for static IPs

NFS server

create_nfs    = true          # Creation flag
nfs_num_cpu   = 8             # 8 CPUs
nfs_memory    = 16384         # 16 GB
nfs_disk_size = 2000          # 500 GB
nfs_ip        = "10.0.x.112"  # Assigned values for static IPs

Postgres Servers

postgres_servers = {
  default = {
    server_num_cpu         = 8             # 8 CPUs
    server_memory          = 16384         # 16 GB
    server_disk_size       = 250           # 256 GB
    server_ip              = "10.0.x.113"  # Assigned values for static IPs
    server_version         = 13            # PostgreSQL version
    server_ssl             = "off"         # SSL flag
    administrator_login    = "postgres"    # PostgreSQL admin user - CANNOT BE CHANGED
    administrator_password = "xxxxxxxxx"   # PostgreSQL admin user password
  }
}

jarpat commented 2 months ago

Going off your ansible-vars.yaml from the viya4-deployment issue, I believe the problem is with your INGRESS_NGINX_CONFIG, which is commented out.

Depending on the Load Balancer type you chose, kube-vip vs metallb, you will need to adjust externalTrafficPolicy accordingly. We have this documented here: https://github.com/sassoftware/viya4-iac-k8s/blob/main/docs/REQUIREMENTS.md#deployment
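For reference, a minimal sketch of the shape of that setting in the viya4-deployment ansible-vars.yaml, assuming INGRESS_NGINX_CONFIG passes Helm values through to the ingress-nginx chart; check the linked REQUIREMENTS.md for the externalTrafficPolicy value that matches your load balancer type:

# Sketch only -- use the value the linked documentation gives for kube_vip vs. metallb.
INGRESS_NGINX_CONFIG:
  controller:
    service:
      externalTrafficPolicy: Local   # or Cluster, per the documentation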

Side note: you can surround text with triple backticks in comments so it's formatted as a code block; that way the YAML you paste gets rendered correctly. At the moment it's hard to read. For example:

foo:
  bar: value
  bar2: value2
SASCloudLearner commented 2 months ago

Hi @jarpat, today our infra team had to redeploy the cluster with MetalLB instead of kube-vip, so I haven't changed it back, but we usually set this value accordingly. The issue still occurs, though. Our infra team also tried to deploy only the ingress controller on a different cluster, and they hit a similar issue where they are not able to resolve the service name to an IP. Also, when I checked, I found that a reverse nslookup works: the IP resolves to a service DNS name. Any idea what could be going wrong? Do you see anything wrong in the IAC ansible-vars details I shared?
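For illustration only, a reverse lookup of a service ClusterIP using the same kubectl run pattern as below; 10.255.0.zz is a placeholder, substitute the actual ClusterIP:

kubectl run --rm -it --image curlimages/curl dns-ptr-test --restart=Never -- nslookup 10.255.0.zz
# Getting a PTR answer like <name>.<namespace>.svc.<cluster domain> here, while the
# forward lookup of the short name fails, matches what is described above.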

SASCloudLearner commented 2 months ago

This is a sample command of what I meant:

kubectl run --rm -it --image curlimages/curl dns-test --restart=Never -- nslookup cert-manager-webhook.cert-manager.svc
Server:  10.255.0.zz
Address: 10.255.0.zz:53

** server can't find cert-manager-webhook.cert-manager.svc: NXDOMAIN

** server can't find cert-manager-webhook.cert-manager.svc: NXDOMAIN

SASCloudLearner commented 2 months ago

@jarpat I figured out the issue but not the solution, so I want to check whether you are aware of it. When I run kubectl run --rm -it --image curlimages/curl dns-test --restart=Never -- nslookup cert-manager-webhook.cert-manager.svc.cluster.local and include the default domain cluster.local, it does resolve to an IP address. So I understand that CoreDNS is not able to resolve the short names. Any advice on this issue?
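A hedged way to narrow this down is to look at the DNS search path a pod actually receives; the pod name dns-conf-test is arbitrary:

kubectl run --rm -it --image curlimages/curl dns-conf-test --restart=Never -- cat /etc/resolv.conf
# On a typical cluster the output includes a line like
#   search default.svc.<cluster domain> svc.<cluster domain> <cluster domain>
# plus options ndots:5. Short names such as <svc>.<namespace>.svc only resolve when
# CoreDNS answers for one of those search suffixes.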

Or should we be able to update the cert-manager webhook service to register with the full domain name?

Or could something else be going wrong?

Thanks in advance! Raghu

jarpat commented 2 months ago

I tried deploying with similar infrastructure to yours and had no luck recreating the issue. Going off the configuration you posted, could it be that your control_plane IPs and cluster_lb_addresses are conflicting?

From above:

# ...truncated
cluster_lb_addresses = ["10.0.X.124-10.0.X.129"]

control_plane = {
  count        = 3
  cpus         = 2
  memory       = 4096
  os_disk      = 100
  ip_addresses = [
    "10.0.X.114", "10.0.X.125", "10.0.X.126",
  ]
  node_taints = []
  node_labels = {}
}
# ...truncated
jarpat commented 1 month ago

It's been 30 days since the last response, marking as stale.