kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

HTTP requests not forwarded by K8s cluster to nginx ingress running on it. #9981

Closed: neo3matrix closed this issue 8 months ago

neo3matrix commented 1 year ago

Issue: My HTTP requests to the nginx ingress controller never reach the controller running on the k8s cluster. After running traceroute and other commands, I suspect something within the k8s DNS(?) or networking is preventing my HTTP requests from reaching the nginx ingress controller.

General setup: I have 114 servers in my data center, and two separate k8s clusters deployed via kubespray, each on 3 servers (3 servers for cluster1 and 3 servers for cluster2).

On each k8s cluster, nginx ingress runs as a LoadBalancer service, with MetalLB assigning it an external IP (chart versions below).

I have observed that not all of the other servers in my DC can send HTTP requests to my nginx ingress in either cluster. All of these servers can ping the cluster nodes and resolve the my-nginx1.mycompany.com DNS name just fine, so there are no firewall or general networking issues (I have confirmed this). Also, a few servers can send HTTP requests to cluster1, but only a couple of servers can send HTTP requests to cluster2, even though the setup is EXACTLY the same except for the external IP and DNS name. Based on traceroute and other commands, I suspect something in Kubernetes DNS(?) is causing the problem. There are no logs in the nginx ingress controller pod, as if the requests never reached it.

Can someone please help me?

Environment:

Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.6", GitCommit:"b39bf148cd654599a52e867485c02c4f9d28b312", GitTreeState:"clean", BuildDate:"2022-09-21T13:12:04Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}

**Nginx helm chart version**:

> nginx-ingress-0.16.1

**metallb helm chart version**:

> metallb-0.13.9

- **OS (`printf "$(uname -srm)\n$(cat /etc/os-release)\n"`):**

```
Linux 3.10.0-1160.31.1.el7.x86_64 x86_64
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
```


- **Version of Ansible** (`ansible --version`):

- **Version of Python** (`python --version`):

**Kubespray version (commit) (`git rev-parse --short HEAD`):**

> 18efdc2c5

**Network plugin used**:

> Calico

sohnaeo commented 1 year ago

@neo3matrix

To narrow down the issue, can you ping the pod IPs from the boxes? For example, run an nginx pod in any namespace and then try to ping its IP from any worker node. Does it work? This will help to eliminate any routing/firewall/Calico issues.

Are your nodes on VMware infrastructure?
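
For reference, a minimal sketch of that test (the pod name `nginx-test` is a placeholder):

```sh
# run a throwaway nginx pod and note its IP and node
kubectl run nginx-test --image=nginx --restart=Never
kubectl get pod nginx-test -o wide

# then, from any worker node, substitute the pod IP:
ping -c 3 <pod-ip>
```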

neo3matrix commented 1 year ago

> @neo3matrix
>
> To narrow down the issue, can you ping the pod IPs from the boxes? For example, run an nginx pod in any namespace and then try to ping its IP from any worker node. Does it work? This will help to eliminate any routing/firewall/Calico issues.
>
> Are your nodes on VMware infrastructure?

@sohnaeo Thank you for your quick reply.

No, my nodes are physical servers, not on VMware.

> try to ping its IP from any worker node.

Yes, pinging the nginx pod's IP works fine from every worker node.

neo3matrix commented 1 year ago

Can anyone please help?

sohnaeo commented 1 year ago

> @neo3matrix To narrow down the issue, can you ping the pod IPs from the boxes? For example, run an nginx pod in any namespace and then try to ping its IP from any worker node. Does it work? This will help to eliminate any routing/firewall/Calico issues. Are your nodes on VMware infrastructure?
>
> @sohnaeo Thank you for your quick reply.
>
> No, my nodes are physical servers, not on VMware.
>
> Yes, pinging the nginx pod's IP works fine from every worker node.

Sorry for the late reply. Can you try issuing the command below against your nginx ingress NodePort? Are you running nginx ingress on a NodePort?

```sh
curl 127.0.0.1:33000 --header 'Host: youringressslink.com'
```

33000 is the ingress NodePort at my end; change it per your environment. Change the `Host` header per your domain.

neo3matrix commented 1 year ago

@sohnaeo Hi, no, I am not running nginx ingress on a NodePort; I am running it behind a load balancer. The nginx ingress is a LoadBalancer service, and the MetalLB load balancer gives it an external IP address from my pool of (unused) subnet IPs.

sohnaeo commented 1 year ago

> @sohnaeo Hi, no, I am not running nginx ingress on a NodePort; I am running it behind a load balancer. The nginx ingress is a LoadBalancer service, and the MetalLB load balancer gives it an external IP address from my pool of (unused) subnet IPs.

In this case, I will take one step back. Let's assume you have an nginx pod installed. Can you get its IP and run curl against it from any node? You should get the nginx welcome page as the HTTP response. I would also create a Service and try to access the nginx welcome page through that Service, to make sure kube-proxy works. The ingress forwards requests to a Service, so the Service should be accessible. If both the pod IP and the Service work, then the Calico networking is definitely fine.

If both work, we can then look further into the ingress service.

neo3matrix commented 1 year ago

@sohnaeo Hi,

  1. I got my nginx pod's IP and tried curl against it from all 3 nodes; I get the default response from nginx (404 Not Found, which is its default message). So, the curl command works from all 3 nodes against the nginx pod IP.
  2. I didn't get the Service part you asked about. Could you please give me a rough example of what you expect when you say "I would also create a service and try to access the nginx welcome page"?

sohnaeo commented 1 year ago

> @sohnaeo Hi,
>
> 1. I got my nginx pod's IP and tried curl against it from all 3 nodes; I get the default response from nginx (404 Not Found, which is its default message). So, the curl command works from all 3 nodes against the nginx pod IP.
> 2. I didn't get the Service part you asked about. Could you please give me a rough example of what you expect when you say "I would also create a service and try to access the nginx welcome page"?

The commands below will do the job:

```sh
kubectl create deployment nginx --image=nginx --port=80
kubectl expose deployment nginx

kubectl get svc

curl http://<svc_ip>
```

https://kubernetes.io/docs/tutorials/kubernetes-basics/expose/expose-intro/

neo3matrix commented 1 year ago

@sohnaeo Thank you for that pointer.

Yes, I confirm that both `curl http://<pod_IP>` and `curl http://<svc_ip>` work successfully on all 3 k8s nodes.

So, given that the Calico network works fine, here's my nginx ingress controller service to debug:

```
$ kubectl describe svc nginx-stable-nginx-ingress -n nginx-stable
Name:                     nginx-stable-nginx-ingress
Namespace:                nginx-stable
Labels:                   app.kubernetes.io/instance=nginx-stable
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=nginx-stable-nginx-ingress
                          helm.sh/chart=nginx-ingress-0.16.1
Annotations:              meta.helm.sh/release-name: nginx-stable
                          meta.helm.sh/release-namespace: nginx-stable
                          metallb.universe.tf/ip-allocated-from-pool: first-pool
Selector:                 app=nginx-stable-nginx-ingress
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.233.54.186
IPs:                      10.233.54.186
LoadBalancer Ingress:     My-external-IP-from-subnet
Port:                     http  80/TCP
TargetPort:               80/TCP
NodePort:                 http  32438/TCP
Endpoints:                10.233.95.243:80
Port:                     https  443/TCP
TargetPort:               443/TCP
NodePort:                 https  30686/TCP
Endpoints:                10.233.95.243:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     31540
Events:
```
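
One detail in this output that may matter later: the service has `External Traffic Policy: Local`, and in MetalLB L2 mode only nodes that host a ready ingress controller pod will answer for the external IP. A minimal sketch of how one might cross-check this (the `arping` step assumes the client sits on the same L2 segment as the cluster; the IP is a placeholder):

```sh
# which node hosts the ingress controller pod backing the endpoints above?
kubectl get pods -n nginx-stable -o wide

# confirm the service endpoints point at that pod
kubectl get endpoints nginx-stable-nginx-ingress -n nginx-stable

# from a client on the same L2 segment: does ARP for the external IP
# resolve, and to which node's MAC?
arping -c 3 <My-external-IP-from-subnet>
```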

sohnaeo commented 1 year ago

> @sohnaeo Thank you for that pointer.
>
> Yes, I confirm that both `curl http://<pod_IP>` and `curl http://<svc_ip>` work successfully on all 3 k8s nodes.

Are there any network policies in place?

```sh
kubectl get netpols -A
```

If not, then I have doubts about your MetalLB setup.

neo3matrix commented 1 year ago

I don't see any network policy here:

```
$ kubectl get netpols -A
error: the server doesn't have a resource type "netpols"
```

sohnaeo commented 1 year ago

> kubectl get netpols -A

Sorry, typo:

```sh
kubectl get netpol -A
```

neo3matrix commented 1 year ago

```
$ kubectl get netpol -A
No resources found
```

Looks like there isn't any.

sohnaeo commented 1 year ago

> ```
> $ kubectl get netpol -A
> No resources found
> ```
>
> Looks like there isn't any.

What error do you get when you browse to the ingress link? A 404 or a timeout? Please check your MetalLB setup; it looks like a networking issue between the nginx ingress and MetalLB.

neo3matrix commented 1 year ago

It's a little strange. From some servers I get a 404 error, but from other servers (and even from my laptop) I get a timeout.

Does kubespray set up any network policy by default during installation? Let me paste my MetalLB config in the next comment.

sohnaeo commented 1 year ago

> It's a little strange. From some servers I get a 404 error, but from other servers (and even from my laptop) I get a timeout.
>
> Does kubespray set up any network policy by default during installation? Let me paste my MetalLB config in the next comment.

No, kubespray doesn't deploy any network policy by default. How do you connect from your laptop to MetalLB? Check that network segment.
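
A quick way to check that segment from a failing client (Linux commands; the external IP is a placeholder):

```sh
# which interface and gateway does this client use to reach the external IP?
ip route get <external-ip>

# try the request, then see whether ARP for the IP resolved at all
# (only meaningful if the client is on the same L2 segment as the cluster)
curl -m 5 http://<external-ip>/
ip neigh show | grep <external-ip>
```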

neo3matrix commented 1 year ago

$ cat roles/metallb-loadbalancer/templates/l2advertisement.yaml

```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example
  namespace: {{ k8s_metallb_release_name }}
```

$ cat roles/metallb-loadbalancer/templates/ipaddresspool.yaml

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: {{ k8s_metallb_release_name }}
spec:
  addresses: {{ ip_range_array }}
```

```sh
/usr/local/bin/helm install "{{ k8s_metallb_release_name }}" --create-namespace --namespace="{{ k8s_metallb_release_name }}" metallb/metallb --wait
```
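
For reference, a quick way to confirm that these rendered resources actually exist in the cluster with the expected pool addresses (CRD names as in MetalLB 0.13):

```sh
kubectl get ipaddresspools.metallb.io -A
kubectl get l2advertisements.metallb.io -A
```

Note that an `L2Advertisement` without a `spec.ipAddressPools` selector advertises all pools, so the minimal template above should be valid as written.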

sohnaeo commented 1 year ago


Are you using CoreDNS? If you are getting a timeout, then I believe it is not a DNS issue. You can test DNS by running busybox and using the nslookup utility to resolve the A record, as sketched below.
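
A minimal sketch of that DNS test (image tag and hostname are placeholders):

```sh
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup my-nginx1.mycompany.com
```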

neo3matrix commented 1 year ago

@sohnaeo Yes, the CoreDNS pods are running in kube-system. I am not explicitly doing anything there; it came as part of the default installation. The A record resolves fine when I ping or run nslookup, from within or outside the cluster. It's only HTTP requests that are timing out.
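
Since the A record resolves but HTTP times out, one way to take DNS out of the equation entirely is to pin the name to the IP in curl (hostname and IP are placeholders):

```sh
# if this also times out, the problem is at the TCP/ARP layer, not DNS
curl -v -m 5 --resolve my-nginx1.mycompany.com:80:<external-ip> \
  http://my-nginx1.mycompany.com/
```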

sohnaeo commented 1 year ago

> @sohnaeo Yes, the CoreDNS pods are running in kube-system. I am not explicitly doing anything there; it came as part of the default installation. The A record resolves fine when I ping or run nslookup, from within or outside the cluster. It's only HTTP requests that are timing out.

What about the MetalLB pool IPs? Are those IPs routable from all machines? Was it working before? Did it break after some event, such as an upgrade?

neo3matrix commented 1 year ago

Let me check on that area. Also, let me try another version of the MetalLB helm chart. I will update this post soon.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

VannTen commented 9 months ago

Can you still reproduce this?

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 8 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/kubespray/issues/9981#issuecomment-1956465864):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.