kubernetes / kube-proxy

kube-proxy component configs

one pod cannot access the clusterIP of another pod running on the same node #19

Closed luckydogxf closed 1 year ago

luckydogxf commented 1 year ago

I have two pods on the same node.

openstack-exporter-5555444865-bctlh                1/1     Running     1 (4h55m ago)   6h33m   10.244.2.61      k8s-worker-4.pax-texes.com

prometheus-server-6c549c7d4b-fmfvz                 2/2     Running     0               4h22m   10.244.2.98       k8s-worker-4.pax-texes.com

However, prometheus-server cannot access the ClusterIP of openstack-exporter.

openstack-exporter                      ClusterIP   10.101.122.233   <none>        9180/TCP            6h43m

Here is a tcpdump capture against cni0:

ubuntu@k8s-worker-4:~$ sudo tcpdump -i cni0  -vvv host 10.101.122.233  -n
tcpdump: listening on cni0, link-type EN10MB (Ethernet), capture size 262144 bytes

17:31:08.094903 IP (tos 0x0, ttl 64, id 38276, offset 0, flags [DF], proto TCP (6), length 60)
    10.244.2.98.56726 > 10.101.122.233.9180: Flags [S], cksum 0x92d2 (incorrect -> 0x40c0), seq 816090733, win 62370, options [mss 8910,sackOK,TS val 1403735983 ecr 0,nop,wscale 7], length 0
17:31:09.102061 IP (tos 0x0, ttl 64, id 38277, offset 0, flags [DF], proto TCP (6), length 60)
    10.244.2.98.56726 > 10.101.122.233.9180: Flags [S], cksum 0x92d2 (incorrect -> 0x3cd1), seq 816090733, win 62370, options [mss 8910,sackOK,TS val 1403736990 ecr 0,nop,wscale 7], length 0
17:31:11.118078 IP (tos 0x0, ttl 64, id 38278, offset 0, flags [DF], proto TCP (6), length 60)
    10.244.2.98.56726 > 10.101.122.233.9180: Flags [S], cksum 0x92d2 (incorrect -> 0x34f1), seq 816090733, win 62370, options [mss 8910,sackOK,TS val 1403739006 ecr 0,nop,wscale 7], length 0

We can see the ClusterIP does not respond at all: only SYN retransmissions, never a SYN/ACK.

I use flannel VXLAN and kube-proxy in IPVS mode. Please help, thanks.

The weird thing is that it works initially, and then fails after running for a while.
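For reference, the IPVS virtual server for the Service and the conntrack state for the failing connection can be inspected directly on the node; a minimal check, assuming ipvsadm and conntrack-tools are installed on the worker:

ubuntu@k8s-worker-4:~$ # list the IPVS virtual server for the ClusterIP and its real servers (pod backends)
ubuntu@k8s-worker-4:~$ sudo ipvsadm -Ln -t 10.101.122.233:9180
ubuntu@k8s-worker-4:~$ # look for a conntrack entry for the attempted connection to the ClusterIP
ubuntu@k8s-worker-4:~$ sudo conntrack -L -d 10.101.122.233 -p tcp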

luckydogxf commented 1 year ago

I can reproduce this issue.

ubuntu@k8s-master-1:~$ kubectl get po -o wide | grep worker-3
collabora-5cd79f7f4-6v5fw                          1/1     Running     0             26d     10.244.7.131     k8s-worker-3.pxxxxcom   <none>           <none>
nextcloud-56d54cdfcf-jr85j                         1/1     Running     0             26d     10.244.7.136     k8s-worker-3.pxxxxcom   <none>           <none>
nginx-6b5d56b66f-qffg4                             1/1     Running     2             26d     10.244.7.135     k8s-worker-3.pxxxxcom   <none>           <none>
opensanctions-app-postgresql-0                     1/1     Running     2 (25d ago)   219d    10.244.7.230     k8s-worker-3.pxxxxcom   <none>           <none>
opensanctions-app-web-86cd769645-g45gj             1/1     Running     0             26d     10.244.7.134     k8s-worker-3.pxxxxcom   <none>           <none>
prometheus-node-exporter-z7xhf                     1/1     Running     0             43h     172.16.215.156   k8s-worker-3.pxxxxcom   <none>           <none>
prometheus-blackbox-7bd4db7d6f-64ntz               1/1     Running     1 (104m ago)   18h     10.244.1.207     k8s-worker-5.pxxxxcom   <none>           <none>
prometheus-server-75874d8877-db62l                 2/2     Running     2 (104m ago)   18h     10.244.1.210     k8s-worker-5.pxxxxcom   <none>           <none>

ubuntu@k8s-master-1:~$ kubectl get svc
NAME                                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                       AGE
collabora                               ClusterIP   10.102.193.116   <none>        9980/TCP                      155d
grafana                                 ClusterIP   10.100.245.190   <none>        3000/TCP                      267d
nginx                                   ClusterIP   10.97.190.59     <none>        80/TCP                        176d
prometheus-blackbox                     ClusterIP   10.107.44.90     <none>        9115/TCP                      2d14h

ubuntu@k8s-master-1:~$ kubectl exec -it nextcloud-56d54cdfcf-jr85j /bin/sh

# telnet nginx 80
Trying 10.97.190.59...
Connected to nginx.default.svc.cluster.local.
Escape character is '^]'.
^]

# telnet grafana 3000
Trying 10.100.245.190...
Connected to grafana.default.svc.cluster.local.
Escape character is '^]'.
^]
telnet> Connection closed.
#

ubuntu@k8s-master-1:~$ kubectl exec -it prometheus-server-75874d8877-db62l /bin/sh

/ $ telnet 10.107.44.90 9115
^C

As we can see, prometheus-server and prometheus-blackbox are both running on worker-5. They were fine yesterday, but the connection does not work now.
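Before blaming the data path, it is worth confirming the Service still has a ready endpoint behind it and that the cross-node path works; a quick sanity check from the master node (a sketch, assuming the default namespace used above and that the nextcloud image still has telnet available):

ubuntu@k8s-master-1:~$ # the Endpoints object should list the blackbox pod IP (10.244.1.207) on port 9115
ubuntu@k8s-master-1:~$ kubectl get endpoints prometheus-blackbox -o wide
ubuntu@k8s-master-1:~$ # the same telnet from a pod on a different node (worker-3) should still succeed
ubuntu@k8s-master-1:~$ kubectl exec -it nextcloud-56d54cdfcf-jr85j -- telnet 10.107.44.90 9115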

luckydogxf commented 1 year ago

My environment is Kubernetes 1.26.0 on Ubuntu 20.04.

luckydogxf commented 1 year ago

Meanwhile, 10.107.44.90 is reachable from pods on other nodes.

luckydogxf commented 1 year ago

I added more comments here: https://github.com/kubernetes/kubernetes/issues/116453

luckydogxf commented 1 year ago

I used ksniff to capture packets against pod1 and located the root cause, but I don't know how to fix it. The correct three-way handshake would be:

1. pod1 --> clusterIP (DNAT'ed to pod2), SYN
2. clusterIP --> pod1, SYN/ACK (in the non-working case this arrives as pod2 --> pod1 with SYN/ACK)
3. pod1 --> clusterIP, ACK

But in step 2, I can see the connection is pod2 --> pod1, NOT clusterIP --> pod1. That's why the connection was reset.

In step 2 there should be a reverse NAT (SNAT) step that rewrites the source IP of the reply from pod2 back to the ClusterIP, but it was skipped.

I captured packets against pod2, too:

1. pod1 --> pod2 (DNAT'ed correctly), SYN
2. pod2 --> pod1, SYN/ACK
3. pod1 --> pod2, RST

All of this suggests that kube-proxy sometimes does not function correctly. Please ignore the IP differences; they vary from capture to capture because I created and deleted the pods again and again.

Working capture: https://drive.google.com/file/d/1c-fD42_M9GHPFdog0X4DWp580_etp7EZ/view?usp=sharing

Non-working capture, packets captured on pod2: https://drive.google.com/file/d/1eXJzl7DnLBn7t6MC3h_hghA1NNJPgwSv/view?usp=sharing

Non-working capture, packets captured on pod1: https://drive.google.com/file/d/1m5IilI5nJEpXv440_93QDNDbJ6tU8ULF/view?usp=sharing
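Given that the reverse NAT is only being skipped for same-node traffic, one common thing to check on a flannel/bridge setup is whether bridged pod-to-pod traffic is passed through iptables/conntrack at all; a quick check on the affected worker (a sketch, the expected values are an assumption about this setup):

ubuntu@k8s-worker-5:~$ # replies that stay on the cni0 bridge bypass conntrack (and therefore the un-DNAT) unless br_netfilter is active
ubuntu@k8s-worker-5:~$ lsmod | grep br_netfilter
ubuntu@k8s-worker-5:~$ sysctl net.bridge.bridge-nf-call-iptables   # should report 1
ubuntu@k8s-worker-5:~$ # the conntrack entry for the failing connection should show the DNAT from the ClusterIP to the pod IP
ubuntu@k8s-worker-5:~$ sudo conntrack -L -d 10.107.44.90 -p tcp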

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

danwinship commented 1 year ago

> I added more comments here: kubernetes/kubernetes#116453

yeah, sorry, just noticing this now. We don't use this repo for bug reports; kube-proxy issues should be reported to kubernetes/kubernetes, so please follow up there if this is still a problem.

danwinship commented 1 year ago

/close

k8s-ci-robot commented 1 year ago

@danwinship: Closing this issue.

In response to [this](https://github.com/kubernetes/kube-proxy/issues/19#issuecomment-1658326117):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.