luckydogxf closed this issue 1 year ago
I can reproduce this issue.
ubuntu@k8s-master-1:~$ kubectl get po -o wide | grep worker-3
collabora-5cd79f7f4-6v5fw 1/1 Running 0 26d 10.244.7.131 k8s-worker-3.pxxxxcom <none> <none>
nextcloud-56d54cdfcf-jr85j 1/1 Running 0 26d 10.244.7.136 k8s-worker-3.pxxxxcom <none> <none>
nginx-6b5d56b66f-qffg4 1/1 Running 2 26d 10.244.7.135 k8s-worker-3.pxxxxcom <none> <none>
opensanctions-app-postgresql-0 1/1 Running 2 (25d ago) 219d 10.244.7.230 k8s-worker-3.pxxxxcom <none> <none>
opensanctions-app-web-86cd769645-g45gj 1/1 Running 0 26d 10.244.7.134 k8s-worker-3.pxxxxcom <none> <none>
prometheus-node-exporter-z7xhf 1/1 Running 0 43h 172.16.215.156 k8s-worker-3.pxxxxcom <none> <none>
prometheus-blackbox-7bd4db7d6f-64ntz 1/1 Running 1 (104m ago) 18h 10.244.1.207 k8s-worker-5.pxxxxcom <none> <none>
prometheus-server-75874d8877-db62l 2/2 Running 2 (104m ago) 18h 10.244.1.210 k8s-worker-5.pxxxxcom <none> <none>
ubuntu@k8s-master-1:~$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
collabora ClusterIP 10.102.193.116 <none> 9980/TCP 155d
grafana ClusterIP 10.100.245.190 <none> 3000/TCP 267d
nginx ClusterIP 10.97.190.59 <none> 80/TCP 176d
prometheus-blackbox ClusterIP 10.107.44.90 <none> 9115/TCP 2d14h
ubuntu@k8s-master-1:~$ kubectl exec -it nextcloud-56d54cdfcf-jr85j /bin/sh
# telnet nginx 80
Trying 10.97.190.59...
Connected to nginx.default.svc.cluster.local.
Escape character is '^]'.
^]
# telnet grafana 3000
Trying 10.100.245.190...
Connected to grafana.default.svc.cluster.local.
Escape character is '^]'.
^]
telnet> Connection closed.
#
ubuntu@k8s-master-1:~$ kubectl exec -it prometheus-server-75874d8877-db62l /bin/sh
/ $ telnet 10.107.44.90 9115
^C
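A probe that fails fast instead of hanging may be easier to use here (a sketch, not from the original session; it assumes a busybox-style wget in the image, which the / $ prompt suggests, and relies on blackbox exporter serving /metrics on 9115):

wget -qO- -T 3 http://10.107.44.90:9115/metrics | head -n 3
# times out after 3s in the failing state; from a pod on another node
# the same command returns metrics, matching the observation below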
As we can see, prometheus-server and blackbox are both running on worker-5. They were fine yesterday, but the connection does not work now.
My environment is Kubernetes 1.26.0 on Ubuntu 20.04.
Meanwhile, 10.107.44.90 is reachable from pods on other nodes.
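One quick sanity check (a sketch added here, not part of the original report) is to confirm the service still has the pod registered as an endpoint:

kubectl get endpoints prometheus-blackbox -o wide
# the pod IP (10.244.1.207:9115 above) should be listed; if it is,
# the failure sits below the service layer, in kube-proxy/conntrack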
I added more comment here https://github.com/kubernetes/kubernetes/issues/116453
I used ksniff to capture packets against pod1 and located the root cause, but I don't know how to fix it. The correct three-way handshake would be:

1. pod1 ---> clusterIP (DNAT'ed to pod2), SYN
2. clusterIP ---> pod1, SYN/ACK
3. pod1 ---> clusterIP, ACK

But in step 2, I can see the connection is pod2 ---> pod1, NOT clusterIP ---> pod1. In that step there should be an SNAT that rewrites the source IP of pod2's reply back to the clusterIP, and that step was skipped. That is why the connection was reset.

I captured packets against pod2, too, and saw: pod1 ---> pod2 (no problem, DNAT'ed), SYN; pod2 ---> pod1, SYN/ACK; pod1 ---> pod2, RST. All of this shows that kube-proxy sometimes may not function well. Please ignore the IP differences between captures; they vary from time to time because I created and deleted the pods again and again.
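One way to confirm the missing reverse translation from the node itself (a sketch added for illustration, not from the original captures; it assumes the conntrack tool is installed on the node and reuses the service IP from above):

sudo conntrack -L -d 10.107.44.90
# a healthy entry pairs the original tuple (pod1 -> clusterIP:9115) with a
# reply tuple (pod2 -> pod1); if no entry matches, the kernel cannot
# un-DNAT pod2's SYN/ACK, so pod1 sees it arrive from pod2's real IP
# and answers with RST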
Working one (captured on pod1): https://drive.google.com/file/d/1c-fD42_M9GHPFdog0X4DWp580_etp7EZ/view?usp=sharing
Non-working one, captured on pod2: https://drive.google.com/file/d/1eXJzl7DnLBn7t6MC3h_hghA1NNJPgwSv/view?usp=sharing
Non-working one, captured on pod1: https://drive.google.com/file/d/1m5IilI5nJEpXv440_93QDNDbJ6tU8ULF/view?usp=sharing
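For reference, a typical ksniff invocation that produces such a capture (a sketch using the pod name above as a placeholder; it assumes the ksniff kubectl plugin is installed):

kubectl sniff prometheus-server-75874d8877-db62l -o pod1.pcap
# streams the pod's traffic into pod1.pcap for inspection in Wireshark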
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I added more comment here kubernetes/kubernetes#116453
yeah, sorry, just noticing this now. We don't use this repo for bug reports; kube-proxy issues should be reported to kubernetes/kubernetes, so please follow up there if this is still a problem.
/close
@danwinship: Closing this issue.
I have two pods on the same node. However, prometheus-server cannot access the ClusterIP of openstack-exporter. Here is tcpdump against cni0. We can see the clusterIP does not return anything. I use Flannel VXLAN and ipvs mode. Please help, thanks. The weird thing is that it works initially and then fails after it has been running for a while.
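Two node-level checks that can narrow this down (a sketch, not from the original report; run on the node hosting the pods, substitute your own service IP/port for the blackbox example above, and note that ipvsadm and tcpdump must be installed on the node):

sudo ipvsadm -Ln -t 10.107.44.90:9115
# lists the ipvs virtual server and its real servers; the backend pod IP
# should appear, otherwise kube-proxy is carrying stale ipvs state
sudo tcpdump -ni cni0 host 10.107.44.90
# watches clusterIP traffic on the bridge; in the failing case the SYN
# appears but no reply ever comes back with the clusterIP as source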