kubernetes / examples

Kubernetes application example tutorials

Guestbook-Go example - redis slave pods are not able to connect to redis master pod when running the deployment on 2 worker nodes #433

Closed. idankish closed this issue 2 years ago.

idankish commented 2 years ago

Hi there, I deployed the "Guestbook-Go example" on a K8s cluster consisting of a master node and 2 worker nodes. In this case the "redis-master" pod was scheduled on the "node-2" worker node, and one "redis-slave" pod was scheduled on each worker node. Please see below:

[root@master-node ~]# kubectl get pods --all-namespaces -o wide
NAMESPACE   NAME                 READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
default     guestbook-8w7ld      1/1     Running   0          36h   10.36.0.4   node-2
default     guestbook-bk8zn      1/1     Running   0          36h   10.44.0.3   node-1
default     guestbook-pvh77      1/1     Running   0          36h   10.36.0.3   node-2
default     redis-master-sgbch   1/1     Running   0          36h   10.36.0.1   node-2
default     redis-slave-r7kbk    1/1     Running   0          36h   10.44.0.2   node-1
default     redis-slave-sxqfm    1/1     Running   0          36h   10.36.0.2   node-2
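For context, the example was deployed with the configuration files from this repository, along these lines (a sketch; the paths and file names assume the stock guestbook-go directory layout):

# Deploy the Guestbook-Go example (sketch; paths/file names assume the
# stock manifests in this repository's guestbook-go directory)
kubectl create -f guestbook-go/redis-master-controller.json
kubectl create -f guestbook-go/redis-master-service.json
kubectl create -f guestbook-go/redis-slave-controller.json
kubectl create -f guestbook-go/redis-slave-service.json
kubectl create -f guestbook-go/guestbook-controller.json
kubectl create -f guestbook-go/guestbook-service.json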

Everything seems to be deployed, but the problem I am facing now is that the "redis-slave" pods are not able to connect to the "redis-master" pod.

[root@master-node ~]# kubectl logs -f redis-slave-r7kbk
[8] 04 Jan 08:40:01.708 # Unable to connect to MASTER: Connection timed out
[8] 04 Jan 08:40:02.711 * Connecting to MASTER redis-master:6379
[8] 04 Jan 08:40:22.732 # Unable to connect to MASTER: Connection timed out
[8] 04 Jan 08:40:23.734 * Connecting to MASTER redis-master:6379

[root@master-node ~]# kubectl logs -f redis-slave-sxqfm
[9] 04 Jan 08:40:29.719 * Connecting to MASTER redis-master:6379
[9] 04 Jan 08:40:49.736 # Unable to connect to MASTER: Connection timed out
[9] 04 Jan 08:40:50.739 * Connecting to MASTER redis-master:6379
[9] 04 Jan 08:41:10.759 # Unable to connect to MASTER: Connection timed out

Meanwhile, I can see that the "redis-master" pod is up and running and listening on port 6379:

[root@master-node ~]# kubectl logs -f redis-master-sgbch
[Redis ASCII-art startup banner: Redis 2.8.19 (00000000/0) 64 bit, running in stand alone mode, Port: 6379, PID: 1, http://redis.io]

[1] 02 Jan 20:02:41.291 # Server started, Redis version 2.8.19
[1] 02 Jan 20:02:41.292 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[1] 02 Jan 20:02:41.292 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[1] 02 Jan 20:02:41.292 * The server is now ready to accept connections on port 6379
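The slave logs above show the service name redis-master resolving (the failure happens after "Connecting to MASTER"), so this looks like a connectivity problem rather than DNS. A quick way to confirm that from inside the cluster is something like the following (a sketch using a throwaway busybox pod; the image tag and timeout are arbitrary):

# Start a temporary pod for in-cluster checks (deleted on exit)
kubectl run net-debug --rm -it --image=busybox:1.35 --restart=Never -- sh

# Inside the pod:
nslookup redis-master             # should return the redis-master ClusterIP
timeout 5 nc redis-master 6379    # hangs and times out if the pod network path is broken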

I followed the instructions exactly and used the provided configuration files to deploy the services, but I still can't figure out what is wrong. Here is the output of "kubectl get svc":

[root@master-node ~]# kubectl get svc
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP       PORT(S)          AGE
guestbook      LoadBalancer   10.110.94.208   194.233.163.249   3000:30508/TCP   36h
kubernetes     ClusterIP      10.96.0.1       <none>            443/TCP          2d11h
redis-master   ClusterIP      10.105.56.60    <none>            6379/TCP         36h
redis-slave    ClusterIP      10.105.1.8      <none>            6379/TCP         36h

[root@master-node ~]# kubectl describe svc redis-master
Name:              redis-master
Namespace:         default
Labels:            app=redis
                   role=master
Annotations:       <none>
Selector:          app=redis,role=master
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.105.56.60
IPs:               10.105.56.60
Port:              6379/TCP
TargetPort:        redis-server/TCP
Endpoints:         10.36.0.1:6379
Session Affinity:  None
Events:            <none>
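The Service definition itself looks fine: the selector matches the master pod and the Endpoints field shows 10.36.0.1:6379. A couple of extra checks along these lines can rule out the Service layer (a sketch; the iptables check assumes kube-proxy is running in iptables mode):

# Endpoints should stay populated with the master pod IP
kubectl get endpoints redis-master
kubectl get pods -l app=redis,role=master -o wide

# On a node: kube-proxy should have rules tagged with the service name
iptables-save | grep redis-master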

While trying to troubleshoot, I also tried to telnet to the "redis-master" pod's internal IP and port (10.36.0.1:6379) from both worker nodes. The port is reachable from the worker node where the "redis-master" pod is running (node-2):

[root@node-2 ~]# telnet 10.36.0.1 6379
Trying 10.36.0.1...
Connected to 10.36.0.1.
Escape character is '^]'.

But not reachable from the second worker node:

[root@node-1 ~]# telnet 10.36.0.1 6379
Trying 10.36.0.1...
telnet: connect to address 10.36.0.1: Connection timed out

^]
telnet> q
Connection closed.
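Since the pod IP is only reachable from the node that hosts the pod, this points at the pod network (CNI) rather than at Redis or the Service. The 10.36.x.x / 10.44.x.x addresses look like Weave Net's default range, but that is an assumption; if Weave is the CNI here, checks along these lines would be a reasonable next step (node-2's IP below is a placeholder):

# Are the CNI DaemonSet pods healthy on every node? (names assume Weave Net)
kubectl get pods -n kube-system -o wide | grep weave
kubectl logs -n kube-system -l name=weave-net -c weave --tail=20

# From node-1: can the overlay ports on node-2 be reached?
# (Weave defaults: TCP 6783, UDP 6783/6784)
nc -zv <node-2-ip> 6783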

Even a ping to the "redis-master" pod's internal IP address from worker node-1 fails:

[root@node-1 ~]# ping 10.36.0.1
PING 10.36.0.1 (10.36.0.1) 56(84) bytes of data.
^C
--- 10.36.0.1 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3051ms
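Another quick check from node-1 is which route the kernel would use to reach the pod IP; on a working overlay it should go out via the CNI interface rather than the default gateway (a sketch; interface names vary by CNI plugin):

# Which interface would traffic to the master pod IP use from node-1?
ip route get 10.36.0.1

# List CNI-related interfaces (names vary: weave, cni0, flannel.1, ...)
ip -brief link | grep -Ei 'weave|cni|flannel'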

I also disabled the built-in CentOS firewall (firewalld) on both worker nodes, but still with no success.
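Besides firewalld, something else may be dropping forwarded traffic; on CentOS hosts running Docker, the iptables FORWARD chain policy being set to DROP is a common cause of exactly this cross-node symptom (a sketch; assumes iptables is in use on the nodes):

# Confirm firewalld is really stopped and inspect the FORWARD chain policy
systemctl status firewalld --no-pager
iptables -L FORWARD -n --line-numbers | head

# Docker can set the FORWARD policy to DROP; as a temporary test only:
iptables -P FORWARD ACCEPT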

Any idea what could be wrong, or what I am missing?

Thanks, Idan

warrensbox commented 2 years ago

@idankish I proposed a temporary fix. I hope it helps. See #437

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/examples/issues/433#issuecomment-1172818106):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues and PRs according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue or PR with `/reopen`
>- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.