Calico loses pod network

megastallman commented 4 years ago

Expected Behavior

Pods should be able to communicate with each other.

Current Behavior

Once every couple of months, a pod on some random production node loses its connection to another pod. The two pods run on different nodes and share the same namespace.

Possible Solution

Calico probably goes out of sync...

Steps to Reproduce (for bugs)

It happens too rarely and is nearly impossible to reproduce. I can only provide some logs, and they look no different from normal operation. If I restart the affected pod, Calico resyncs and networking is restored. So, at the very least, the 'cali*' network interface has to be recreated.

Context

This is how Nginx loses its backend:

[error] 47#47: *151 upstream timed out (110: Connection timed out) while connecting to upstream

And netstat output:

tcp        0      <Nginx IP>:33430      <PHP_FPM IP>:9000     SYN_SENT
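A connection that stays in SYN_SENT, as in the netstat line above, can also be spotted programmatically. The following is a minimal sketch, not something taken from this report: it assumes the Linux /proc/net/tcp text format (IPv4 only) on a little-endian host such as x86_64, and the function names decode_addr and find_syn_sent are hypothetical.

```python
import socket
import struct

SYN_SENT = 0x02  # state code the kernel prints for SYN_SENT in /proc/net/tcp


def decode_addr(hex_addr):
    """Turn '0100007F:0035' into ('127.0.0.1', 53).

    Assumption: little-endian host, so the IPv4 address bytes appear reversed.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def find_syn_sent(proc_net_tcp_text):
    """Return (local, remote) address pairs for sockets in SYN_SENT."""
    stuck = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        if int(fields[3], 16) == SYN_SENT:
            stuck.append((decode_addr(fields[1]), decode_addr(fields[2])))
    return stuck
```

To use it, pass the contents of /proc/net/tcp to find_syn_sent. Since /proc/net is per network namespace, the file would have to be read from inside the affected pod (here, the Nginx pod) rather than from the node's own namespace.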

Your Environment

Calico version - v3.8.1
Orchestrator version - kubernetes: Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Operating System and version: Linux xxxxx 5.0.0-1011-aws #12-Ubuntu SMP Tue Jul 2 18:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux, Ubuntu 19.04

song-jiang commented 4 years ago

There does not seem to be enough information here to debug this further. @megastallman The next time it happens, could you leave the pod running and collect a tcpdump capture and the iptables rules?

lwr20 commented 3 years ago

Also collect the output of 'ip route' on the node with the failed pod.

If you happen to see a pod in this state, you might try jumping onto the Calico Slack (http://slack.projectcalico.org/), where we can try to help you live.
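The data requested in this thread (a tcpdump capture, the iptables rules, and the ip route output) could be gathered in one pass while the pod is still in the failed state. The script below is a rough sketch under stated assumptions, not an official Calico tool: it assumes it runs as root on the affected node, that ip, iptables-save, and tcpdump are installed, and that you already know the name of the pod's 'cali*' interface. The function names, output file names, and defaults are all hypothetical.

```python
import subprocess
from pathlib import Path


def diagnostic_commands(cali_iface, pcap_path, packet_count=200):
    """Map an output file name to the command whose output it will hold."""
    return {
        "ip-route.txt": ["ip", "route"],
        "ip-link.txt": ["ip", "-d", "link", "show"],
        "iptables-save.txt": ["iptables-save", "-c"],  # -c includes counters
        # tcpdump writes packets to pcap_path; only its messages land in the .txt
        "tcpdump.txt": ["tcpdump", "-n", "-i", cali_iface,
                        "-c", str(packet_count), "-w", str(pcap_path)],
    }


def collect(out_dir, cali_iface, timeout_s=60):
    """Run each command and save stdout and stderr under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cmds = diagnostic_commands(cali_iface, out / "capture.pcap")
    for name, cmd in cmds.items():
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=timeout_s)
            text = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing tool, missing permission, or a capture that saw no traffic
            text = f"failed: {exc}\n"
        (out / name).write_text(text)
```

A call such as collect("/tmp/calico-diag", "cali123") would write one file per command, where "cali123" stands in for the real interface name. The timeout matters for tcpdump in particular: if the interface carries no traffic, the capture would otherwise wait indefinitely for the requested packet count.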
