Status: Open. Opened by megastallman 4 years ago.
There does not seem to be enough information here to debug this further. @megastallman, if it happens again, could you leave the pod running and collect a tcpdump capture and the iptables rules?
Also the output of ip route on the node with the failed pod.
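For anyone who wants to grab all of this in one go while the pod is still broken, here is a rough sketch of a collection script. It only wraps the commands mentioned above (ip route, iptables-save, tcpdump). The function names, output file names, tcpdump filter (`host <pod IP>`), and packet count are assumptions made for illustration, not something Calico prescribes. It has to run as root on the affected node.

```python
import subprocess
from pathlib import Path


def diagnostic_commands(pod_ip, packet_count=200):
    """Map output file names to the commands requested in this thread.

    The file names, the tcpdump filter and the packet count are assumptions.
    """
    return {
        "ip-route.txt": ["ip", "route"],
        "iptables-save.txt": ["iptables-save"],
        # Capture a bounded number of packets to/from the failing pod.
        "tcpdump.txt": ["tcpdump", "-i", "any", "-nn",
                        "-c", str(packet_count), "host", pod_ip],
    }


def collect(pod_ip, out_dir, run=subprocess.run, timeout=60):
    """Run each command and write its output into out_dir.

    `run` is injectable so the function can be exercised without root.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, cmd in diagnostic_commands(pod_ip).items():
        try:
            result = run(cmd, capture_output=True, text=True, timeout=timeout)
            text = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the whole collection.
            text = f"failed to run {cmd}: {exc}\n"
        path = out / filename
        path.write_text(text)
        written.append(path)
    return written
```

Calling `collect("<pod IP>", "/tmp/calico-debug")` on the node should leave three text files that can be attached to this issue. Note that if tcpdump hits the timeout before it has seen enough packets, this sketch records only the timeout and not the partial capture.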
If you happen to see a pod in this state, you could try joining the Calico Slack at http://slack.projectcalico.org/ and we can try to help you live.
Expected Behavior
Pods should be able to communicate with each other.
Current Behavior
Once every couple of months, a pod on some random production node loses its connection to another pod. The two pods reside on different nodes and share the same namespace.
Possible Solution
Probably Calico goes out of sync...
Steps to Reproduce (for bugs)
It happens too rarely and is nearly impossible to reproduce. I can only provide some logs that look no different from normal operation. If I restart the affected pod, Calico resyncs and networking is restored. So, at the very least, the 'cali*' network interface needs to be recreated.
Context
That is how Nginx loses its backend:
And netstat output:
Your Environment