Closed: rufreakde closed this issue 1 year ago
Hey @rufreakde thanks for filing this! Will try to dissect the problem first.
Randomly (depending on rolling restarts, I assume) the iptables rules of Linkerd proxies that route to other pods within the service mesh just fail with an error. The error is extremely similar to the following issue
Hm. Our iptables stack doesn't really handle routing to other pods. Unless iptables errors out, the assumption is that it does what it is supposed to do: redirect packets through the proxy. The issue you linked (https://github.com/linkerd/linkerd2/issues/6238#issuecomment-919058743) describes a different problem that some CNI implementations tend to have.
A CNI implementation that uses eBPF can sometimes take control of your routing at a much lower level; load balancing in these cases is done at TCP accept() time (i.e. socket-level load balancing). Unfortunately, there isn't really a good way to solve this: there's no way to interop with it at a higher level, so you're stuck with the decision the eBPF-based implementation made on your behalf. However, there is an easy way out. You can just disable this functionality and let Linkerd do the load balancing at a higher level in the stack.
I wrote about this in a doc page. The instructions apply to Cilium, but I imagine Calico has a similar setting. It's important to note that CNI implementations that do this are supposed to short-circuit the routing decision, with the main assumption being that kube-proxy is the one doing the routing. This shouldn't have an impact on other CNI features working well (i.e. you can safely disable it, afaik).
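If it helps, here's a minimal sketch of what that looks like with Cilium's Helm chart (assuming Cilium 1.11+ and the socketLB.hostNamespaceOnly setting described in the doc page; the release name and namespace below are placeholders):

```sh
# Hedged example: restrict Cilium's socket-level load balancing to the host
# namespace so pod-to-pod traffic keeps its original destination and the
# Linkerd proxy can do the load balancing itself.
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set socketLB.hostNamespaceOnly=true
```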
I'm not sure how to interpret your logs. What's the relationship between the connection failures and CNIs doing socket-level load balancing? Perhaps that's another avenue we can go down to understand what the problem is.
Edit: sorry for the close/re-open, I/O difficulties on my part :)
Seems like the default setting in calico is IPTables which afaik means calico isn't getting into the middle of load balancing decisions https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.CalicoNetworkSpec
This describes what calico would do in the event you wanted it to make load balancing decisions: https://docs.tigera.io/calico/latest/about/kubernetes-training/about-kubernetes-services#calico-ebpf-native-service-handling
Seems like the default setting in calico is IPTables which afaik means calico isn't getting into the middle of load balancing decisions https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.CalicoNetworkSpec
Unless BPF is used :) Which would be great to confirm here.
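A quick way to confirm (just a sketch, assuming Calico is managed by the Tigera operator and the Installation resource has the default name):

```sh
# Hedged example: print the configured Linux dataplane. "Iptables" (or empty,
# since that is the default) means no socket-level load balancing; "BPF" means
# Calico's eBPF dataplane is in play.
kubectl get installation default \
  -o jsonpath='{.spec.calicoNetwork.linuxDataplane}'
```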
This describes what calico would do in the event you wanted it to make load balancing decisions: https://docs.tigera.io/calico/latest/about/kubernetes-training/about-kubernetes-services#calico-ebpf-native-service-handling
Nice! I also like this blog post they've written: Calico eBPF data plane deep dive. I think it's a bit more in-depth and helps to demystify what is happening behind the scenes (although it is a longer read).
Anyway, eBPF is a bit of a digression. Perhaps you can help me out by letting me know why you think this looks similar to https://github.com/linkerd/linkerd2/issues/6238#issuecomment-919058743. Is it just the error that looks similar? That issue in particular was about BPF, hence why I'm making this association.
Proxy logs from your client/server would also be helpful here.
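Something along these lines is usually enough (pod names and namespaces are placeholders; the proxy container is named linkerd-proxy):

```sh
# Hedged example: grab the Linkerd proxy logs from both sides of a failing
# connection around the time of a rolling restart.
kubectl logs -n <client-namespace> <client-pod> -c linkerd-proxy --timestamps
kubectl logs -n <server-namespace> <server-pod> -c linkerd-proxy --timestamps
```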
Hi all, sorry for my late reply. I will need some time to read through the sources shared here. I will also need to check with our central infrastructure team regarding the Calico eBPF configuration. On our side we only use NetworkPolicies, and they work fine, so I need to recheck that.
I will also check out the doc, but I think the other information about Calico is more important.
For the logs: those were the logs for the proxy "server"; the proxy "client" logs I would need to try to find again. I will try to get everything ready after reading up. Thanks for all the help!
Okay, update from our side. We were able to identify the issue within the CNI plugin. So we switched to init containers and got a security exception for privileged containers. It seems the CNI plugin (at least on AWS) is not stable enough yet.
I was not able to get "other" logs. Most logs just show seemingly arbitrary 111 errors, nothing more.
@rufreakde thanks for coming back with an answer! So, as I understand, your issue has been solved?
We were able to identify the issue within the CNI plugin
Is there anything in particular that pointed to the CNI plugin being at fault? And for my understanding, by CNI plugin you mean linkerd's CNI plugin?
Yes, after disabling the CNI plugin, removing it from the cluster, and using init containers, the issues do not appear anymore (so far so good).
Yes, we used the Linkerd CNI plugin before. So now in our Helm chart we have cniEnabled=false.
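Roughly what that change looks like on our side (just a sketch; we're assuming the linkerd-control-plane Helm chart here, and release/namespace names will differ per setup):

```sh
# Hedged example: run the control plane with the CNI plugin disabled so the
# proxy-init init container sets up iptables in each pod instead.
helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
  --namespace linkerd \
  --reuse-values \
  --set cniEnabled=false
```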
@mateiidavid just in general, how we came to assume it was the CNI plugin:
Thanks for the information! Since you managed to find a workaround, I'll be closing this issue. Let us know if you want it re-opened.
What is the issue?
Randomly (depending on rolling restarts, I assume) the iptables rules of Linkerd proxies that route to other pods within the service mesh just fail with an error. The error is extremely similar to the following issue: https://github.com/linkerd/linkerd2/issues/6238#issuecomment-919058743
But we are using Calico, not Cilium.
The same setup works without Linkerd (linkerd.io/inject: disabled on the source/target pod): the connection is lost and reopened, and no request fails.
How can it be reproduced?
- create an AWS cluster
- install the Linkerd Helm charts
- wait until some "sink" pod gets a rolling-update restart; a pod that reaches this pod will then fail (not happening every time though) - a sketch of forcing such a restart is shown below
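A sketch of forcing that last step (deployment and namespace names are placeholders):

```sh
# Hedged example: trigger a rolling update of the "sink" deployment and watch
# whether calls from a meshed client start failing during the rollout.
kubectl rollout restart deployment/<sink-deployment> -n <namespace>
kubectl rollout status deployment/<sink-deployment> -n <namespace>
```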
Logs, error output, etc
Upstream (callee) pod log: sometimes a 111 error, sometimes it recovers, sometimes an outgoing error; not very consistent. But it started to happen when we migrated from the init container to the CNI plugin.
Downstream (sink) pod log: (parse_sni failed)
Output of linkerd check -o short
Environment
Possible solution
We saw this comment here: https://github.com/linkerd/linkerd2/issues/6238#issuecomment-919058743. But we are really not sure how it is meant, whether the configuration there is complete, or even whether this is the same issue. However, the problem with the 504 outgoing error appears consistently.
Any idea what we need to check to understand the problem in more detail? The flakiness is very problematic.
Additional context
No response
Would you like to work on fixing this bug?
no