andreev-io opened 5 months ago
We reproduced an identical issue in AKS. We created a cluster using managed Cilium:
```shell
az group create --name ilya-experiments --location uksouth
az network vnet create -g ilya-experiments --location uksouth --name ilya-vnet --address-prefixes 10.0.0.0/8 -o table
az network vnet subnet create -g ilya-experiments --vnet-name ilya-vnet --name nodesubnet --address-prefixes 10.240.0.0/16 -o none
az network vnet subnet create -g ilya-experiments --vnet-name ilya-vnet --name podsubnet --address-prefixes 10.241.0.0/16 -o none
az aks create -n ilya-experiments -g ilya-experiments -l uksouth --max-pods 250 --network-plugin azure --vnet-subnet-id /subscriptions/<>/resourceGroups/ilya-experiments/providers/Microsoft.Network/virtualNetworks/ilya-vnet/subnets/nodesubnet --pod-subnet-id /subscriptions/<>/resourceGroups/ilya-experiments/providers/Microsoft.Network/virtualNetworks/ilya-vnet/subnets/podsubnet --network-dataplane cilium
```
We installed Retina as described above and observed the same symptoms.
While we work to reproduce and root-cause this, any additional data points from others running on EKS/GKE and/or with Cilium are welcome.
@rbtr @anubhabMajumdar
Hey! Cilium uses the following code to load its programs on the ingress qdisc:
```go
func replaceQdisc(link netlink.Link) error {
	attrs := netlink.QdiscAttrs{
		LinkIndex: link.Attrs().Index,
		Handle:    netlink.MakeHandle(0xffff, 0),
		Parent:    netlink.HANDLE_CLSACT,
	}
	qdisc := &netlink.GenericQdisc{
		QdiscAttrs: attrs,
		QdiscType:  qdiscClsact,
	}
	return netlink.QdiscReplace(qdisc)
}
```
And Retina uses the following:
```go
qdiscIngress = &tc.Object{
	Msg: tc.Msg{
		Family:  unix.AF_UNSPEC,
		Ifindex: uint32(iface.Index),
		Handle:  helper.BuildHandle(0xFFFF, 0x0000),
		Parent:  tc.HandleIngress,
	},
	Attribute: tc.Attribute{
		Kind: "clsact",
	},
}
```
Wouldn't the conflict between the handle and parent of these two qdiscs explain the symptoms we are seeing here?
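To make the collision concrete, here is a small sketch. `makeHandle` is re-implemented here from the standard TC handle encoding (major in the upper 16 bits, minor in the lower 16), and the kernel constant `TC_H_INGRESS` is hard-coded; to my knowledge `TC_H_CLSACT` aliases it, and both `netlink.HANDLE_CLSACT` and `tc.HandleIngress` resolve to the same value, so both snippets describe the exact same qdisc identity:

```go
package main

import "fmt"

// makeHandle mirrors the standard TC handle encoding used by both
// netlink.MakeHandle and helper.BuildHandle: the major number occupies
// the upper 16 bits of the handle, the minor number the lower 16 bits.
func makeHandle(major, minor uint16) uint32 {
	return uint32(major)<<16 | uint32(minor)
}

func main() {
	// TC_H_INGRESS from the kernel's pkt_sched.h; TC_H_CLSACT aliases it.
	const tcHIngress uint32 = 0xFFFFFFF1

	ciliumHandle := makeHandle(0xffff, 0) // netlink.MakeHandle(0xffff, 0)
	retinaHandle := makeHandle(0xFFFF, 0) // helper.BuildHandle(0xFFFF, 0x0000)

	fmt.Printf("cilium: handle=0x%08X parent=0x%08X\n", ciliumHandle, tcHIngress)
	fmt.Printf("retina: handle=0x%08X parent=0x%08X\n", retinaHandle, tcHIngress)
	fmt.Println("same qdisc identity:", ciliumHandle == retinaHandle) // true
}
```

If both agents address the same qdisc, whichever one issues a replace last owns the attachment point.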
I encountered a similar issue in a Cilium environment. When retina-agent is restarted or shut down, the cluster's pod network becomes unreachable from the outside, and it was caused by the packetparser plugin. The environment configuration is as follows:
@andreev-io I think you are onto something. The issue seems to be caused by the Packetparser plugin and the way we set up the qdisc to observe traffic. Restarting the Cilium pods doesn't solve the issue; it requires uninstalling Retina and then restarting Cilium.
:wave: Cilium committer here
From taking a quick look at a cluster with Cilium and Retina, I think @andreev-io is spot on! We've had this issue before with Datadog's agent (cf. https://github.com/cilium/cilium/issues/21345). One solution is to (1) change Retina's BPF programs so they don't bypass subsequent BPF programs and (2) explicitly tell Cilium's BPF programs to run second. I've sent https://github.com/microsoft/retina/pull/276 for the first. The second requires running Cilium with `--bpf-filter-priority=2`.
This Linux Plumbers presentation by my colleague has more information on this issue and the longer-term upstream solution we're going for (given that more and more people are using BPF and need to play nicely together).
Note the solution I'm proposing assumes that you want to run Retina before Cilium. I'm making that assumption because I guess you want to see packets before Cilium has a chance to drop or mangle (e.g. NAT) them?
@pchaigno This is a wonderful discovery! I am going to watch the presentation you linked as soon as I have some time.
For our use case, we are interested in Retina running after Cilium, since we want to measure traffic after it's been filtered by Cilium and CiliumNetworkPolicies have been applied. How can you instruct tc to execute programs in a deterministic order? Is there a priority setting? I can open a PR to make this configurable for Retina.
I will defer to @pchaigno but I suspect we're going to be looking at replicating https://github.com/cilium/cilium/issues/17193.
This is all new to me, so I am wondering: since `TCFilterPriority` is a uint, is `1` (or maybe `0`) the highest possible priority? I'm thinking about it in the context of allowing the order of Cilium and Retina (or others) to be adjusted without having to modify the config of both processes. Is there a downside to, say, always deploying Cilium at priority `10`, so that programs could be added before or after it without also reconfiguring and restarting Cilium?
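As a toy model of the ordering (not kernel code; the real filter walk happens in the kernel's TC classifier code): on a given hook, filters run in ascending preference order, so pref 1 runs before pref 10. As far as I can tell, a preference of 0 asks the kernel to auto-assign one, so 1 is effectively the highest priority you can request explicitly.

```go
package main

import (
	"fmt"
	"sort"
)

// tcFilter models a TC filter attachment: a BPF program name plus the
// preference (priority) it was attached with.
type tcFilter struct {
	name string
	pref uint16
}

// runOrder returns program names in the order the kernel would run them:
// ascending preference, stable for equal preferences.
func runOrder(filters []tcFilter) []string {
	sorted := append([]tcFilter(nil), filters...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].pref < sorted[j].pref
	})
	names := make([]string, 0, len(sorted))
	for _, f := range sorted {
		names = append(names, f.name)
	}
	return names
}

func main() {
	// Pinning Cilium at pref 10 leaves room on both sides: other programs
	// can be attached before (pref < 10) or after (pref > 10) without
	// reconfiguring and restarting Cilium. Program names are illustrative.
	hook := []tcFilter{
		{name: "cil_from_container", pref: 10},
		{name: "retina_endpoint_ingress", pref: 1},
	}
	fmt.Println(runOrder(hook)) // [retina_endpoint_ingress cil_from_container]
}
```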
@pchaigno @andreev-io Thanks for bringing this up and helping us fix this quickly!
A couple of things we are thinking of doing to make Retina work with Cilium (and other eBPF-based services) out of the box:
Running after Cilium is going to be tricky because our BPF programs don't return TC_ACT_UNSPEC :sweat_smile: There are multiple reasons for that, in my opinion:
That being said, there's no reason that behavior couldn't be changed (likely behind a flag). I can't speak for all committers, but we're typically quite open to new contributions if there's a use case for them.
@pchaigno That's very interesting and insightful.
We are planning to use Retina in our cloud platform for network metering for billing purposes. We evaluated two other major options. Hubble doesn't work for us because it doesn't meet the requirement of collecting volumetric data (bytes per second and packets per second) with rich context (Kubernetes labels and annotations such as source/destination pod names); its metrics cover one or the other, but never both. Cilium custom calls don't work for us because our platform on Azure and GCP uses managed Cilium, where we don't have control over whether the custom-calls functionality is enabled (we have not actually tested this yet, since that approach would involve writing our own eBPF programs, and we want to first push Retina as far as possible).
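For concreteness, the shape of metering data we need combines both volumetric counters with pod context in a single record. The names below are purely illustrative, not Retina's or Hubble's actual data model:

```go
package main

import "fmt"

// flowKey carries the Kubernetes context we want attached to every
// counter. Illustrative only, not Retina's or Hubble's data model.
type flowKey struct {
	srcPod, dstPod string
}

// flowCounters holds both volumetric dimensions at once: bytes AND
// packets, which is the combination we could not get from Hubble metrics.
type flowCounters struct {
	bytes, packets uint64
}

// accumulate folds per-packet samples into per-flow totals, the kind of
// aggregation a billing pipeline would perform.
func accumulate(totals map[flowKey]flowCounters, key flowKey, bytes uint64) {
	c := totals[key]
	c.bytes += bytes
	c.packets++
	totals[key] = c
}

func main() {
	totals := map[flowKey]flowCounters{}
	key := flowKey{srcPod: "frontend-abc", dstPod: "backend-xyz"}
	accumulate(totals, key, 1500)
	accumulate(totals, key, 40)
	fmt.Printf("%s -> %s: %d bytes, %d packets\n",
		key.srcPod, key.dstPod, totals[key].bytes, totals[key].packets)
	// prints: frontend-abc -> backend-xyz: 1540 bytes, 2 packets
}
```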
My initial thought process as to where to put Retina in the packet processing pipeline relative to Cilium was that our metering should run after all filtering happens, for the obvious reason that we shouldn't bill for packets that get dropped by the Cilium dataplane.
However, you are correct that the packets we see after Cilium might not be sensible: they might be encapsulated, or simply invisible to us.
The more I think about this problem, the more I'm leaning toward metering before Cilium and then making sure the billing calculation is based only on connections that are certain to be correctly routed. Still, I'm very interested in your perspective on this problem. Is there a way to solve for this with Hubble?
@andreev-io I'm wondering if you consider this issue resolved by #276, which does fix Retina and Cilium being totally incompatible. If we consider this specific issue resolved, I think there are follow-up asks to Retina and Cilium to make the order in which they run customizable?
@andreev-io Thanks for the explanation and context! It helps and makes a lot of sense to me.
> The more I think about this problem, the more I'm leaning toward metering before Cilium and then making sure the billing calculation is based only on connections that are certain to be correctly routed. Still, I'm very interested in your perspective on this problem. Is there a way to solve for this with Hubble?
I don't think we have a solution ready for this in Cilium (though it's starting to be big enough that I could have missed it). It however sounds like something we could have solved with Tetragon. It's probably best to ask in the Cilium Slack to be sure to reach my Tetragon colleagues.
Hey @rbtr. I have been testing this quite extensively, and I don't think the problem is fully solved. I did a deep dive into both the Cilium and Retina code and experimented with both default and adjusted tc priority settings (as per @pchaigno's suggestion in their PR).
What I'm observing is that when you configure Cilium to run at a lower priority (e.g. 5 in my example below), sometimes both Cilium and Retina get installed correctly. More often than not, however, one overwrites the other. Most of the time, I observe that if I restart a Retina agent, the Cilium programs `from_container` and `from_netdev` get overwritten completely and disappear; vice versa, Cilium sometimes overwrites Retina's `endpoint_ingress`, `endpoint_egress`, `host_ingress`, and `host_egress`.
Occasionally a combination of programs gets loaded, but inconsistently. For example, below is one of my observations where Retina coexisted with Cilium on the ingress hook of a container veth and on the ingress hook of the eth0 interface, but did not get installed on the same veth's egress hook at all (Cilium is not expected to be there, since Cilium does not attach to the veth's egress hook by default):
```
[root@ip-192-168-46-54 /]# tc filter show dev lxc6032e3cd1c24 ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 endpoint_ingres direct-action not_in_hw id 4923 tag 7313acb249e3b164 jited
filter protocol all pref 5 bpf chain 0
filter protocol all pref 5 bpf chain 0 handle 0x1 cil_from_container-lxc6032e3cd1c24 direct-action not_in_hw id 5172 tag 2a31958ad4a95e7c jited
[root@ip-192-168-46-54 /]# tc filter show dev lxc6032e3cd1c24 egress
[root@ip-192-168-46-54 /]# tc filter show dev eth0 ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 host_ingress_fi direct-action not_in_hw id 4925 tag c67b49b0a12a5098 jited
filter protocol all pref 5 bpf chain 0
filter protocol all pref 5 bpf chain 0 handle 0x1 cil_from_netdev-eth0 direct-action not_in_hw id 5240 tag e8dd636632d29046 jited
```
The results of my experiments were non-deterministic, but I think you will reliably reproduce obviously inconsistent installation patterns if you restart Retina and Cilium pods one after another on a node of your choice.
I think you guys are going to need a good integration-testing strategy to make this work in a CNI-agnostic way. I know, for example, that Retina creates the clsact qdisc with the `NLM_F_EXCL` netlink flag, which is supposed to keep the qdisc in place if it already exists, but there is no guarantee that Cilium won't replace said qdisc upon startup (I'm not sure whether this would actually affect attached filters, but it might). Also important: Cilium's default BPF filter priority is 1, and that definitely causes Retina's filters to be overwritten, and vice versa. The BPF priority setting is relatively obscure and might not be available to users of managed Cilium. I think there needs to be a strategy for making Retina work with typical CNIs, some integration tests, and documentation on best practices around this topic.
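As a sketch of what such an integration test could check, here is a hypothetical helper (my own code, not from Retina or Cilium) that parses `tc filter show` output in the format shown in the listing above and asserts that both agents' programs coexist on a hook:

```go
package main

import (
	"fmt"
	"strings"
)

// attachedPrograms extracts BPF program names from `tc filter show`
// output, where lines of the form
//
//	filter protocol all pref 5 bpf chain 0 handle 0x1 <prog> direct-action ...
//
// carry the program name right after the handle value.
func attachedPrograms(tcOutput string) []string {
	var progs []string
	for _, line := range strings.Split(tcOutput, "\n") {
		fields := strings.Fields(line)
		for i, f := range fields {
			if f == "handle" && i+2 < len(fields) {
				progs = append(progs, fields[i+2])
				break
			}
		}
	}
	return progs
}

func main() {
	out := `filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 endpoint_ingres direct-action not_in_hw id 4923 tag 7313acb249e3b164 jited
filter protocol all pref 5 bpf chain 0
filter protocol all pref 5 bpf chain 0 handle 0x1 cil_from_container-lxc direct-action not_in_hw id 5172 tag 2a31958ad4a95e7c jited`

	progs := attachedPrograms(out)
	hasPrefix := func(prefix string) bool {
		for _, p := range progs {
			if strings.HasPrefix(p, prefix) {
				return true
			}
		}
		return false
	}
	// An integration test would assert both agents' programs coexist
	// after restarting either agent.
	fmt.Println("retina present:", hasPrefix("endpoint_"), "cilium present:", hasPrefix("cil_"))
}
```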
This issue is seen in both AKS and GCP. See notes for AKS at https://github.com/microsoft/retina/issues/252#issuecomment-2047419120
Describe the bug
Upon installation of Retina, connectivity can be lost for pods in a GKE cluster using managed Cilium.
To Reproduce
1. Go to create a standard GKE cluster.
2. Select the "Standard: You manage your cluster" option (see screenshot 1).
3. Specify GKE version 1.26.11-gke.105500 in the "No channel" channel selector (see screenshot 2). We suspect the issue would occur with other versions too, but we used a specific one for reproducibility.
   [Optional] Configure the cluster to run in one AZ with fewer nodes than the default to manage cost.
4. [Important] In the "Networking" configuration tab for the entire cluster, select "Enable Dataplane V2" to enable managed Cilium-powered networking.
5. Create the cluster and wait for all default pods in the cluster to come up.
6. Install Retina and wait for the agent pods to start. Note: if you are running a cluster with small nodes, you might need to manually edit the retina-agent DaemonSet to lower resource requests.
7. Identify metrics-server running in the kube-system namespace and check its logs. You will see error logs such as:
8. Identify the cluster IP and the endpoint IP:
9. Connect to another pod and check connectivity to these origins. You'll see that there is connectivity to the endpoint IP but not to the service IP.
Expected behaviour
No connectivity impact when installing Retina.
Screenshots
Step (2). Select "Standard: You manage your cluster".
Step (3). Select "No channel" when specifying the version, then specify version 1.26.11-gke.1055000.
Step (4). Select "Enable Dataplane V2" in the cluster network configuration tab.

Platform (please complete the following information): See steps to reproduce.

Additional context
N/A