microsoft / retina

eBPF distributed networking observability tool for Kubernetes
https://retina.sh
MIT License

Retina can break connectivity of pods to the Kubernetes Cluster IP in clusters using Cilium #252

Open andreev-io opened 5 months ago

andreev-io commented 5 months ago

This issue is seen in both AKS and GCP. See notes for AKS at https://github.com/microsoft/retina/issues/252#issuecomment-2047419120

Describe the bug
Upon installation of Retina, connectivity can be lost for pods in a GKE cluster using managed Cilium.

To Reproduce

  1. Begin creating a standard GKE cluster in the console.

  2. Select the Standard: You manage your cluster option (see screenshot 1).

  3. Specify GKE version 1.26.11-gke.105500 in the No channel channel selector (see screenshot 2). We suspect the issue would occur with other versions too, but we used a specific one for reproducibility.

  4. [Optional] Configure the cluster to run in one AZ with fewer nodes than the default to manage cost.

  5. [Important] In the Networking configuration tab for the entire cluster, select Enable Dataplane V2 to enable managed Cilium-powered networking.

  6. Create the cluster and wait for all default pods in the cluster to come up.

  7. Install Retina and wait for the agent pods to start.

    > VERSION=$( curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
    helm install retina oci://ghcr.io/microsoft/retina/charts/retina \
    --set namespace=kube-system \
    --version $VERSION \
    --namespace kube-system \
    --set image.tag=$VERSION \
    --set operator.tag=$VERSION \
    --set image.pullPolicy=Always \
    --set logLevel=info \
    --set operator.enabled=true \
    --set operator.enableRetinaEndpoint=true \
    --set enabledPlugin_linux="\[packetparser\]" \
    --set enablePodLevel=true \
    --set remoteContext=true

    Note: if you are running a cluster with small nodes, you might need to manually edit the retina-agent DaemonSet to lower resource requests. Wait until retina-agent pods start.

  8. Identify the metrics-server pod running in the kube-system namespace and check its logs. You will see error logs such as:

    E0409 15:21:23.378785       1 webhook.go:202] Failed to make webhook authorizer request: Post "https://10.114.192.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": context canceled
    E0409 15:21:23.378851       1 errors.go:77] Post "https://10.114.192.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": context canceled
  9. Identify the cluster IP and the endpoint IP:

    > kubectl get service
    NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
    kubernetes   ClusterIP   10.114.192.1   <none>        443/TCP   45m
    > kubectl get ep     
    NAME         ENDPOINTS        AGE
    kubernetes   10.128.0.7:443   45m
  10. Connect to another pod and check connectivity to both addresses. You'll see that there is connectivity to the endpoint IP but not to the cluster IP.

    > kubectl debug -ti --image="nixery.dev/shell/curl" kube-dns-ff4bbcc87-tvzm7 -n kube-system
    bash-5.2# curl https://10.114.192.1 -v -k
    ...
    bash-5.2# curl https://10.128.0.7 -v -k
    *   Trying 10.128.0.7:443...
    * Connected to 10.128.0.7 (10.128.0.7) port 443
    * ALPN: curl offers h2,http/1.1
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
    * TLSv1.3 (IN), TLS handshake, Request CERT (13):
    * TLSv1.3 (IN), TLS handshake, Certificate (11):
    * TLSv1.3 (IN), TLS handshake, CERT verify (15):
    * TLSv1.3 (IN), TLS handshake, Finished (20):
    * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
    * TLSv1.3 (OUT), TLS handshake, Certificate (11):
    * TLSv1.3 (OUT), TLS handshake, Finished (20):
    * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
    * ALPN: server accepted h2
    * Server certificate:
    *  subject: CN=34.173.138.225
    *  start date: Apr  9 14:52:44 2024 GMT
    *  expire date: Apr  8 14:54:44 2029 GMT
    *  issuer: CN=ca353e3b-048b-4feb-aa93-19a7c8a6aa89
    *  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
    * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
    * using HTTP/2
    * [HTTP/2] [1] OPENED stream for https://10.128.0.7/
    * [HTTP/2] [1] [:method: GET]
    * [HTTP/2] [1] [:scheme: https]
    * [HTTP/2] [1] [:authority: 10.128.0.7]
    * [HTTP/2] [1] [:path: /]
    * [HTTP/2] [1] [user-agent: curl/8.4.0]
    * [HTTP/2] [1] [accept: */*]
    > GET / HTTP/2
    > Host: 10.128.0.7
    > User-Agent: curl/8.4.0
    > Accept: */*
    > 
    * received GOAWAY, error=0, last_stream=1
    < HTTP/2 403 
    < audit-id: 2c7f6280-d595-4ddf-850f-abf1cadd85d8
    < cache-control: no-cache, private
    < content-type: application/json
    < x-content-type-options: nosniff
    < x-kubernetes-pf-flowschema-uid: 759447f6-3823-412a-86a3-09c764ef91eb
    < x-kubernetes-pf-prioritylevel-uid: 2707b41b-d15c-402a-a039-b0df8aff1c2d
    < content-length: 217
    < date: Tue, 09 Apr 2024 15:45:36 GMT
    < 
    {
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {},
    "status": "Failure",
    "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
    "reason": "Forbidden",
    "details": {},
    "code": 403
    * Closing connection
    * TLSv1.3 (OUT), TLS alert, close notify (256):

Expected behaviour
No connectivity impact when installing Retina.

Screenshots
Step (2). Select Standard: You manage your cluster. (screenshot)

Step (3). Select No channel when specifying the version, then specify version 1.26.11-gke.1055000. (screenshot)

Step (5). Select Enable Dataplane V2 in the cluster Networking configuration tab. (screenshot)

Platform (please complete the following information): See steps to reproduce.

Additional context: N/A

andreev-io commented 5 months ago

We reproduced an identical issue in AKS. We created a cluster using managed Cilium:

az group create --name ilya-experiments --location uksouth
az network vnet create -g ilya-experiments --location uksouth --name ilya-vnet --address-prefixes 10.0.0.0/8 -o table
az network vnet subnet create -g ilya-experiments --vnet-name ilya-vnet --name nodesubnet --address-prefixes 10.240.0.0/16 -o none
az network vnet subnet create -g ilya-experiments --vnet-name ilya-vnet --name podsubnet --address-prefixes 10.241.0.0/16 -o none
az aks create -n ilya-experiments -g ilya-experiments -l uksouth --max-pods 250 --network-plugin azure --vnet-subnet-id /subscriptions/<>/resourceGroups/ilya-experiments/providers/Microsoft.Network/virtualNetworks/ilya-vnet/subnets/nodesubnet --pod-subnet-id /subscriptions/<>/resourceGroups/ilya-experiments/providers/Microsoft.Network/virtualNetworks/ilya-vnet/subnets/podsubnet --network-dataplane cilium

We installed Retina as described above and observed the same symptoms.

rbtr commented 5 months ago

While we start to repro and root-cause this, any additional data points from others running on EKS/GKE and/or with Cilium are welcome.

andreev-io commented 5 months ago

@rbtr @anubhabMajumdar

Hey! Cilium uses the following code to create the clsact qdisc it attaches its programs to:

func replaceQdisc(link netlink.Link) error {
    attrs := netlink.QdiscAttrs{
        LinkIndex: link.Attrs().Index,
        Handle:    netlink.MakeHandle(0xffff, 0),
        Parent:    netlink.HANDLE_CLSACT,
    }

    qdisc := &netlink.GenericQdisc{
        QdiscAttrs: attrs,
        QdiscType:  qdiscClsact,
    }

    return netlink.QdiscReplace(qdisc)
}

And Retina uses the following:

qdiscIngress = &tc.Object{
        Msg: tc.Msg{
            Family:  unix.AF_UNSPEC,
            Ifindex: uint32(iface.Index),
            Handle:  helper.BuildHandle(0xFFFF, 0x0000),
            Parent:  tc.HandleIngress,
        },
        Attribute: tc.Attribute{
            Kind: "clsact",
        },
    }

Wouldn't the conflict in the qdisc handle and parent explain the symptoms we are seeing here?
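
As a rough illustration of why I think these collide, here is a minimal sketch (not code from either project; the helper below mirrors what I understand netlink.MakeHandle and helper.BuildHandle to do, and the parent value is the kernel's TC_H_INGRESS, which the headers also define as TC_H_CLSACT). Both snippets end up addressing the same qdisc identity, handle 0xffff0000 under parent 0xfffffff1, so the two agents are managing one and the same clsact qdisc rather than two independent ones:

package main

import "fmt"

// TC_H_INGRESS from <linux/pkt_sched.h>; TC_H_CLSACT is defined as an alias of it.
const tcHIngress uint32 = 0xfffffff1

// makeHandle mirrors netlink.MakeHandle / helper.BuildHandle: major in the
// upper 16 bits of the handle, minor in the lower 16 bits.
func makeHandle(major, minor uint16) uint32 {
	return uint32(major)<<16 | uint32(minor)
}

func main() {
	ciliumHandle := makeHandle(0xffff, 0) // netlink.MakeHandle(0xffff, 0)
	retinaHandle := makeHandle(0xffff, 0) // helper.BuildHandle(0xFFFF, 0x0000)
	fmt.Printf("cilium=%#x retina=%#x parent=%#x\n", ciliumHandle, retinaHandle, tcHIngress)
	// prints: cilium=0xffff0000 retina=0xffff0000 parent=0xfffffff1
}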

wenhuwang commented 5 months ago

I encountered a similar issue in a Cilium environment. When retina-agent is restarted or shut down, the cluster's pod network becomes unreachable from outside, and it was caused by the packetparser plugin. The environment configuration is as follows:

anubhabMajumdar commented 5 months ago

@andreev-io I think you are onto something. The issue seems to be caused by the packetparser plugin and the way we set up the qdisc to observe traffic. Restarting the Cilium pods doesn't solve the issue; it requires uninstalling Retina and then restarting Cilium.

pchaigno commented 5 months ago

:wave: Cilium committer here

From taking a quick look at a cluster with Cilium and Retina, I think @andreev-io is spot on! We've had this issue before, with Datadog's agent (cf. https://github.com/cilium/cilium/issues/21345). One solution is to (1) change Retina's BPF programs so they don't bypass subsequent BPF programs and (2) explicitly tell Cilium's BPF programs to run second. I've sent https://github.com/microsoft/retina/pull/276 for the first. The second requires running Cilium with --bpf-filter-priority=2.

This LinuxPlumbers presentation by my colleague has more information on this issue and the longer-term upstream solution we're going for (given more and more people are using BPF and need to play nice together).

Note the solution I'm proposing assumes that you want to run Retina before Cilium. I'm making that assumption because I guess you want to see packets before Cilium has a chance to drop or mangle (e.g. NAT) them?
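
For anyone following along, the "don't bypass subsequent BPF programs" part boils down to the verdict a direct-action tc classifier returns: TC_ACT_UNSPEC lets classification continue to the next filter in priority order, while terminal verdicts like TC_ACT_OK or TC_ACT_SHOT end it there. A minimal sketch of the relevant verdict values (constants reproduced here from <linux/pkt_cls.h> for illustration, not pulled from a library):

package main

import "fmt"

// tc action verdicts from <linux/pkt_cls.h>. A direct-action BPF classifier
// that returns TC_ACT_UNSPEC hands the packet to the next tc filter in
// priority order; terminal verdicts such as TC_ACT_OK or TC_ACT_SHOT end
// classification and bypass any filters attached after it.
const (
	tcActUnspec = -1 // TC_ACT_UNSPEC: no verdict, continue with the next filter
	tcActOK     = 0  // TC_ACT_OK: accept the packet, stop classification
	tcActShot   = 2  // TC_ACT_SHOT: drop the packet, stop classification
)

func main() {
	fmt.Println("UNSPEC:", tcActUnspec, "OK:", tcActOK, "SHOT:", tcActShot)
}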

andreev-io commented 5 months ago

@pchaigno This is a wonderful discovery! I am going to watch the presentation you linked as soon as I have some time.

For our use case, we are interested in Retina running after Cilium, since we want to measure traffic after it's filtered by Cilium and CiliumNetworkPolicies are applied. How can tc be instructed to execute programs in a deterministic order? Is there a priority? I can open a PR to make this configurable for Retina.

rbtr commented 5 months ago

I will defer to @pchaigno but I suspect we're going to be looking at replicating https://github.com/cilium/cilium/issues/17193.

This is all new to me, so I am wondering: since TCFilterPriority is a uint, is 1 (or maybe 0) the highest possible priority? I'm thinking about it in the context of allowing the order of Cilium and Retina (or others) to be adjusted without having to modify the config of both processes. Is there a downside to, say, always deploying Cilium at priority 10, so that programs could be added before or after it in priority without also reconfiguring and restarting Cilium?
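
For concreteness, here is a minimal, hypothetical sketch of attaching a BPF classifier at an explicit tc priority (using the vishvananda/netlink package as in the Cilium snippet above; the interface name, program FD, and filter name are placeholders, not Retina's actual code). Filters on the same hook are evaluated in ascending priority order, so a filter at priority 1 runs before one at priority 5; Cilium's --bpf-filter-priority flag adjusts the same value on its side:

package main

import (
	"log"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// attachAtPriority attaches an already-loaded BPF classifier (progFD) to the
// ingress hook of the given interface at an explicit tc priority.
func attachAtPriority(ifName string, progFD int, prio uint16) error {
	link, err := netlink.LinkByName(ifName)
	if err != nil {
		return err
	}
	filter := &netlink.BpfFilter{
		FilterAttrs: netlink.FilterAttrs{
			LinkIndex: link.Attrs().Index,
			Parent:    netlink.HANDLE_MIN_INGRESS, // ingress hook of the clsact qdisc
			Handle:    netlink.MakeHandle(0, 1),
			Protocol:  unix.ETH_P_ALL,
			Priority:  prio, // lower number = evaluated earlier
		},
		Fd:           progFD,
		Name:         "observer_ingress", // placeholder name
		DirectAction: true,
	}
	return netlink.FilterReplace(filter)
}

func main() {
	// Hypothetical usage: progFD would come from loading a BPF object; 42 is a placeholder.
	if err := attachAtPriority("eth0", 42, 1); err != nil {
		log.Fatal(err)
	}
}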

anubhabMajumdar commented 5 months ago

@pchaigno @andreev-io Thanks for bringing this up and helping us fix this quickly!

A couple of things we are thinking of doing to make Retina work with Cilium (and other eBPF-based services) out of the box:

pchaigno commented 5 months ago

Running after Cilium is going to be tricky because our BPF programs don't return TC_ACT_UNSPEC :sweat_smile: There are multiple reasons for that in my opinion:

That being said, there's no reason that behavior couldn't be changed (likely behind a flag). I can't talk for all committers, but we're typically quite open to new contributions if there's any use case for them.

andreev-io commented 5 months ago

@pchaigno That's very interesting and insightful.

We are planning to use Retina in our cloud platform for network metering for billing purposes. The other two major options we evaluated were Hubble, which doesn't work for us because it doesn't meet the requirement of collecting volumetric data (bytes-per-second and packets-per-second) with rich context (Kubernetes labels and annotations such as source/destination pod names) – its metrics cover one at a time but never both, and Cilium custom calls, which don't work for us because our platform in Azure and GCP uses managed Cilium where we don't have control over whether the custom calls functionality is enabled (we have not actually tested this yet, since this approach would involve writing our own eBPF programs, and we want to first push Retina as far as possible).

My initial thought process as to where to put Retina in the packet processing pipeline with respect to Cilium was that our metering should run after all filtering happens for the obvious reason that we shouldn't bill on packets that get dropped by the Cilium dataplane.

However, you are correct that the packets we would see after Cilium might not be meaningful, might be encapsulated, or might simply be invisible to us.

The more I think about this problem, the more I'm leaning toward metering before Cilium and then making sure the billing calculation is based only on connections that are certain to be correctly routed. Still, I'm very interested in your perspective on this problem. Is there a way to solve this with Hubble?

rbtr commented 5 months ago

@andreev-io I'm wondering whether you consider this issue resolved by #276, which does fix Retina and Cilium being totally incompatible. If we consider this specific issue resolved, I think there are follow-up asks to Retina and/or Cilium to make the order in which they run customizable.

pchaigno commented 5 months ago

@andreev-io Thanks for the explanation and context! It helps and makes a lot of sense to me.

The more I think about this problem, the more I'm leaning toward metering before Cilium and then making sure the billing calculation is based only on connections that are certain to be correctly routed. Still, I'm very interested in your perspective on this problem. Is there a way to solve this with Hubble?

I don't think we have a solution ready for this in Cilium (though Cilium is starting to be big enough that I could have missed it). It does, however, sound like something we could solve with Tetragon. It's probably best to ask in the Cilium Slack to be sure to reach my Tetragon colleagues.

andreev-io commented 5 months ago

Hey @rbtr. I have been testing this quite extensively, and I don't think the problem is fully solved. I did a deep dive into both Cilium and Retina code and played around with them with default tc priority settings and adjusted tc priority settings (as per @pchaigno's suggestion in their PR).

What I'm observing is that when you configure Cilium to run with lower priority (e.g. 5 in my example below), sometimes both Cilium and Retina get installed correctly. More often than not, however, one overwrites the other. Most of the time, I observe that if I restart a Retina agent, Cilium programs from_container and from_netdev get overwritten completely and disappear; vice versa, Cilium sometimes overwrites Retina's endpoint_ingress, endpoint_egress, host_ingress, host_egress.

Occasionally a combination of programs gets loaded, but inconsistently – for example, below is one of my observations where Retina coexisted with Cilium on the ingress hook of a container veth and on the ingress hook of the eth0 interface, but did not get installed on the same veth's egress hook at all (Cilium is not expected to be there since Cilium does not attach to the veth's egress hook by default):

[root@ip-192-168-46-54 /]# tc filter show dev lxc6032e3cd1c24 ingress
filter protocol all pref 1 bpf chain 0 
filter protocol all pref 1 bpf chain 0 handle 0x1 endpoint_ingres direct-action not_in_hw id 4923 tag 7313acb249e3b164 jited 
filter protocol all pref 5 bpf chain 0 
filter protocol all pref 5 bpf chain 0 handle 0x1 cil_from_container-lxc6032e3cd1c24 direct-action not_in_hw id 5172 tag 2a31958ad4a95e7c jited 
[root@ip-192-168-46-54 /]# tc filter show dev lxc6032e3cd1c24 egress
[root@ip-192-168-46-54 /]# tc filter show dev eth0 ingress
filter protocol all pref 1 bpf chain 0 
filter protocol all pref 1 bpf chain 0 handle 0x1 host_ingress_fi direct-action not_in_hw id 4925 tag c67b49b0a12a5098 jited 
filter protocol all pref 5 bpf chain 0 
filter protocol all pref 5 bpf chain 0 handle 0x1 cil_from_netdev-eth0 direct-action not_in_hw id 5240 tag e8dd636632d29046 jited 

The results of my experiments were non-deterministic, but I think you will reliably reproduce obviously inconsistent installation patterns if you restart Retina and Cilium pods one after another on a node of your choice.

I think you guys are going to need a good integration-testing strategy to make this work in a CNI-agnostic way. I know, for example, that Retina creates the clsact qdisc with the NLM_F_EXCL netlink flag, which is supposed to keep the qdisc in place if it already exists, but there is no guarantee that Cilium won't replace that qdisc on startup (I'm not sure whether this would actually affect attached filters, but it might). What is also important is that Cilium's default BPF priority is 1, and that definitely causes Retina's filters to be overwritten, and vice versa. The BPF priority setting is relatively obscure and might not be available to users of managed Cilium. I think there needs to be a strategy for making Retina work with typical CNIs, some integration tests, and documentation on best practices around this topic.
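
To make the add-versus-replace distinction above concrete, here is a minimal sketch (assuming the vishvananda/netlink package; this is not the actual code of either project): QdiscAdd sends NLM_F_CREATE|NLM_F_EXCL and fails with EEXIST if a qdisc is already installed at that handle/parent, leaving the existing one and its filters alone, whereas QdiscReplace sends NLM_F_REPLACE and swaps in a new qdisc regardless of what was there.

package main

import (
	"errors"
	"log"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// ensureClsact installs a clsact qdisc on the interface only if one is not
// already present. QdiscAdd uses NLM_F_CREATE|NLM_F_EXCL, so it returns
// EEXIST when another agent has already created the qdisc; QdiscReplace, by
// contrast, uses NLM_F_REPLACE and swaps the qdisc in unconditionally.
func ensureClsact(ifName string) error {
	link, err := netlink.LinkByName(ifName)
	if err != nil {
		return err
	}
	qdisc := &netlink.GenericQdisc{
		QdiscAttrs: netlink.QdiscAttrs{
			LinkIndex: link.Attrs().Index,
			Handle:    netlink.MakeHandle(0xffff, 0),
			Parent:    netlink.HANDLE_CLSACT,
		},
		QdiscType: "clsact",
	}
	if err := netlink.QdiscAdd(qdisc); err != nil {
		if errors.Is(err, unix.EEXIST) {
			return nil // someone else already set up clsact; leave it (and its filters) in place
		}
		return err
	}
	return nil
}

func main() {
	if err := ensureClsact("eth0"); err != nil {
		log.Fatal(err)
	}
}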