projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Switching to eBPF results in calico-kube-controller not being able to reach datastore - localhost:6443 - connection refused #9141

Open BloodyIron opened 1 month ago

BloodyIron commented 1 month ago

Expected Behavior

I believe the expected behaviour is that the calico-kube-controllers (single pod but that's the name it has) should be able to contact the Kubernetes API Server in the same way that calico-node pods can.

Current Behavior

calico-node and calico-typha pods can connect to the kubernetes API via localhost:6443 successfully, but calico-kube-controllers (the name of a singular pod in this case) gets connection refused for reasons not yet clear.

Possible Solution

I am not sure what the solution is just yet, as I have been following the documentation.

Steps to Reproduce (for bugs)

Following documentation here: https://docs.tigera.io/calico/3.27/operations/ebpf/enabling-ebpf#configure-calico-to-talk-directly-to-the-api-server And using the Calico provisioned by Rancher when creating the cluster, self-hosted, RKE2, on Ubuntu VMs, no public cloud present at all.

  1. Disable kube-proxy before following the eBPF documentation linked above. On each node in the cluster, use a daemonset to put a file (99-rke2-customisations.yaml) in /etc/rancher/rke2/config.yaml.d/ (documentation here: https://docs.rke2.io/install/configuration#configuration-file ), and in that daemonset remove the file "/var/lib/rancher/rke2/agent/pod-manifests/kube-proxy.yaml" if detected so that the kube-proxy pods don't come back. The customisations file has the declaration "disable-kube-proxy: "true"" as per the RKE2 documentation.
  2. Create the ConfigMap described in the "Operator" step for preparing for eBPF, with Host=localhost and Port=6443 (see the sketch after this list). Watch all the calico-node pods reinit with the new API config environment variables and work successfully.
  3. Then patch "kind: Installation" with "name: default" to add spec.calicoNetwork "linuxDataplane: BPF", which then gives us operational eBPF capabilities (and in turn correct SourceIP values).
  4. Then patch "kind: FelixConfiguration" with "name: default" to add the spec values "bpfKubeProxyIptablesCleanupEnabled: true", "featureDetectOverride: "ChecksumOffloadBroken=false"", "bpfExternalServiceMode: "DSR"", & "xdpEnabled: true". (I have tried with and without these Felix customisations; it does not seem to impact the behaviour of calico-kube-controllers.)
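
For reference, a rough sketch of what steps 2-4 look like as manifests, assuming the operator-managed install that Rancher/RKE2 ships (ConfigMap in the tigera-operator namespace); field names follow the Calico docs linked above, and the FelixConfiguration uses the projectcalico.org/v3 API, so apply it however you normally patch Calico resources:

```yaml
# Step 2: tell Calico components how to reach the API server directly.
kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: tigera-operator
data:
  KUBERNETES_SERVICE_HOST: "localhost"   # the value under discussion in this issue
  KUBERNETES_SERVICE_PORT: "6443"
---
# Step 3: switch the dataplane to eBPF.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    linuxDataplane: BPF
---
# Step 4: optional Felix tuning that was tried with and without.
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  bpfKubeProxyIptablesCleanupEnabled: true
  bpfExternalServiceMode: DSR
  featureDetectOverride: "ChecksumOffloadBroken=false"
  xdpEnabled: true
```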

Context

So when I look at the YAML for calico-kube-controllers, it is given the same environment variables KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT that the calico-node & calico-typha pods are given. The calico-kube-controllers pod has an IP of 10.42.790.200 (and changes with re-init of course). And I am using all "default" CIDR configurations that RKE2 delivers, so no CIDR customisations have been made by me. UFW and AppArmor are turned off on the Ubuntu hosts.

When I instead use the VIP of 10.43.0.1 for KUBERNETES_SERVICE_HOST, this prevents a rebooted k8s node from fully recovering, as nothing can actually talk to the cluster. Namely, calico-node pods cannot reach that VIP after a node reboot, so nothing comes up. I was only able to correct this when using "localhost" or "127.0.0.1" (though localhost might be preferable due to contextual agility when used).

Your Environment

==== The error that calico-kube-controller spits out is:

[INFO][1] main.go 131: Ensuring Calico datastore is initialized
2024-08-16T16:09:33.924977314Z 2024-08-16 16:09:33.924 [ERROR][1] client.go 295: Error getting cluster information config ClusterInformation="default" error=Get "https://localhost:6443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp [::1]:6443: connect: connection refused
2024-08-16 16:09:33.924 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://localhost:6443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp [::1]:6443: connect: connection refused
2024-08-16T16:09:38.935878325Z 2024-08-16 16:09:38.934 [ERROR][1] client.go 295: Error getting cluster information config ClusterInformation="default" error=Get "https://localhost:6443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp [::1]:6443: connect: connection refused

==== I am unsure what I should be doing here as I have followed the documentation, and have not found any relevant resources online on what I can do about this. So I really would appreciate help on this matter.

BloodyIron commented 1 month ago

Also this is what it looks like when I curl localhost:6443 on one of the nodes, at the Ubuntu OS level:

curl --insecure https://localhost:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}

So I don't really believe any firewall aspects are blocking 6443 on the nodes.

caseydavenport commented 1 month ago

but calico-kube-controllers (the name of a singular pod in this case)

Just as a side-note, this is because the single pod contains multiple distinct controllers running within it :smile:

I think the main issue here is that "localhost" resolves to a different address for different pods. calico/node and calico/typha run in the host network namespace, so localhost for them will resolve to the actual node's localhost.

However, kube-controllers runs with its own network namespace (i.e., hostNetwork: false), and so localhost within that pod will resolve to the pod's local IP, and traffic won't go to the node itself.

Do you have some sort of proxy running on localhost? Or is this just a single node cluster with the API server running on the node? In the latter case, you probably can just use the real IP of the node hosting the API server instead of using localhost.
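
One quick way to see the namespace difference (a rough sketch; the calico-system namespace and k8s-app labels below assume an operator-managed install like the one RKE2 ships):

```bash
# Pod IPs: host-networked pods (calico-node, calico-typha) show the node's IP,
# while calico-kube-controllers shows an IP from the pod CIDR.
kubectl get pods -n calico-system -o wide

# spec.hostNetwork is only populated (true) for host-networked pods:
kubectl get pod -n calico-system -l k8s-app=calico-node \
  -o jsonpath='{.items[0].spec.hostNetwork}'               # -> true
kubectl get pod -n calico-system -l k8s-app=calico-kube-controllers \
  -o jsonpath='{.items[0].spec.hostNetwork}'               # -> empty, i.e. false
```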

BloodyIron commented 1 month ago

but calico-kube-controllers (the name of a singular pod in this case)

Just as a side-note, this is because the single pod contains multiple distinct controllers running within it 😄

I think the main issue here is that "localhost" resolves to a different address for different pods. calico/node and calico/typha run in the host network namespace, so localhost for them will resolve to the actual node's localhost.

However, kube-controllers runs with its own network namespace (i.e., hostNetwork: false), and so localhost within that pod will resolve to the pod's local IP, and traffic won't go to the node itself.

Do you have some sort of proxy running on localhost? Or is this just a single node cluster with the API server running on the node? In the latter case, you probably can just use the real IP of the node hosting the API server instead of using localhost.

I'm going to do my best to answer these and future questions, but there may be areas where I think I know what's going on, but might not be 100% accurate. I am still pretty green behind the ears for k8s, and I am using Rancher (v2.7.6) to provision RKE2 nodes for this test cluster, just to be clear. So feel free to correct me wherever you see fit :) So much more to learn! But...

  1. I used localhost because it seemed to get me the "farthest" vs the actual VIP for Kubernetes API (10.43.0.1), but I can completely appreciate "localhost" depends on context, and I don't YET have any worker-only nodes in said test cluster.
  2. This cluster is 3x VMs running RKE2 on each, so 3x node cluster, all with etcd/control-plane/worker roles just to have all the quorums (bigger numbers of quorums is better, right? ;P)
  3. So far as I KNOW I do not have any proxy on localhost. What I can say is that my desired intent is to have kube-proxy disabled (which I am successful at doing) and to use eBPF in that mode. And in my testing (and the referenced documentation) it seems that disabling kube-proxy before defining the "name: kubernetes-services-endpoint" ConfigMap (which sets the "localhost"/IP/port/whatever) is the successful order of operations, suggesting no proxy is at play (I THINK???).
  4. I believe I saw api-server pods on all 3x of the nodes in this cluster, but I don't know yet if that will be the case for any worker-only nodes I add in the future (haven't gotten that far in development yet, sorry!) So I have a hunch "localhost" might not be what I want in the long-term.
  5. You might have missed me mentioning above that I've tried with the VIP for Kubernetes API (10.43.0.1) and that being very unsuccessful for me, for reasons I can't see (with and without kube-proxy present, by the way, if I remember accurately).
  6. I do appreciate your engagement and help here, so thanks!
  7. I am looking to upgrade my rancher to v2.8.5 and then the test k8s nodes to uhh... 1.28? (I forget the exact version right now) as I have whispers in the wind that might magically solve my "can't reach VIP" problem.
caseydavenport commented 1 month ago

I used localhost because it seemed to get me the "farthest" vs the actual VIP for Kubernetes API

Gotcha, yep this makes sense but will break down even more once you add nodes that are worker-only (and in fact, breaks down even before that as evidenced by this issue!).

So far as I KNOW I do not have any proxy on local host

Yep, I think my question was answered - if you haven't set up anything explicitly to redirect localhost:6443->apiserver:6443 on another host, which it sounds like you haven't.

believe I saw api-server pods on all 3x of the nodes in this cluster, but I don't know yet if that will be the case for any worker-only nodes I add in the future

They won't be - the apiserver is a control-plane only component and won't run on worker nodes. However, services on the worker nodes (including kube-proxy when enabled, kubelet, and Calico) all need to communicate with the apiserver.

You might have missed me mentioning above that I've tried with the VIP for Kubernetes API

Yeah, I am not too surprised this doesn't work. Typically I would expect to point KUBERNETES_SERVICE_HOST to the address of a load balancer fronting the API server (I suspect you have one, e.g., the address your local kubectl uses to reach the apiservers?)

tomastigera commented 1 month ago

You can use one of the api server endpoints in kubectl get endpoints kubernetes

caseydavenport commented 1 month ago

You can use one of the api server endpoints in kubectl get endpoints kubernetes

Yep, this is a decent stopgap but it won't provide redundancy in the event of that particular API server pod failing / being upgraded, nor will it handle the IP address of that particular node changing.
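
For reference, that command looks something like this (the endpoint addresses shown are made-up placeholders, not values from this cluster):

```bash
# The "kubernetes" Service's Endpoints list the real API server addresses,
# i.e. candidates for KUBERNETES_SERVICE_HOST.
kubectl get endpoints kubernetes
# NAME         ENDPOINTS                                                  AGE
# kubernetes   192.168.10.11:6443,192.168.10.12:6443,192.168.10.13:6443   30d
```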

BloodyIron commented 1 month ago

I used localhost because it seemed to get me the "farthest" vs the actual VIP for Kubernetes API

Gotcha, yep this makes sense but will break down even more once you add nodes that are worker-only (and in fact, breaks down even before that as evidenced by this issue!).

So far as I KNOW I do not have any proxy on local host

Yep, I think my question was answered - if you haven't set up anything explicitly to redirect localhost:6443->apiserver:6443 on another host, which it sounds like you haven't.

believe I saw api-server pods on all 3x of the nodes in this cluster, but I don't know yet if that will be the case for any worker-only nodes I add in the future

They won't be - the apiserver is a control-plane only component and won't run on worker nodes. However, services on the worker nodes (including kube-proxy when enabled, kubelet, and Calico) all need to communicate with the apiserver.

You might have missed me mentioning above that I've tried with the VIP for Kubernetes API

Yeah, I am not too surprised this doesn't work. Typically I would expect to point KUBERNETES_SERVICE_HOST to the address of a load balancer fronting the API server (I suspect you have one, e.g., the address your local kubectl uses to reach the apiservers?)

Right now all 3x of the nodes in the cluster are etcd/control-plane/worker though. Just to clarify ;)

BloodyIron commented 1 month ago

You can use one of the api server endpoints in kubectl get endpoints kubernetes

How do I:

  1. Make it so that it auto-updates as control-plane nodes are added/removed from the cluster?
  2. Make it tolerant of any of the control-plane nodes eating dirt/rebooting/taking a lunch break/staring into an eclipse/something else?

(Directed at anyone:) To me, the point of the 10.43.0.1 VIP... wasn't it supposed to be the logical "loadbalanced" IP serving exactly this function, i.e. tolerant of endpoint shenanigans? Which is why it seemed the logical IP to use, yet... doesn't work? (And I know this is a default-generated IP for my cluster; I'm not married to this specific IP, but automation is nice.)

caseydavenport commented 1 month ago

To me, the point of the 10.43.0.1 VIP... wasn't it supposed to be the logical "loadbalanced" IP serving exactly this function, i.e. tolerant of endpoint shenanigans?

It is. However, when using eBPF Calico, Calico becomes responsible for programming that VIP so that it works. So, Calico can't rely on the VIP that it itself is programming.

Where are you running kubectl from? Are you accessing this cluster by SSHing into the nodes? Typically a cluster will have a public IP address associated with its API that is used for external access (i.e., a cloud LoadBalancer) - this should handle nodes being added/removed, as well as rolling updates, etc.

BloodyIron commented 1 month ago

To me, the point of the 10.43.0.1 VIP... wasn't it supposed to be the logical "loadbalanced" IP serving exactly this function, i.e. tolerant of endpoint shenanigans?

It is. However, when using eBPF Calico, Calico becomes responsible for programming that VIP so that it works. So, Calico can't rely on the VIP that it itself is programming.

Where are you running kubectl from? Are you accessing this cluster by SSHing into the nodes? Typically a cluster will have a public IP address associated with its API that is used for external access (i.e., a cloud LoadBalancer) - this should handle nodes being added/removed, as well as rolling updates, etc.

I'm generally avoiding any sort of manual kubectl at all. I always try to seek a method that ArgoCD can apply via YAML manifests so that any changes I want to make are defined IaC-style. So disabling kube-proxy, for example, uses a daemonset that drops a yaml file in a location RKE2 looks for, which declares an Environment Variable to disable kube-proxy, and also removes another manifest file for kube-proxy in another folder.

As for the "name: kubernetes-services-endpoint" that's a "kind: ConfigMap" manifest that ArgoCD applies to the cluster.

ArgoCD compares running state to a GitLab repo, by the way.

I try to avoid getting manually onto the nodes, but for when I have to, it's via SSH. Or for kubectl stuff I use the Rancher's webGUI to get me to kubectl for the cluster (and I don't typically need to care where kubectl runs, so long as it can reach the relevant cluster, which it normally can).

This is all self-hosted by the way, and I do believe I have things configured to go through Rancher for when ArgoCD interacts with the cluster. I intentionally did not configure an endpoint (unsure if this is Kubernetes API or not, to be clear) as I wanted Rancher to manage that Access Control/RBAC stuff.

As for inbound traffic, I'm using MetalLB in Layer 2 ARP mode handling a single LAN IP for inbound traffic, but I'm quite confident that doesn't overlap with where I'm stuck.

So again, all on-prem, my infra, no hosted cloud, nothing like that. ;) This is by design and I have no interest in doing any of this in any hosted infra cloud or otherwise.

caseydavenport commented 1 month ago

I'm generally avoiding any sort of manual kubectl at all.

Right - I was less interested in kubectl the tool in particular, and more interested in learning how entities outside of your cluster access the API (if at all).

Ultimately what you need is a stable IP address that routes to your API server pod(s). Given you are running your own on-prem cluster, it's sort of up to you how you configure that!

tomastigera commented 1 month ago

Or if you have a way to resolve DNS from the hosts, you can use a domain name in KUBERNETES_SERVICE_HOST that maps to multiple IPs for HA. I think Azure does it that way.
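
For example (purely illustrative name and addresses, not from this cluster) - any name the hosts themselves can resolve works:

```bash
# A name resolvable via the site DNS (not cluster DNS) that returns every
# control-plane address.
dig +short k8s-api.lab.internal
# 192.168.10.11
# 192.168.10.12
# 192.168.10.13
```

That name would then go into KUBERNETES_SERVICE_HOST in the kubernetes-services-endpoint ConfigMap.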

BloodyIron commented 1 month ago

I'm generally avoiding any sort of manual kubectl at all.

Right - I was less interested in kubectl the tool in particular, and more interested in learning how entities outside of your cluster access the API (if at all).

Ultimately what you need is a stable IP address that routes to your API server pod(s). Given you are running your own on-prem cluster, it's sort of up to you how you configure that!

Of course it's up to me, but I have no idea how I should configure it, as the official documentation for eBPF in Calico (specifically for RKE2, agnostic of where it runs, by the way) isn't producing the results described in said documentation.

I do see that 10.43.0.1 exists in the cluster, but using that instead of localhost or 127.0.0.1 works "worse" as all Calico aspects fail to re-init when using 10.43.0.1.

So I really do need more help on the matter.

BloodyIron commented 1 month ago

Or if you have a way to resolve DNS from the hosts, you can use a domain name in KUBERNETES_SERVICE_HOST that maps to multiple IPs for HA. I think Azure does it that way.

I'm not opposed to something like that, however I would want some sort of method that auto-updates said DNS entry as nodes (in this case control-plane) are added/removed from the cluster. Shouldn't some of the internal FQDNs work for this function?

It's also challenging for me to determine if the relevant pods care about internal (within the k8s cluster) DNS resolution for this aspect, or require external (outside the cluster, but on the VM itself maybe) resolution for such a method to work?

I'm trying to keep the cluster as self-sustaining, and automated, as possible.

And to be clear to both @caseydavenport and @tomastigera I do REALLY appreciate the engagement and help here, even if we haven't quite found a solution yet. So thank you for that ❤️

BloodyIron commented 1 month ago

Also, isn't Calico capable of doing this kube-apiserver loadbalancing internally or something that we "need" here? I'm trying to keep my details straight, but it is challenging...

caseydavenport commented 1 month ago

Also, isn't Calico capable of doing this kube-apiserver loadbalancing internally

The problem here is bootstrapping - Calico needs to be able to talk to the apiserver in order to learn the information necessary to do this.

So, Calico can't just magically detect where the API server is when it doesn't have access to the API. Something needs to tell Calico where the API server is so that it can set up that load balancing for other pods.

BloodyIron commented 1 month ago

Also, isn't Calico capable of doing this kube-apiserver loadbalancing internally

The problem here is bootstrapping - Calico needs to be able to talk to the apiserver in order to learn the information necessary to do this.

So, Calico can't just magically detect where the API server is when it doesn't have access to the API. Something needs to tell Calico where the API server is so that it can set up that load balancing for other pods.

And is this from an internal-to-cluster perspective, or external-to-cluster perspective? (as in where I "want" it connecting to).

I ask because a few things I've discovered may be relevant.

  1. I tried setting KUBERNETES_SERVICE_HOST to "kubernetes.default.svc.cluster.local", and the cni-installer container for calico-node reached out to my internet gateway (way outside the k8s cluster) for DNS resolution, and the logs produced "no such host" (mentioning the IP of the gateway:53). Soooo I'm not sure if this is demonstrating more of the catch-22 stuff or what... How is it that when I use 127.0.0.1/localhost for KUBERNETES_SERVICE_HOST all the pods work (except calico-kube-controllers)? Is it because those working pods init from an external connection context?
  2. A rando online solution, in I think a similar scenario, mentioned their fix was creating an Ingress object pointing at the Kubernetes Service object. And considering I'm using MetalLB in Layer 2 ARP mode, I was seriously considering trying this method, and then stuffing whatever FQDN the Ingress would use into the KUBERNETES_SERVICE_HOST value. However I'm unsure what FQDN I should use, and how well this would behave when all nodes are rebooted (nicely or not).

I understand that it's far more typical to do k8s stuff on hosted/cloud infra, but self-hosted is really a requirement for the areas I work in, and hosted/cloud really isn't an acceptable option. So in my development of my whole cluster adventure, this last bit with Calico is like the last 1% remaining on the work.

I really hope we can figure something out, because I do quite like what I see in Calico, and I really am trying to do my best to read lots, and listen lots. And again thanks for responding and responding so rapidly. :) I apologise in advance if I come across as unappreciative or gruff in any way, past, present, or future. I'm both frustrated, and excited to get this sorted. ❤️

BloodyIron commented 1 month ago

Also... considering my circumstance... should I be enabling eBPF before (or in-parallel with) declaring the KUBERNETES_SERVICE_HOST and _PORT? (And what about disabling kube-proxy? the test above kube-proxy was running at the time)

BloodyIron commented 1 month ago

Hey so some new information...

  1. I thought it was a perfect time to try something really "stupid" because this is a testing cluster. So I started with KUBERNETES_SERVICE_HOST set to localhost, then I patched "kind: Installation"/"name: default" with "linuxDataplane: BPF" to enable eBPF, and patched "kind: FelixConfiguration" with cleaning up kube-proxy, the ChecksumOffloadBroken=false flag, DSR, and enabling XDP. I then disabled kube-proxy (don't want to fully explain the method in this response, sorry). Then everything was working, except calico-kube-controllers was still unhappy (which I expected). I tried hard-resetting one of my nodes to see how recovery would work, and things came up "just fine" (again, except for calico-kube-controllers). Once I felt like it was working "enough" I changed KUBERNETES_SERVICE_HOST to one of the LAN IPs for one of the k8s nodes (as in an IP external to the cluster) and uhhhh.... EVERYTHING WORKED (including calico-kube-controllers). I even tried brutally hard-restarting the same k8s node to see how recovery behaved and... uhhh... everything came back up, nothing is broken once init finished.... I don't think this is the "permanent" way I want this to run, but yeah, pretty sure this falls in line with what's been presented above. Yikes!
  2. I'm probably going to try rolling out kube-vip for the external api-server VIP loadbalancing stuff, as the below referenced reddit thread has multiple people singing the praise in ways that look generally identical to what I'm doing.

Reddit thread: https://www.reddit.com/r/kubernetes/comments/1epn6jo/best_approach_to_expose_an_onpremise_k8s_cluster/

BloodyIron's optimism grows.

caseydavenport commented 1 month ago

I tried setting KUBERNETES_SERVICE_HOST to "kubernetes.default.svc.cluster.local"

DNS resolution using cluster DNS won't work until Calico is running - another part of the "catch 22" situation. I think the CNI plugin is using your node's DNS configuration, which is how it ended up hitting your gateway.

How is it that when I use 127.0.0.1/localhost for KUBERNETES_SERVICE_HOST all the pods work (except calico-kube-controllers)? Is it because those working pods init from an external connection context?

calico-node, calico-typha and the CNI plugin all run on the host - localhost resolves to the IP address of the API server running on that node. You should see this if you kubectl get pods -o wide to view those pods' IP addresses - they should match the apiserver address.

calico-kube-controllers does not run in the host's network namespace. localhost resolves to its own IP address, which is not the same as the API server address.

And is this from an internal-to-cluster perspective, or external-to-cluster perspective? (as in where I "want" it connecting to).

I'm not entirely sure I understand this question, but Calico wants to connect to an IP address or domain name that can successfully reach the API server without relying on cluster networking being up (i.e., cluster DNS, cluster service implementation, etc), because Calico is responsible for bootstrapping all of those things.

I apologise in advance if I come across as unappreciative or gruff in any way, past, present, or future. I'm both frustrated, and excited to get this sorted.

Not at all :)

Also... considering my circumstance... should I be enabling eBPF before (or in-parallel with) declaring the KUBERNETES_SERVICE_HOST and _PORT? (And what about disabling kube-proxy? the test above kube-proxy was running at the time)

I would configure the env vars, and disable kube-proxy prior to launching Calico altogether.

To the best of my knowledge, you want something like one of the below options:

caseydavenport commented 1 month ago

I don't think this is the "permanent" way I want this to run, but yeah pretty sure this falls in-line with what's been presented above. Yikes!

Glad you got something working!

I think the main takeaway here is using KUBERNETES_SERVICE_HOST=<real IP of control-plane node> works because your nodes can reach that address even before Calico is installed, and it works for calico-kube-controllers because it's a "real" IP in the network and not a resolution that varies based on where it is being checked.

And frankly, this should be fine - the only downside really is that you're not getting the full benefit of running multiple API servers (i.e., if you need to do a rolling update, there won't be failover and Calico will just wait for the node you specified to come back online).

You probably don't need to go through the intermediate steps of using localhost and restarting things, though.
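
For anyone following along, the working configuration lands on something like this (the address is a placeholder for one of the control-plane node LAN IPs, and the namespace assumes the operator-managed install), with the caveat above that a single node IP has no failover:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: tigera-operator
data:
  # Placeholder: the LAN IP of one control-plane node (what ended up working
  # here). A stable VIP (e.g. one managed by kube-vip) or a host-resolvable
  # DNS name avoids the single-node failover caveat described above.
  KUBERNETES_SERVICE_HOST: "192.168.10.11"
  KUBERNETES_SERVICE_PORT: "6443"
```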

BloodyIron commented 1 month ago
  1. Thanks for clarifying on the network namespace aspect for calico-kube-controllers. I suspected it was something like that, but wasn't having a good time finding consistent evidence to support/refute that.
  2. I'm using Rancher to provision and manage the RKE2-based clusters, so there's really no state where I am "prior to launching Calico altogether". As for specific order-of-operations in provisioning the cluster and such, well I think I have an idea how that might look, but I'm not yet sure, and that's why I'm taking far too many notes and leveraging Argo+GitLab so I can go forwards/backwards in time to work towards having that fully known (order of operations for provisioning, etc). The "localhost" aspect I wasn't really considering as intermediary for provisioning, but more for the purpose of experimentation and learning. Trying that value, watching the outcome, poking the things in sensitive spots with pointy things and going "ooo" and "ahhh" then jotting notes down. It never seemed sustainable.
  3. Yeah you are confirming what I suspected in terms of why it is succeeding, and I already anticipated the limitations of using only one control-plane assigned node's IP. I more did it to see if it would break in ways I didn't anticipate (and it didn't). I really do think the kube-vip thing has a substantially high probability of success for solving the matter we're talking about here, considering that's what it says on the tin as a solution multiple times (and who doesn't love tinned food?).
  4. Maybe if I figure this out, it might be prudent for the Calico documentation to include said solution to help $futureHumans? But saying that not trying to put the egg before the horse's cart.
caseydavenport commented 4 weeks ago

Would love to have better documentation on this! Especially if kube-vip is a viable option, documenting the steps you took to integrate with that would be awesome.

BloodyIron commented 4 weeks ago

Would love to have better documentation on this! Especially if kube-vip is a viable option, documenting the steps you took to integrate with that would be awesome.

I already am aspiring to do my best on documenting such things! Once I figure out this last bit I am probably going to do a yuge publication (or multiples?) on my own personal bloggy/articley site to get eCred hehe. But I do suspect that the Calico docs (and Rancher/RKE2 docs?) would probably benefit from such insights too :D Yay!

Oh and still feel free to correct me on any inaccuracies/misunderstandings/other details in general. Still lots to learn about k8s ;D