rancher / rke2

https://docs.rke2.io/

Cilium CLI missing (symlinked to cilium-dbg when it should actually include the tool) #6247

Open · opened 1 week ago by BloodyIron

BloodyIron commented 1 week ago

Environmental Info:
RKE2 Version: v1.26.15+rke2r1

The Cilium CNI installed via the Helm charts (using Rancher to provision the cluster) produces agent pods that do not have the cilium CLI application installed at all; instead, cilium is a symlink to cilium-dbg. The problem with this is that there is diagnostic and troubleshooting functionality I need from the cilium CLI tool that is not available in the other cilium-related commands.
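
For example, here's the kind of thing I see inside an agent pod (sketch; the DaemonSet name is assumed to be cilium in kube-system):

# sketch: check what the cilium binary actually is inside an agent pod
kubectl -n kube-system exec ds/cilium -- sh -c 'ls -l "$(command -v cilium)"'
# prints something like: /usr/bin/cilium -> cilium-dbg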

I read the documentation and searched on Google, and cannot find a way to get the cilium CLI properly installed via this method. So can we please have this corrected? It makes most of the Cilium troubleshooting documentation completely useless, since it constantly refers to troubleshooting steps and functions that are only provided by the proper cilium CLI tool.

rbrtbnfgl commented 1 week ago

We mirrored the Cilium image from the Cilium repo. The cilium CLI should be used when Cilium itself is installed using the CLI, not the Helm chart. What type of command do you need from the CLI? I was able to use the cilium CLI by building it from https://github.com/cilium/cilium-cli; I just modified the chart name in the code:

--- a/defaults/defaults.go
+++ b/defaults/defaults.go
@@ -126,7 +126,7 @@ const (
        IngressSecretsNamespace = "cilium-secrets"

        // HelmReleaseName is the default Helm release name for Cilium.
-       HelmReleaseName               = "cilium"
+       HelmReleaseName               = "rke2-cilium"
        HelmValuesSecretName          = "cilium-cli-helm-values"
        HelmValuesSecretKeyName       = "io.cilium.cilium-cli"
        HelmChartVersionSecretKeyName = "io.cilium.chart-version"
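
A rough sketch of that, assuming the repo's standard make target builds the binary:

# hedged sketch: build cilium-cli with the RKE2 Helm release name patched in
git clone https://github.com/cilium/cilium-cli
cd cilium-cli
# apply the one-line HelmReleaseName change from the diff above to
# defaults/defaults.go, then build (a plain `make` is assumed to work here)
make
./cilium version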
BloodyIron commented 6 days ago

> We mirrored the Cilium image from the Cilium repo. The cilium CLI should be used when Cilium itself is installed using the CLI, not the Helm chart. What type of command do you need from the CLI? I was able to use the cilium CLI by building it from https://github.com/cilium/cilium-cli; I just modified the chart name in the code:
>
> --- a/defaults/defaults.go
> +++ b/defaults/defaults.go
> @@ -126,7 +126,7 @@ const (
>         IngressSecretsNamespace = "cilium-secrets"
>
>         // HelmReleaseName is the default Helm release name for Cilium.
> -       HelmReleaseName               = "cilium"
> +       HelmReleaseName               = "rke2-cilium"
>         HelmValuesSecretName          = "cilium-cli-helm-values"
>         HelmValuesSecretKeyName       = "io.cilium.cilium-cli"
>         HelmChartVersionSecretKeyName = "io.cilium.chart-version"

One quick example is checking clustermesh status, but there are others too: https://docs.cilium.io/en/stable/operations/troubleshooting/#automatic-verification

The point of my using Rancher and RKE2 is not having to go and recompile code or rehost my own variants of the tooling provided. I don't even know how I would make that kind of modification without compromising future Rancher/RKE2/related updates, which is a significant concern of mine.

rbrtbnfgl commented 6 days ago

Are you sure that the status command is not working? It should work. I can run some tests tomorrow and give you some feedback.

brandond commented 6 days ago

To rephrase what @rbrtbnfgl said:

> Cilium-cli should only be used when Cilium itself is installed using the CLI, not the Helm chart.

The cilium status can be read from various CRDs, right?
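
For example (these are the standard Cilium CRDs; exact fields and the cilium-dbg invocation depend on the Cilium version):

# sketch: read Cilium state without the standalone CLI
kubectl get ciliumnodes -o wide
kubectl get ciliumendpoints -A
# the agent's debug binary in the pod also reports status
kubectl -n kube-system exec ds/cilium -- cilium-dbg status --brief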

BloodyIron commented 6 days ago

[screenshot]

I don't even see the documentation from Cilium mentioning CRDs for such troubleshooting, so I would be going in blind. And yes, @rbrtbnfgl, I am sure that the command doesn't work: the cilium command is a symlink to cilium-dbg, which has a completely different set of capabilities and commands.

rbrtbnfgl commented 5 days ago

Are you running the CLI from the cilium pod? Per the Cilium docs you have to install it directly on the node: https://docs.cilium.io/en/stable/operations/troubleshooting/#install-the-cilium-cli
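
The steps on that page are roughly the following (paraphrased from the linked docs; version and arch values are examples):

# install the standalone cilium CLI on a Linux amd64 host (per the Cilium docs)
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
CLI_ARCH=amd64
curl -L --fail --remote-name-all "https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz"{,.sha256sum}
sha256sum --check "cilium-linux-${CLI_ARCH}.tar.gz.sha256sum"
sudo tar xzvfC "cilium-linux-${CLI_ARCH}.tar.gz" /usr/local/bin
rm "cilium-linux-${CLI_ARCH}.tar.gz"{,.sha256sum}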

rbrtbnfgl commented 5 days ago

I was able to enable clustermesh and get the correct status:

cilium clustermesh status
⚠  Cluster not configured for clustermesh, use '--set cluster.id' and '--set cluster.name' with 'cilium install'. External workloads may still be configured.
⚠  Service type NodePort detected! Service may fail when nodes are removed from the cluster!
✅ Service "clustermesh-apiserver" of type "NodePort" found
✅ Cluster access information is available:
  - 10.1.1.11:32379
✅ Deployment clustermesh-apiserver is ready
ℹ  KVStoreMesh is disabled

🔌 No cluster connected

🔀 Global services: [ min:-1 / avg:0.0 / max:0 ]
brandond commented 5 days ago

@rbrtbnfgl can you perhaps show an example HelmChartConfig to enable and configure clustermesh via chart values?

BloodyIron commented 5 days ago

> Are you running the CLI from the cilium pod? Per the Cilium docs you have to install it directly on the node: https://docs.cilium.io/en/stable/operations/troubleshooting/#install-the-cilium-cli

Yeah, I'm not installing software on my nodes that doesn't come from a package manager source. That just creates future problems I'm not interested in having (namely, the package manager not being aware of it and never updating it).

I'm in agreement with @brandond that it seems preferable to have a HelmChartConfig setting to enable this, or, I dunno... have it present by default instead of how it is now. Any chance we can make that happen? (I'd prefer it just be there by default.)

brandond commented 5 days ago

> Yeah, I'm not installing software on my nodes that doesn't come from a package manager

This would be something to raise with the cilium team. We just consume their chart and images; we don't control how they package and distribute the node binaries.

I hope you also realize that RKE2 and Cilium are both already "installing software on your nodes" by extracting binaries from images and placing them on the root fs, without using the package manager.

> it seems preferable to have a HelmChartConfig setting to enable this, or, I dunno... have it present by default

Are you talking about enabling clustermesh by default? I don't think everyone would want that enabled by default. It also requires additional configuration:

> Each cluster must be assigned a unique human-readable name as well as a numeric cluster ID (1-255). It is best to assign both these attributes at installation time of Cilium: Helm options cluster.name and cluster.id

rbrtbnfgl commented 4 days ago

OK, I found a way to configure it. It wasn't so easy to do through Helm.

Create Cluster1:

RKE2 config:

write-kubeconfig-mode: 644
cluster-cidr: "10.42.0.0/16"
service-cidr: "10.43.0.0/16"
cni: "cilium"

Cilium value config:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    cluster:
      name: cluster1
      id: 1
    externalWorkloads:
      enabled: true
    clustermesh:
      useAPIServer: true
      config:
        enabled: true
        clusters:
        - name: cluster1
          ips:
          - <ip for the cluster one node>
          port: 32379
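
To apply it, save the manifest to a file (the filename here is arbitrary) and drop it into RKE2's server manifests directory:

# RKE2 auto-applies manifests placed in this directory on server nodes
sudo cp rke2-cilium-config.yaml /var/lib/rancher/rke2/server/manifests/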

When the first cluster is up, configure the second cluster with some info from the first. You need to get the clustermesh apiserver certificate with kubectl -n kube-system get secret clustermesh-apiserver-remote-cert -o yaml and take ca.crt, tls.crt, and tls.key from the output.
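
The individual fields can also be pulled out directly, e.g. (plain kubectl; note the values are base64-encoded in the secret):

# read the three fields from cluster1's clustermesh secret
# (output is base64-encoded; decode if your chart version expects PEM)
kubectl -n kube-system get secret clustermesh-apiserver-remote-cert -o jsonpath='{.data.ca\.crt}'
kubectl -n kube-system get secret clustermesh-apiserver-remote-cert -o jsonpath='{.data.tls\.crt}'
kubectl -n kube-system get secret clustermesh-apiserver-remote-cert -o jsonpath='{.data.tls\.key}'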

Configure Cluster2:

RKE2 config:

write-kubeconfig-mode: 644
cluster-cidr: "10.44.0.0/16"
service-cidr: "10.45.0.0/16"
cni: "cilium"

Cilium Config:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    cluster:
      name: cluster2
      id: 2
    externalWorkloads:
      enabled: true
    clustermesh:
      useAPIServer: true
      config:
        enabled: true
        clusters:
        - name: cluster2
          ips:
          - <Ip for cluster2>
          port: 32379
        - name: cluster1
          ips:
          - <ip for cluster1>
          port: 32379
          tls:
            cert: "The content of tls.crt from cluster1"
            key: "The content of tls.key from cluster1"
            caCert: "The content of ca.crt from cluster1"

When the second cluster has also started, get the same info that was previously taken from the first one with kubectl -n kube-system get secret clustermesh-apiserver-remote-cert -o yaml, taking ca.crt, tls.crt, and tls.key from the output (the same commands as above work here).

Edit Cluster1 config

Edit cilium config

Add the new info from the second cluster to the cilium config of the first cluster. The new config should look like this:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    cluster:
      name: cluster1
      id: 1
    externalWorkloads:
      enabled: true
    clustermesh:
      useAPIServer: true
      config:
        enabled: true
        clusters:
        - name: cluster1
          ips:
          - <ip for the cluster one node>
          port: 32379
        - name: cluster2
          ips:
          - <ip for cluster2>
          port: 32379
          tls:
            cert: "The content of tls.crt from cluster2"
            key: "The content of tls.key from cluster2"
            caCert: "The content of ca.crt from cluster2"

Restart RKE2

sudo service rke2-server restart

The new configuration should be picked up. I checked the status and it was fine:

cilium clustermesh status
⚠  Service type NodePort detected! Service may fail when nodes are removed from the cluster!
✅ Service "clustermesh-apiserver" of type "NodePort" found
✅ Cluster access information is available:
  - 10.1.1.11:32379
✅ Deployment clustermesh-apiserver is ready
ℹ  KVStoreMesh is disabled

✅ All 1 nodes are connected to all clusters [min:1 / avg:1.0 / max:1]

🔌 Cluster Connections:
  - cluster2: 1/1 configured, 1/1 connected

🔀 Global services: [ min:0 / avg:0.0 / max:0 ]

I got this from the first cluster.

brandond commented 4 days ago

That's great; we should add that to the docs, @rbrtbnfgl!

rbrtbnfgl commented 4 days ago

Maybe there is a way to generate and add the certificates beforehand, so you don't need to restart the nodes; you could configure Cilium with the already-generated ones instead.

BloodyIron commented 14 hours ago

This isn't about clustermesh specifically; that was simply an example of a function that is exclusive to the cilium CLI tool.

And to clarify, I did not mean that clustermesh should be on by default, but that the cilium CLI binary should be present by default, and not a symlink to cilium-dbg as it is currently.

Sorry for any time this may have cost you, @rbrtbnfgl, but I wasn't specifically talking about (only) clustermesh; I meant the cilium CLI tool being present so that it can be used, namely in scenarios such as the official Cilium troubleshooting documentation.

The reason I engaged rke2 on this topic first is that it seemed plausible to me (as an outsider) that the way rke2 implements Cilium "made" this change (not having the cilium CLI application), since it would be silly for the Cilium team themselves to do that (it would break a good chunk of the troubleshooting documentation, as we're seeing).

But if there's no way from an rke2 "perspective" to get the cilium CLI app existing (as in, not a symlink), well, I can appeal to the Cilium people (or maybe the rke2 team could?). But... are all options exhausted?

There's still plenty for me to learn when it comes to k8s ;)

rbrtbnfgl commented 13 hours ago

We don't modify anything in the Cilium image; I think the image ships cilium-dbg by design. If you don't want to install anything on the node, you can use the cilium CLI from any client machine; like kubectl, it only needs credentials for the cluster.
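
For example, from any workstation with cluster credentials (sketch; the rke2-cilium release-name caveat discussed above still applies):

# point the standalone cilium CLI at the cluster, just like kubectl
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml   # RKE2's default admin kubeconfig
cilium status -n kube-system
cilium clustermesh status -n kube-system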