linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Linkerd CNI mode, race condition with calico-node restart #11228

Closed ostavnaas closed 5 months ago

ostavnaas commented 1 year ago

What is the issue?

When using Linkerd in CNI mode with Calico, there are still race conditions. I see several related issues, such as #4789 and #4049, that address this, and some of the problems have been solved.

But there is still a race condition when linkerd-cni starts before calico-node. In my case it happens while installing Calico and Linkerd together with Helmfile (helm). That causes all linkerd control-plane pods to get stuck in Init:CrashLoopBackOff.

Another case where I have found more details is when calico-node restarts while the linkerd-cni pod is still running. Calico overwrites 10-calico.conflist, but the watch in install-cni.sh does not pick it up, because the file is only modified. The linkerd control-plane pods will then get stuck if restarted/redeployed.

How can it be reproduced?

Tested with Kind and Kubernetes 1.26.6. Installed the calico, linkerd-cni and linkerd-control-plane Helm charts:

calico                          tigera-operator         1               2023-08-09 09:32:46.735083977 +0200 CEST        deployed        tigera-operator-v3.26.1         v3.26.1      
linkerd-control-plane           linkerd                 1               2023-08-10 08:04:10.431260789 +0200 CEST        failed          linkerd-control-plane-1.12.6    stable-2.13.6
linkerd-crds                    linkerd                 2               2023-08-10 08:04:09.110529241 +0200 CEST        deployed        linkerd-crds-1.6.1                           
linkerd2-cni                    linkerd-cni             6               2023-08-10 08:04:10.0718757 +0200 CEST          deployed        linkerd2-cni-30.8.3             stable-2.13.5

linkerd-control-plane values.yaml

cniEnabled: true
kubectl rollout restart -n calico-system daemonset calico-node
kubectl rollout restart -n linkerd deployment linkerd-destination linkerd-identity linkerd-proxy-injector
linkerd-destination-57f8bbb568-6l887      4/4     Running      0             39m
linkerd-destination-854dc5d654-zc9nc      0/4     Init:Error   1 (34s ago)   66s
linkerd-identity-854b49ddb7-dkb55         0/2     Init:Error   1 (34s ago)   66s
linkerd-identity-f9f9b5b6f-nqqvh          2/2     Running      0             39m
linkerd-proxy-injector-5668d6ff7c-57h79   0/2     Init:Error   1 (34s ago)   66s
linkerd-proxy-injector-78b97dd47f-8t8zn   2/2     Running      0             39m

Linkerd is not able to recover unless you manually restart linkerd-cni.
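For example, something along these lines (DaemonSet name and namespace assumed from the default linkerd2-cni chart install):

kubectl rollout restart -n linkerd-cni daemonset linkerd-cni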

Logs, error output, etc

kubectl logs linkerd-cni-j8d9k
Wrote linkerd CNI binaries to /host/opt/cni/bin
Installing CNI configuration in "chained" mode for /host/etc/cni/net.d/10-calico.conflist
Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://10.96.0.1:__KUBERNETES_SERVICE_PORT__",
CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://10.96.0.1:443",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": 4143,
    "outgoing-proxy-port": 4140,
    "proxy-uid": 2102,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["4191","4190"],
    "simulate": false,
    "use-wait-flag": false
  }
}
Created CNI config /host/etc/cni/net.d/10-calico.conflist
Setting up watches.
Watches established.

No changes were detected in /host/etc/cni/net.d/10-calico.conflist, because the script only watches for create and delete events: https://github.com/linkerd/linkerd2/blob/main/cni-plugin/deployment/scripts/install-cni.sh#L308

output of linkerd check -o short

linkerd-version

‼ cli is up-to-date is running version 2.13.5 but the latest stable version is 2.13.6 see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version

‼ control plane and cli versions match control plane running stable-2.13.6 but cli running stable-2.13.5 see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy

× control plane proxies are healthy pod "linkerd-destination-854dc5d654-zc9nc" status is Pending see https://linkerd.io/2.13/checks/#l5d-cp-proxy-healthy for hints

Status check results are ×

Environment

Possible solution

It could be solved by adding modify to the watched events, but I am not sure what other edge cases that could cause:

inotifywait -m "${HOST_CNI_NET}" -e create,delete,modify
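For illustration, a simplified sketch of what the watch loop could look like with modify included. This is not the actual install-cni.sh (which also compares the config and ignores events with no real changes); install_cni_conf is a hypothetical helper standing in for the re-install logic:

# Sketch only: watch for in-place modifications as well as create/delete/moved_to.
inotifywait -m "${HOST_CNI_NET}" -e create,delete,moved_to,modify |
while read -r directory action filename; do
  case "${filename}" in
  *.conf|*.conflist)
    echo "Detected change in ${directory}: ${action} ${filename}"
    # Re-apply the linkerd-cni entry to the (re)written conflist.
    install_cni_conf "${directory}${filename}"
    ;;
  esac
done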

Additional context

No response

Would you like to work on fixing this bug?

None

alpeb commented 1 year ago

You need to upgrade your linkerd2-cni chart as well to 30.8.4. Please give that a try and let me know how it goes.
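For example, assuming the chart was installed as the linkerd2-cni release from the linkerd Helm repo (as shown in the helm list output above), something like:

helm upgrade linkerd2-cni linkerd/linkerd2-cni -n linkerd-cni --version 30.8.4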

ostavnaas commented 1 year ago

After updating to linkerd2-cni==30.8.4 I recreated the cluster, but the installation still does not work flawlessly.

cat kind-config.yaml
....
networking:
  disableDefaultCNI: true
  podSubnet: 10.0.0.0/16
kind create cluster --name kind-dev --config kind-config.yaml
...
helmfile  sync
...

kubectl get pods -A -w 
NAMESPACE            NAME                                               READY   STATUS    RESTARTS   AGE
kube-system          coredns-787d4945fb-4nwdh                           0/1     Pending   0          5s
kube-system          coredns-787d4945fb-4tvzz                           0/1     Pending   0          5s
kube-system          etcd-kind-dev-control-plane                      1/1     Running   0          22s
kube-system          kube-apiserver-kind-dev-control-plane            1/1     Running   0          22s
kube-system          kube-controller-manager-kind-dev-control-plane   1/1     Running   0          20s
kube-system          kube-proxy-rxj2b                                   1/1     Running   0          5s
kube-system          kube-scheduler-kind-dev-control-plane            1/1     Running   0          20s
local-path-storage   local-path-provisioner-6bd6454576-94x46            0/1     Pending   0          5s
tigera-operator      tigera-operator-78d7857c44-wfr72                   0/1     Pending   0          0s
tigera-operator      tigera-operator-78d7857c44-wfr72                   0/1     Pending   0          1s
tigera-operator      tigera-operator-78d7857c44-wfr72                   0/1     ContainerCreating   0          1s
tigera-operator      tigera-operator-78d7857c44-wfr72                   1/1     Running             0          5s
calico-system        calico-typha-5d696b77d9-zxrsk                      0/1     Pending             0          0s
calico-system        calico-typha-5d696b77d9-zxrsk                      0/1     Pending             0          0s
calico-system        calico-typha-5d696b77d9-zxrsk                      0/1     ContainerCreating   0          0s
calico-system        calico-node-l678x                                  0/1     Pending             0          0s
calico-system        calico-node-l678x                                  0/1     Pending             0          0s
calico-system        calico-node-l678x                                  0/1     Init:0/2            0          0s
calico-system        csi-node-driver-4xsr6                              0/2     Pending             0          0s
calico-system        csi-node-driver-4xsr6                              0/2     Pending             0          0s
calico-system        calico-kube-controllers-74877d4945-mcfvd           0/1     Pending             0          0s
calico-system        calico-kube-controllers-74877d4945-mcfvd           0/1     Pending             0          0s
calico-system        csi-node-driver-4xsr6                              0/2     ContainerCreating   0          0s
linkerd-cni          linkerd-cni-fdsxs                                  0/1     Pending             0          0s
linkerd-cni          linkerd-cni-fdsxs                                  0/1     Pending             0          0s
linkerd-cni          linkerd-cni-fdsxs                                  0/1     ContainerCreating   0          0s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Pending             0          0s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Pending             0          0s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Pending             0          0s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Pending             0          0s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Pending             0          0s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Pending             0          0s
calico-system        calico-node-l678x                                  0/1     Init:1/2            0          5s
calico-system        calico-typha-5d696b77d9-zxrsk                      0/1     Running             0          8s
calico-system        calico-typha-5d696b77d9-zxrsk                      1/1     Running             0          9s
kube-system          coredns-787d4945fb-4tvzz                           0/1     Pending             0          33s
kube-system          coredns-787d4945fb-4nwdh                           0/1     Pending             0          33s
local-path-storage   local-path-provisioner-6bd6454576-94x46            0/1     Pending             0          33s
calico-system        calico-kube-controllers-74877d4945-mcfvd           0/1     Pending             0          13s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Pending             0          10s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Pending             0          10s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Pending             0          10s
kube-system          coredns-787d4945fb-4tvzz                           0/1     ContainerCreating   0          33s
kube-system          coredns-787d4945fb-4nwdh                           0/1     ContainerCreating   0          36s
local-path-storage   local-path-provisioner-6bd6454576-94x46            0/1     ContainerCreating   0          37s
calico-system        calico-kube-controllers-74877d4945-mcfvd           0/1     ContainerCreating   0          17s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Init:0/1            0          15s
calico-system        calico-node-l678x                                  0/1     PodInitializing     0          18s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Init:0/1            0          16s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Init:0/1            0          16s
calico-system        calico-node-l678x                                  0/1     Running             0          20s
local-path-storage   local-path-provisioner-6bd6454576-94x46            0/1     ContainerCreating   0          46s
local-path-storage   local-path-provisioner-6bd6454576-94x46            0/1     ContainerCreating   0          46s
kube-system          coredns-787d4945fb-4tvzz                           0/1     ContainerCreating   0          46s
calico-system        calico-kube-controllers-74877d4945-mcfvd           0/1     ContainerCreating   0          26s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Init:0/1            0          23s
local-path-storage   local-path-provisioner-6bd6454576-94x46            1/1     Running             0          47s
kube-system          coredns-787d4945fb-4tvzz                           0/1     ContainerCreating   0          47s
calico-system        calico-kube-controllers-74877d4945-mcfvd           0/1     ContainerCreating   0          27s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Init:0/1            0          24s
kube-system          coredns-787d4945fb-4tvzz                           0/1     Running             0          48s
kube-system          coredns-787d4945fb-4tvzz                           1/1     Running             0          48s
calico-system        csi-node-driver-4xsr6                              0/2     ContainerCreating   0          28s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Init:0/1            0          25s
calico-system        csi-node-driver-4xsr6                              0/2     ContainerCreating   0          29s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Init:0/1            0          26s
kube-system          coredns-787d4945fb-4nwdh                           0/1     ContainerCreating   0          49s
kube-system          coredns-787d4945fb-4nwdh                           0/1     ContainerCreating   0          50s
calico-system        calico-node-l678x                                  1/1     Running             0          30s
kube-system          coredns-787d4945fb-4nwdh                           0/1     Running             0          51s
kube-system          coredns-787d4945fb-4nwdh                           1/1     Running             0          51s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Init:0/1            0          28s
linkerd-cni          linkerd-cni-fdsxs                                  0/1     ContainerCreating   0          30s
linkerd-cni          linkerd-cni-fdsxs                                  0/1     ContainerCreating   0          31s
calico-system        calico-kube-controllers-74877d4945-mcfvd           0/1     Running             0          32s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Init:0/1            0          29s
calico-system        calico-kube-controllers-74877d4945-mcfvd           1/1     Running             0          33s
calico-apiserver     calico-apiserver-597785ccc7-b6t7m                  0/1     Pending             0          0s
calico-apiserver     calico-apiserver-597785ccc7-b6t7m                  0/1     Pending             0          0s
calico-apiserver     calico-apiserver-597785ccc7-gl5v2                  0/1     Pending             0          0s
calico-apiserver     calico-apiserver-597785ccc7-gl5v2                  0/1     Pending             0          0s
calico-apiserver     calico-apiserver-597785ccc7-b6t7m                  0/1     ContainerCreating   0          0s
calico-apiserver     calico-apiserver-597785ccc7-gl5v2                  0/1     ContainerCreating   0          0s
calico-apiserver     calico-apiserver-597785ccc7-gl5v2                  0/1     ContainerCreating   0          2s
calico-apiserver     calico-apiserver-597785ccc7-b6t7m                  0/1     ContainerCreating   0          2s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Init:0/1            0          36s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Init:0/1            0          40s
linkerd-cni          linkerd-cni-fdsxs                                  1/1     Running             0          45s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Init:0/1            0          44s
calico-apiserver     calico-apiserver-597785ccc7-gl5v2                  0/1     Running             0          17s
calico-apiserver     calico-apiserver-597785ccc7-b6t7m                  0/1     Running             0          17s
calico-system        csi-node-driver-4xsr6                              2/2     Running             0          54s
calico-apiserver     calico-apiserver-597785ccc7-b6t7m                  1/1     Running             0          22s
calico-apiserver     calico-apiserver-597785ccc7-gl5v2                  1/1     Running             0          22s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Init:Error          0          66s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Init:0/1            1 (1s ago)   67s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Init:Error          0            70s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Init:0/1            1 (2s ago)   71s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Init:Error          0            74s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Init:0/1            1 (2s ago)   75s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Init:Error          1 (31s ago)   97s
linkerd              linkerd-proxy-injector-5dff786bfc-jgrfm            0/2     Init:Error          1 (32s ago)   101s
linkerd              linkerd-identity-9469bb5b7-cxzmv                   0/2     Init:Error          1 (32s ago)   105s
linkerd              linkerd-destination-7cd8bd9b78-xbm7v               0/4     Init:CrashLoopBackOff   1 (11s ago)   107s

In this first scenario, during initial setup, linkerd-cni was able to correctly set up 10-calico.conflist, but the linkerd pods still would not start until I ran kubectl rollout restart -n linkerd deployment linkerd-destination linkerd-identity linkerd-proxy-injector.

For the second scenario, where the calico-node daemonset is restarted, linkerd-cni is unable to update 10-calico.conflist:


kubectl rollout restart -n calico-system daemonset calico-node
kubectl logs linkerd-cni-x2cm7
[2023-08-11 06:16:16] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2023-08-11 06:16:16] Installing CNI configuration for /host/etc/cni/net.d/10-calico.conflist
[2023-08-11 06:16:16] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://10.96.0.1:__KUBERNETES_SERVICE_PORT__",
[2023-08-11 06:16:16] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://10.96.0.1:443",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": 4143,
    "outgoing-proxy-port": 4140,
    "proxy-uid": 2102,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["4191","4190"],
    "simulate": false,
    "use-wait-flag": false
  }
}
[2023-08-11 06:16:16] Created CNI config /host/etc/cni/net.d/10-calico.conflist
Setting up watches.
Watches established.

kubectl exec -it linkerd-cni-x2cm7 -- cat /host/etc/cni/net.d/10-calico.conflist 
{
                          "name": "k8s-pod-network",
                          "cniVersion": "0.3.1",
                          "plugins": [{"container_settings":{"allow_ip_forwarding":false},"datastore_type":"kubernetes","ipam":{"assign_ipv4":"true","assign_ipv6":"false","type":"calico-ipam"},"kubernetes":{"k8s_api_root":"https://10.96.0.1:443","kubeconfig":"/etc/cni/net.d/calico-kubeconfig"},"log_file_max_age":30,"log_file_max_count":10,"log_file_max_size":100,"log_file_path":"/var/log/calico/cni/cni.log","log_level":"Info","mtu":0,"nodename_file_optional":false,"policy":{"type":"k8s"},"type":"calico"},{"capabilities":{"bandwidth":true},"type":"bandwidth"},{"capabilities":{"portMappings":true},"snat":true,"type":"portmap"}] 
                        }
kubectl rollout restart -n linkerd deployment linkerd-destination linkerd-identity linkerd-proxy-injector            
kubectl get pods -n linkerd
NAME                                      READY   STATUS                  RESTARTS      AGE
linkerd-destination-59f98b955-skjv7       0/4     Init:CrashLoopBackOff   3 (20s ago)   3m2s
linkerd-destination-8667db85b-9mdgf       4/4     Running                 0             11m
linkerd-identity-5c98f988fd-qdjfg         0/2     Init:CrashLoopBackOff   3 (22s ago)   3m2s
linkerd-identity-748d6d97d5-zrfnd         2/2     Running                 0             11m
linkerd-proxy-injector-559d6f968b-9k5mm   0/2     Init:CrashLoopBackOff   3 (18s ago)   3m2s
linkerd-proxy-injector-56ffbdd55f-spbx4   2/2     Running                 0             11m
alpeb commented 1 year ago

Thanks for the detailed feedback. Focusing on the second scenario, it does indeed look like linkerd-cni isn't properly listening to 10-calico.conflist changes. Note that the code you referred to earlier in the linkerd2 repo is no longer used (it's high time we delete that); instead we're using the linkerd2-proxy-init repo. You'll see there that we're listening to the moved_to event. Maybe adding modify like you suggest would do the trick? I'd like however to understand better what's going on before making that change. Could you share a version of your kind and helmfile configs that would allow me to repro this?

ostavnaas commented 1 year ago

The Helmfile does not matter much; you can reproduce the issue with just Helm and Kind:

kind --version
kind version 0.20.0

helm version  
version.BuildInfo{Version:"v3.11.1", GitCommit:"293b50c65d4d56187cd4e2f390f0ada46b4c4737", GitTreeState:"clean", GoVersion:"go1.18.10"}
==> calico.yaml <==
installation:
  kubernetesProvider: ""
  cni:
    type: Calico
  calicoNetwork:
    ipPools:
    - cidr: 10.100.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()

==> kind-config.yaml <==
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
networking:
  disableDefaultCNI: true
  podSubnet: 10.100.0.0/16

==> linkerd.yaml <==
identityTrustAnchorsPEM: |
  -----BEGIN CERTIFICATE-----
  MIIBjTCCATSgAwIBAgIRALMXoARwnDtQJ7US5nK/4tswCgYIKoZIzj0EAwIwJTEj
  MCEGA1UEAxMacm9vdC5saW5rZXJkLmNsdXN0ZXIubG9jYWwwHhcNMjMwNjIyMDUw
  NTQ5WhcNMzMwNjE5MDUwNTQ5WjAlMSMwIQYDVQQDExpyb290LmxpbmtlcmQuY2x1
  c3Rlci5sb2NhbDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABAq/TIdzyK053XRK
  qtVsjO1TNDjRMW/JJAao+IQKoPZ1hDVkOuF7sm8CMZzxeoZzatEp8tHQIgzNf/p+
  e4My3fqjRTBDMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/AgEBMB0G
  A1UdDgQWBBTyXBIrlHeQw+v23A8M3vGlK4Fo0TAKBggqhkjOPQQDAgNHADBEAiAo
  eOQP/Ph+DitUv0pB+bQdBPIRhFQXxroqhtbz4h2FAwIgRDBemEdr1FZgLgBvFKS9
  bbEyRH4awLA1bkYy8+n+VW0=
  -----END CERTIFICATE-----

identity:
  issuer:
    tls:
      crtPEM: |
        -----BEGIN CERTIFICATE-----
        MIIBsTCCAVigAwIBAgIQWQBydcPUL+SfWE7eJbQdATAKBggqhkjOPQQDAjAlMSMw
        IQYDVQQDExpyb290LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAeFw0yMzA2MjIwNTA1
        NDlaFw0yNDA2MjEwNTA1NDlaMCkxJzAlBgNVBAMTHmlkZW50aXR5LmxpbmtlcmQu
        Y2x1c3Rlci5sb2NhbDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABA362DkfI6BS
        DF/mKC4bT/7vGpRayCgTG61dEucpZV5/ghTCzcS+YsscpG81DID7KHHqzRInWJBz
        Me9Pod0exs2jZjBkMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/AgEA
        MB0GA1UdDgQWBBTuOWWQpKtP2TVklSoRbPTZuBtkSTAfBgNVHSMEGDAWgBTyXBIr
        lHeQw+v23A8M3vGlK4Fo0TAKBggqhkjOPQQDAgNHADBEAiAJNmq0n97ANIGaI0+l
        D4Ro0fwLuNJ4MHQ9m0gUz7+VfwIgdrCckcm7ZhJC+Bekudm+FT9hSvjAtfFb0gF1
        4hS3ozY=
        -----END CERTIFICATE-----
      keyPEM: |
        -----BEGIN EC PRIVATE KEY-----
        MHcCAQEEIHRHu2CpQCbWvW3flJRZhEGO3e9/7NhO4l3LeZ/F9kQHoAoGCCqGSM49
        AwEHoUQDQgAEDfrYOR8joFIMX+YoLhtP/u8alFrIKBMbrV0S5yllXn+CFMLNxL5i
        yxykbzUMgPsocerNEidYkHMx70+h3R7GzQ==
        -----END EC PRIVATE KEY-----
cniEnabled: true
networkValidator:
  timeout: 30s

==> run.sh <==
kind create cluster --name linkerd --config kind-config.yaml

helm repo add projectcalico https://docs.tigera.io/calico/charts
helm repo add linkerd https://helm.linkerd.io/stable
helm repo update
helm install calico projectcalico/tigera-operator -f calico.yaml --wait
helm install  linkerd-crds linkerd/linkerd-crds --wait
helm install  linkerd2-cni linkerd/linkerd2-cni --wait
helm install linkerd-control-plane linkerd/linkerd-control-plane  -f linkerd.yaml --wait

# Need to run the command below, or linkerd ends up in Init:CrashLoopBackOff
# kubectl rollout restart  deployment linkerd-destination linkerd-identity linkerd-proxy-injector

kubectl exec daemonsets/linkerd-cni -- cat /host/etc/cni/net.d/10-calico.conflist
kubectl rollout restart -n calico-system daemonset calico-node
sleep 10
kubectl exec daemonsets/linkerd-cni -- cat /host/etc/cni/net.d/10-calico.conflist
ostavnaas commented 1 year ago

This bug hit our test cluster, a Microsoft-managed AKS cluster. Pods were not able to start:
Back-off restarting failed container linkerd-network-validator in pod

The calico-node pods were newer than the linkerd-cni pods (probably because Microsoft upgraded Calico).


$ kubectl get pods -n calico-system
calico-node-c82h8                         1/1     Running   0          2d8h                                                                                                                                                                                                                 
calico-node-kjrzr                         1/1     Running   0          2d8h                                                                                                                                                                                                                 
calico-node-ljtqv                         1/1     Running   0          2d8h                                                                                                                                                                                                                 
calico-node-ngwhl                         1/1     Running   0          2d8h                                                                                                                                                                                                                 
calico-node-xc98n                         1/1     Running   0          2d8h              
...

$ kubectl get pods -n linkerd-cni
linkerd-cni-chz8c   1/1     Running       0          29d                                                                                                                                                                                                                                    
linkerd-cni-dpwdx   1/1     Running   0          29d                                                                                                                                                                                                                                    
linkerd-cni-lkwhw   1/1     Running       0          25d                                                                                                                                                                                                                                    
linkerd-cni-pqj5k   1/1     Running       0          29d       

$ kubectl exec -it calico-node-kjrzr -- cat /host/etc/cni/net.d/10-calico.conflist                                                                                                                                                                                                           
{                                                                                                                                                                                                                                                                                           
  "name": "k8s-pod-network",                                                                                                                                                                                                                                                                
  "cniVersion": "0.3.1",                                                                                                                                                                                                                                                                    
  "plugins": [                                                                                                                                                                                                                                                                              
    {                                                                                                                                                                                                                                                                                       
      "type": "calico",                                                                                                                                                                                                                                                                     
      "datastore_type": "kubernetes",                                                                                                                                                                                                                                                       
      "mtu": 0,                                                                                                                                                                                                                                                                             
      "nodename_file_optional": false,                                                                                                                                                                                                                                                      
      "log_level": "Info",                                                                                                                                                                                                                                                                  
      "log_file_path": "/var/log/calico/cni/cni.log",                                                                                                                                                                                                                                       
      "ipam": { "type": "host-local", "subnet": "usePodCidr"},                                                                                                                                                                                                                              
      "container_settings": {                                                                                                                                                                                                                                                               
          "allow_ip_forwarding": true                                                                                                                                                                                                                                                       
      },                                                                                                                                                                                                                                                                                    
      "policy": {                                                                                                                                                                                                                                                                           
          "type": "k8s"                                                                                                                                                                                                                                                                     
      },                                                                                                                                                                                                                                                                                    
      "kubernetes": {                                                                                                                                                                                                                                                                       
          "k8s_api_root":"https://<sensitive>:443",                                                                                                                                                      
          "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"                                                                                                                                                                                                                                  
      }                                                                                                                                                                                                                                                                                     
    },                                                                                                                                                                                                                                                                                      
    {                                                                                                                                                                                                                                                                                       
      "type": "bandwidth",                                                                                                                                                                                                                                                                  
      "capabilities": {"bandwidth": true}                                                                                                                                                                                                                                                   
    },                                                                                                                                                                                                                                                                                      
    {"type": "portmap", "snat": true, "capabilities": {"portMappings": true}}                                                                                                                                                                                                               
  ]                                                                                                                                                                                                                                                                                         
}
Agalin commented 1 year ago

While we're also affected, it's not really a Linkerd issue. The linkerd-cni insertion works properly; it's Calico that mindlessly overwrites the whole CNI config on startup, removing linkerd in the process.

Edit: my bad. While Calico is to blame, linkerd already works around multiple other CNI implementations doing the same thing.

mhmd3bdo commented 11 months ago

@alpeb We also have a similar issue: Linkerd CNI (stable-2.14.1) doesn't detect the changes made by Calico in AKS (1.25.6) when the Calico pod gets restarted. I tried to capture the events that happen in the directory when the calico pod is killed and restarted on the node, and I see MODIFY but no CREATE or DELETE, so I would assume that watching for modify would be the fix.

root@linkerd-cni-9h2rc:/linkerd $ inotifywait -m /host/etc/cni/net.d                        
Setting up watches.
Watches established.
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ OPEN calico-kubeconfig
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ CLOSE_WRITE,CLOSE calico-kubeconfig
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ MODIFY 10-calico.conflist
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ MODIFY 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_WRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ OPEN calico-kubeconfig
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ CLOSE_WRITE,CLOSE calico-kubeconfig
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ ACCESS,ISDIR 
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR 
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
Agalin commented 10 months ago

I've modified install-cni.sh to include modify in the observed events and to try updating the config on a MODIFY event. This seems to fix the issue.

Linkerd CNI pod log:

[2023-11-14 12:47:36] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2023-11-14 12:47:36] Installing CNI configuration for /host/etc/cni/net.d/10-calico.conflist
[2023-11-14 12:47:36] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://:__KUBERNETES_SERVICE_PORT__",
[2023-11-14 12:47:36] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": ,
    "outgoing-proxy-port": ,
    "proxy-uid": ,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["",""],
    "simulate": false,
    "use-wait-flag": false
  }
}
[2023-11-14 12:47:36] Created CNI config /host/etc/cni/net.d/10-calico.conflist
Setting up watches.
Watches established.
[2023-11-14 12:51:11] Detected change in /host/etc/cni/net.d/: MODIFY 10-calico.conflist
[2023-11-14 12:51:11] New file [10-calico.conflist] detected; re-installing
[2023-11-14 12:51:11] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://:__KUBERNETES_SERVICE_PORT__",
[2023-11-14 12:51:11] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": ,
    "outgoing-proxy-port": ,
    "proxy-uid": ,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["",""],
    "simulate": false,
    "use-wait-flag": false
  }
}
[2023-11-14 12:51:11] Created CNI config /host/etc/cni/net.d/10-calico.conflist
[2023-11-14 12:51:11] Detected change in /host/etc/cni/net.d/: MODIFY 10-calico.conflist
[2023-11-14 12:51:11] Ignoring event: MODIFY /host/etc/cni/net.d/10-calico.conflist; no real changes detected
[2023-11-14 12:51:11] Detected change in /host/etc/cni/net.d/: DELETE 10-calico.conflist
[2023-11-14 12:51:11] Detected change in /host/etc/cni/net.d/: CREATE 10-calico.conflist
[2023-11-14 12:51:11] Ignoring event: CREATE /host/etc/cni/net.d/10-calico.conflist; no real changes detected
[2023-11-14 12:51:11] Detected change in /host/etc/cni/net.d/: MODIFY 10-calico.conflist
[2023-11-14 12:51:11] Ignoring event: MODIFY /host/etc/cni/net.d/10-calico.conflist; no real changes detected

The following kustomize patch was used on the chart v30.11.0:

- op: add
  path: /spec/template/spec/containers/0/args
  value:
    - bash
    - -c
    - |
      sed s/create,delete,moved_to/create,delete,moved_to,modify/ install-cni.sh > /tmp/install-cni.sh;
      sed -i -E 's@(= .CREATE.) (\] \|\| \[)@\1 \2  \"\$ev\" = "MODIFY" \2 @' /tmp/install-cni.sh;
      chmod +x /tmp/install-cni.sh;
      /tmp/install-cni.sh
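(The first sed adds modify to the inotifywait event list; the second makes the event handler treat MODIFY events like CREATE events.)

For reference, a kustomization.yaml along these lines could apply that patch; this is only a sketch, and the resource file name, patch file name and target names are assumptions rather than anything from the chart itself:

# kustomization.yaml (sketch; names are assumed)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - linkerd2-cni.yaml                      # rendered chart output, e.g. from helm template
patches:
  - path: install-cni-watch-modify.yaml    # the JSON patch shown above
    target:
      kind: DaemonSet
      name: linkerd-cni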

Small note: I'm not sure where to look for the current version of install-cni.sh; the one I can find differs (it doesn't include the MOVED_TO watches) and seems to be an earlier version.

mateiidavid commented 10 months ago

@Agalin nice! Thanks for giving it a try. You can find the up-to-date script in our proxy-init repo. We're in the midst of a clean-up so both repos have the logic but the proxy-init is the one where we actively develop it.

xsoheilalizadeh commented 9 months ago

We are also affected by this on AKS. Would upgrading to the v30.11.0 chart solve the issue?

stale[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

jonathansloman commented 6 months ago

This issue is affecting us as well. Is there a plan to release a new version with a fix? (I see this has just been marked as 'wontfix'.)

xsoheilalizadeh commented 6 months ago

@jonathansloman, we fixed it by using the repair controller.

Example

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: linkerd-cni
  namespace: linkerd-cni
spec:
  interval: 5m
  dependsOn:
    - name: linkerd-crds
      namespace: linkerd
  chart:
    spec:
      chart: linkerd2-cni
      version: 30.12.2
      sourceRef:
        kind: HelmRepository
        name: linkerd
        namespace: flux-system
      interval: 60m
  values:
    repairController:
      enabled: true       # this option
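The plain-Helm equivalent would presumably be something along these lines (release, repo and namespace names taken from earlier in this thread):

helm upgrade --install linkerd2-cni linkerd/linkerd2-cni -n linkerd-cni --version 30.12.2 --set repairController.enabled=true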
noraschaefer commented 6 months ago

We are affected by this issue as well. We are already using the repair-controller. While it seems to fix some race conditions, it does not fix the problem that is described in this issue. In our case, when the Calico DaemonSet is restarted while the linkerd-cni pods are still running, Calico overwrites 10-calico.conflist. But linkerd-cni does not notice it, because it's not listening to "modify" events, and carries on running, while our pods are failing to start up. The repair-controller just restarts our pods (and not the linkerd-cni pods) over and over again, and they keep failing until we manually restart the linkerd-cni DaemonSet. We applied the workaround that @Agalin suggested and it solved the issue for us. Thanks @Agalin! It would be good to have an official fix though.

Agalin commented 6 months ago

Just FYI: we have been using this workaround since my last message and I'm not aware of any issues related to it in those 4 months.