ostavnaas closed this issue 5 months ago.
You need to upgrade your linkerd2-cni chart as well, to 30.8.4. Please give that a try and let me know how it goes.
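For reference, a minimal upgrade sketch, assuming the chart was installed as a Helm release named linkerd2-cni in the linkerd-cni namespace (adjust the release name and namespace to your setup):
helm repo update
helm upgrade linkerd2-cni linkerd/linkerd2-cni --version 30.8.4 -n linkerd-cni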
After updating to linkerd2-cni==30.8.4, I tried to recreate the cluster, but it still does not work flawlessly on installation.
cat kind-config.yaml
....
networking:
  disableDefaultCNI: true
  podSubnet: 10.0.0.0/16
kind create cluster --name kind-dev --config kind-config.yaml
...
helmfile sync
...
kubectl get pods -A -w
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-787d4945fb-4nwdh 0/1 Pending 0 5s
kube-system coredns-787d4945fb-4tvzz 0/1 Pending 0 5s
kube-system etcd-kind-dev-control-plane 1/1 Running 0 22s
kube-system kube-apiserver-kind-dev-control-plane 1/1 Running 0 22s
kube-system kube-controller-manager-kind-dev-control-plane 1/1 Running 0 20s
kube-system kube-proxy-rxj2b 1/1 Running 0 5s
kube-system kube-scheduler-kind-dev-control-plane 1/1 Running 0 20s
local-path-storage local-path-provisioner-6bd6454576-94x46 0/1 Pending 0 5s
tigera-operator tigera-operator-78d7857c44-wfr72 0/1 Pending 0 0s
tigera-operator tigera-operator-78d7857c44-wfr72 0/1 Pending 0 1s
tigera-operator tigera-operator-78d7857c44-wfr72 0/1 ContainerCreating 0 1s
tigera-operator tigera-operator-78d7857c44-wfr72 1/1 Running 0 5s
calico-system calico-typha-5d696b77d9-zxrsk 0/1 Pending 0 0s
calico-system calico-typha-5d696b77d9-zxrsk 0/1 Pending 0 0s
calico-system calico-typha-5d696b77d9-zxrsk 0/1 ContainerCreating 0 0s
calico-system calico-node-l678x 0/1 Pending 0 0s
calico-system calico-node-l678x 0/1 Pending 0 0s
calico-system calico-node-l678x 0/1 Init:0/2 0 0s
calico-system csi-node-driver-4xsr6 0/2 Pending 0 0s
calico-system csi-node-driver-4xsr6 0/2 Pending 0 0s
calico-system calico-kube-controllers-74877d4945-mcfvd 0/1 Pending 0 0s
calico-system calico-kube-controllers-74877d4945-mcfvd 0/1 Pending 0 0s
calico-system csi-node-driver-4xsr6 0/2 ContainerCreating 0 0s
linkerd-cni linkerd-cni-fdsxs 0/1 Pending 0 0s
linkerd-cni linkerd-cni-fdsxs 0/1 Pending 0 0s
linkerd-cni linkerd-cni-fdsxs 0/1 ContainerCreating 0 0s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Pending 0 0s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Pending 0 0s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Pending 0 0s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Pending 0 0s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Pending 0 0s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Pending 0 0s
calico-system calico-node-l678x 0/1 Init:1/2 0 5s
calico-system calico-typha-5d696b77d9-zxrsk 0/1 Running 0 8s
calico-system calico-typha-5d696b77d9-zxrsk 1/1 Running 0 9s
kube-system coredns-787d4945fb-4tvzz 0/1 Pending 0 33s
kube-system coredns-787d4945fb-4nwdh 0/1 Pending 0 33s
local-path-storage local-path-provisioner-6bd6454576-94x46 0/1 Pending 0 33s
calico-system calico-kube-controllers-74877d4945-mcfvd 0/1 Pending 0 13s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Pending 0 10s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Pending 0 10s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Pending 0 10s
kube-system coredns-787d4945fb-4tvzz 0/1 ContainerCreating 0 33s
kube-system coredns-787d4945fb-4nwdh 0/1 ContainerCreating 0 36s
local-path-storage local-path-provisioner-6bd6454576-94x46 0/1 ContainerCreating 0 37s
calico-system calico-kube-controllers-74877d4945-mcfvd 0/1 ContainerCreating 0 17s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Init:0/1 0 15s
calico-system calico-node-l678x 0/1 PodInitializing 0 18s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Init:0/1 0 16s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Init:0/1 0 16s
calico-system calico-node-l678x 0/1 Running 0 20s
local-path-storage local-path-provisioner-6bd6454576-94x46 0/1 ContainerCreating 0 46s
local-path-storage local-path-provisioner-6bd6454576-94x46 0/1 ContainerCreating 0 46s
kube-system coredns-787d4945fb-4tvzz 0/1 ContainerCreating 0 46s
calico-system calico-kube-controllers-74877d4945-mcfvd 0/1 ContainerCreating 0 26s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Init:0/1 0 23s
local-path-storage local-path-provisioner-6bd6454576-94x46 1/1 Running 0 47s
kube-system coredns-787d4945fb-4tvzz 0/1 ContainerCreating 0 47s
calico-system calico-kube-controllers-74877d4945-mcfvd 0/1 ContainerCreating 0 27s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Init:0/1 0 24s
kube-system coredns-787d4945fb-4tvzz 0/1 Running 0 48s
kube-system coredns-787d4945fb-4tvzz 1/1 Running 0 48s
calico-system csi-node-driver-4xsr6 0/2 ContainerCreating 0 28s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Init:0/1 0 25s
calico-system csi-node-driver-4xsr6 0/2 ContainerCreating 0 29s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Init:0/1 0 26s
kube-system coredns-787d4945fb-4nwdh 0/1 ContainerCreating 0 49s
kube-system coredns-787d4945fb-4nwdh 0/1 ContainerCreating 0 50s
calico-system calico-node-l678x 1/1 Running 0 30s
kube-system coredns-787d4945fb-4nwdh 0/1 Running 0 51s
kube-system coredns-787d4945fb-4nwdh 1/1 Running 0 51s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Init:0/1 0 28s
linkerd-cni linkerd-cni-fdsxs 0/1 ContainerCreating 0 30s
linkerd-cni linkerd-cni-fdsxs 0/1 ContainerCreating 0 31s
calico-system calico-kube-controllers-74877d4945-mcfvd 0/1 Running 0 32s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Init:0/1 0 29s
calico-system calico-kube-controllers-74877d4945-mcfvd 1/1 Running 0 33s
calico-apiserver calico-apiserver-597785ccc7-b6t7m 0/1 Pending 0 0s
calico-apiserver calico-apiserver-597785ccc7-b6t7m 0/1 Pending 0 0s
calico-apiserver calico-apiserver-597785ccc7-gl5v2 0/1 Pending 0 0s
calico-apiserver calico-apiserver-597785ccc7-gl5v2 0/1 Pending 0 0s
calico-apiserver calico-apiserver-597785ccc7-b6t7m 0/1 ContainerCreating 0 0s
calico-apiserver calico-apiserver-597785ccc7-gl5v2 0/1 ContainerCreating 0 0s
calico-apiserver calico-apiserver-597785ccc7-gl5v2 0/1 ContainerCreating 0 2s
calico-apiserver calico-apiserver-597785ccc7-b6t7m 0/1 ContainerCreating 0 2s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Init:0/1 0 36s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Init:0/1 0 40s
linkerd-cni linkerd-cni-fdsxs 1/1 Running 0 45s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Init:0/1 0 44s
calico-apiserver calico-apiserver-597785ccc7-gl5v2 0/1 Running 0 17s
calico-apiserver calico-apiserver-597785ccc7-b6t7m 0/1 Running 0 17s
calico-system csi-node-driver-4xsr6 2/2 Running 0 54s
calico-apiserver calico-apiserver-597785ccc7-b6t7m 1/1 Running 0 22s
calico-apiserver calico-apiserver-597785ccc7-gl5v2 1/1 Running 0 22s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Init:Error 0 66s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Init:0/1 1 (1s ago) 67s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Init:Error 0 70s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Init:0/1 1 (2s ago) 71s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Init:Error 0 74s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Init:0/1 1 (2s ago) 75s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Init:Error 1 (31s ago) 97s
linkerd linkerd-proxy-injector-5dff786bfc-jgrfm 0/2 Init:Error 1 (32s ago) 101s
linkerd linkerd-identity-9469bb5b7-cxzmv 0/2 Init:Error 1 (32s ago) 105s
linkerd linkerd-destination-7cd8bd9b78-xbm7v 0/4 Init:CrashLoopBackOff 1 (11s ago) 107s
In the first scenario, during setup, linkerd-cni was able to correctly set up 10-calico.conflist, but linkerd still failed to start until I ran kubectl rollout restart -n linkerd deployment linkerd-destination linkerd-identity linkerd-proxy-injector.
In the second scenario, where the calico-node daemonset is restarted, linkerd-cni is unable to update 10-calico.conflist:
kubectl rollout restart -n calico-system daemonset calico-node
kubectl logs linkerd-cni-x2cm7
[2023-08-11 06:16:16] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2023-08-11 06:16:16] Installing CNI configuration for /host/etc/cni/net.d/10-calico.conflist
[2023-08-11 06:16:16] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
"k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
"k8s_api_root": "https://10.96.0.1:__KUBERNETES_SERVICE_PORT__",
[2023-08-11 06:16:16] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
    "type": "k8s",
    "k8s_api_root": "https://10.96.0.1:443",
    "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
    "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": 4143,
    "outgoing-proxy-port": 4140,
    "proxy-uid": 2102,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["4191","4190"],
    "simulate": false,
    "use-wait-flag": false
  }
}
[2023-08-11 06:16:16] Created CNI config /host/etc/cni/net.d/10-calico.conflist
Setting up watches.
Watches established.
kubectl exec -it linkerd-cni-x2cm7 -- cat /host/etc/cni/net.d/10-calico.conflist
{
"name": "k8s-pod-network",
"cniVersion": "0.3.1",
"plugins": [{"container_settings":{"allow_ip_forwarding":false},"datastore_type":"kubernetes","ipam":{"assign_ipv4":"true","assign_ipv6":"false","type":"calico-ipam"},"kubernetes":{"k8s_api_root":"https://10.96.0.1:443","kubeconfig":"/etc/cni/net.d/calico-kubeconfig"},"log_file_max_age":30,"log_file_max_count":10,"log_file_max_size":100,"log_file_path":"/var/log/calico/cni/cni.log","log_level":"Info","mtu":0,"nodename_file_optional":false,"policy":{"type":"k8s"},"type":"calico"},{"capabilities":{"bandwidth":true},"type":"bandwidth"},{"capabilities":{"portMappings":true},"snat":true,"type":"portmap"}]
}
kubectl rollout restart -n linkerd deployment linkerd-destination linkerd-identity linkerd-proxy-injector
kubectl get pods -n linkerd
NAME READY STATUS RESTARTS AGE
linkerd-destination-59f98b955-skjv7 0/4 Init:CrashLoopBackOff 3 (20s ago) 3m2s
linkerd-destination-8667db85b-9mdgf 4/4 Running 0 11m
linkerd-identity-5c98f988fd-qdjfg 0/2 Init:CrashLoopBackOff 3 (22s ago) 3m2s
linkerd-identity-748d6d97d5-zrfnd 2/2 Running 0 11m
linkerd-proxy-injector-559d6f968b-9k5mm 0/2 Init:CrashLoopBackOff 3 (18s ago) 3m2s
linkerd-proxy-injector-56ffbdd55f-spbx4 2/2 Running 0 11m
Thanks for the detailed feedback. Focusing on the second scenario, it does indeed look like linkerd-cni isn't properly listening for changes to 10-calico.conflist. Note that the code you referred to earlier in the linkerd2 repo is no longer used (it's high time we deleted that); instead we're using the linkerd2-proxy-init repo. You'll see there that we're listening for the moved_to event. Maybe adding modify like you suggest would do the trick? I'd like, however, to understand better what's going on before making that change. Could you share a version of your kind and helmfile configs that would allow me to repro this?
Helmfile doesn't matter that much; here is how you can reproduce the issue with just Helm and kind:
kind --version
kind version 0.20.0
helm version
version.BuildInfo{Version:"v3.11.1", GitCommit:"293b50c65d4d56187cd4e2f390f0ada46b4c4737", GitTreeState:"clean", GoVersion:"go1.18.10"}
==> calico.yaml <==
installation:
  kubernetesProvider: ""
  cni:
    type: Calico
  calicoNetwork:
    ipPools:
      - cidr: 10.100.0.0/16
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
==> kind-config.yaml <==
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
networking:
  disableDefaultCNI: true
  podSubnet: 10.100.0.0/16
==> linkerd.yaml <==
identityTrustAnchorsPEM: |
  -----BEGIN CERTIFICATE-----
  MIIBjTCCATSgAwIBAgIRALMXoARwnDtQJ7US5nK/4tswCgYIKoZIzj0EAwIwJTEj
  MCEGA1UEAxMacm9vdC5saW5rZXJkLmNsdXN0ZXIubG9jYWwwHhcNMjMwNjIyMDUw
  NTQ5WhcNMzMwNjE5MDUwNTQ5WjAlMSMwIQYDVQQDExpyb290LmxpbmtlcmQuY2x1
  c3Rlci5sb2NhbDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABAq/TIdzyK053XRK
  qtVsjO1TNDjRMW/JJAao+IQKoPZ1hDVkOuF7sm8CMZzxeoZzatEp8tHQIgzNf/p+
  e4My3fqjRTBDMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/AgEBMB0G
  A1UdDgQWBBTyXBIrlHeQw+v23A8M3vGlK4Fo0TAKBggqhkjOPQQDAgNHADBEAiAo
  eOQP/Ph+DitUv0pB+bQdBPIRhFQXxroqhtbz4h2FAwIgRDBemEdr1FZgLgBvFKS9
  bbEyRH4awLA1bkYy8+n+VW0=
  -----END CERTIFICATE-----
identity:
  issuer:
    tls:
      crtPEM: |
        -----BEGIN CERTIFICATE-----
        MIIBsTCCAVigAwIBAgIQWQBydcPUL+SfWE7eJbQdATAKBggqhkjOPQQDAjAlMSMw
        IQYDVQQDExpyb290LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAeFw0yMzA2MjIwNTA1
        NDlaFw0yNDA2MjEwNTA1NDlaMCkxJzAlBgNVBAMTHmlkZW50aXR5LmxpbmtlcmQu
        Y2x1c3Rlci5sb2NhbDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABA362DkfI6BS
        DF/mKC4bT/7vGpRayCgTG61dEucpZV5/ghTCzcS+YsscpG81DID7KHHqzRInWJBz
        Me9Pod0exs2jZjBkMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/AgEA
        MB0GA1UdDgQWBBTuOWWQpKtP2TVklSoRbPTZuBtkSTAfBgNVHSMEGDAWgBTyXBIr
        lHeQw+v23A8M3vGlK4Fo0TAKBggqhkjOPQQDAgNHADBEAiAJNmq0n97ANIGaI0+l
        D4Ro0fwLuNJ4MHQ9m0gUz7+VfwIgdrCckcm7ZhJC+Bekudm+FT9hSvjAtfFb0gF1
        4hS3ozY=
        -----END CERTIFICATE-----
      keyPEM: |
        -----BEGIN EC PRIVATE KEY-----
        MHcCAQEEIHRHu2CpQCbWvW3flJRZhEGO3e9/7NhO4l3LeZ/F9kQHoAoGCCqGSM49
        AwEHoUQDQgAEDfrYOR8joFIMX+YoLhtP/u8alFrIKBMbrV0S5yllXn+CFMLNxL5i
        yxykbzUMgPsocerNEidYkHMx70+h3R7GzQ==
        -----END EC PRIVATE KEY-----
cniEnabled: true
networkValidator:
  timeout: 30s
==> run.sh <==
kind create cluster --name linkerd --config kind-config.yaml
helm repo add projectcalico https://docs.tigera.io/calico/charts
helm repo add linkerd https://helm.linkerd.io/stable
helm repo update
helm install calico projectcalico/tigera-operator -f calico.yaml --wait
helm install linkerd-crds linkerd/linkerd-crds --wait
helm install linkerd2-cni linkerd/linkerd2-cni --wait
helm install linkerd-control-plane linkerd/linkerd-control-plane -f linkerd.yaml --wait
# Need to run the command below, or linkerd ends up in Init:CrashLoopBackOff
# kubectl rollout restart deployment linkerd-destination linkerd-identity linkerd-proxy-injector
kubectl exec daemonsets/linkerd-cni -- cat /host/etc/cni/net.d/10-calico.conflist
kubectl rollout restart -n calico-system daemonset calico-node
sleep 10
kubectl exec daemonsets/linkerd-cni -- cat /host/etc/cni/net.d/10-calico.conflist
This bug hit our test cluster running on Microsoft's managed Kubernetes (AKS). Pods were not able to start:
Back-off restarting failed container linkerd-network-validator in pod
The calico-node pods were newer than the linkerd-cni pods (probably because Microsoft upgraded Calico):
$ kubectl get pods -n calico-system
calico-node-c82h8 1/1 Running 0 2d8h
calico-node-kjrzr 1/1 Running 0 2d8h
calico-node-ljtqv 1/1 Running 0 2d8h
calico-node-ngwhl 1/1 Running 0 2d8h
calico-node-xc98n 1/1 Running 0 2d8h
...
$ kubectl get pods -n linkerd-cni
linkerd-cni-chz8c 1/1 Running 0 29d
linkerd-cni-dpwdx 1/1 Running 0 29d
linkerd-cni-lkwhw 1/1 Running 0 25d
linkerd-cni-pqj5k 1/1 Running 0 29d
$ kubectl exec -it calico-node-kjrzr -- cat /host/etc/cni/net.d/10-calico.conflist
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "datastore_type": "kubernetes",
      "mtu": 0,
      "nodename_file_optional": false,
      "log_level": "Info",
      "log_file_path": "/var/log/calico/cni/cni.log",
      "ipam": { "type": "host-local", "subnet": "usePodCidr"},
      "container_settings": {
        "allow_ip_forwarding": true
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "k8s_api_root":"https://<sensitive>:443",
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    },
    {"type": "portmap", "snat": true, "capabilities": {"portMappings": true}}
  ]
}
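A quick way to confirm the symptom shown above (the linkerd-cni plugin entry missing from the conflist) is to grep for it from one of the linkerd-cni pods; the namespace and DaemonSet name here are assumptions, adjust to your cluster:
# Prints 0 matches (and exits non-zero) when Calico has overwritten the config
# and dropped the linkerd-cni plugin entry, as in the output above.
kubectl exec -n linkerd-cni daemonsets/linkerd-cni -- \
  grep -c '"linkerd-cni"' /host/etc/cni/net.d/10-calico.conflist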
While we're also affected, it's not really a Linkerd issue. The linkerd-cni insertion works properly; it's Calico that mindlessly overrides the whole CNI config on startup, removing linkerd in the process.
Edit: my bad. While Calico is to blame, linkerd-cni already works around multiple other CNI implementations that do the same thing.
@alpeb We have a similar issue: Linkerd CNI (stable-2.14.1) doesn't detect the changes made by Calico in AKS (1.25.6) when the Calico pod gets restarted. I tried to catch the events happening in the directory when the calico-node pod is killed and restarted on the node, and I see MODIFY but no CREATE or DELETE, so I would assume adding modify to the watch would be the fix:
root@linkerd-cni-9h2rc:/linkerd $ inotifywait -m /host/etc/cni/net.d
Setting up watches.
Watches established.
/host/etc/cni/net.d/ OPEN,ISDIR
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ OPEN calico-kubeconfig
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ CLOSE_WRITE,CLOSE calico-kubeconfig
/host/etc/cni/net.d/ OPEN,ISDIR
/host/etc/cni/net.d/ ACCESS,ISDIR
/host/etc/cni/net.d/ MODIFY 10-calico.conflist
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS,ISDIR
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR
/host/etc/cni/net.d/ MODIFY 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_WRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR
/host/etc/cni/net.d/ ACCESS,ISDIR
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR
/host/etc/cni/net.d/ ACCESS,ISDIR
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ OPEN calico-kubeconfig
/host/etc/cni/net.d/ MODIFY calico-kubeconfig
/host/etc/cni/net.d/ CLOSE_WRITE,CLOSE calico-kubeconfig
/host/etc/cni/net.d/ OPEN,ISDIR
/host/etc/cni/net.d/ ACCESS,ISDIR
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
/host/etc/cni/net.d/ OPEN,ISDIR
/host/etc/cni/net.d/ ACCESS,ISDIR
/host/etc/cni/net.d/ ACCESS,ISDIR
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE,ISDIR
/host/etc/cni/net.d/ OPEN 10-calico.conflist
/host/etc/cni/net.d/ ACCESS 10-calico.conflist
/host/etc/cni/net.d/ CLOSE_NOWRITE,CLOSE 10-calico.conflist
I've modified install-cni.sh to include modify in the observed events and to try updating the config on a MODIFY event. This seems to fix the issue.
The following kustomize patch was used on the chart v30.11.0:
- op: add
  path: /spec/template/spec/containers/0/args
  value:
    - bash
    - -c
    - |
      sed s/create,delete,moved_to/create,delete,moved_to,modify/ install-cni.sh > /tmp/install-cni.sh;
      sed -i -E 's@(= .CREATE.) (\] \|\| \[)@\1 \2 \"\$ev\" = "MODIFY" \2 @' /tmp/install-cni.sh;
      chmod +x /tmp/install-cni.sh;
      /tmp/install-cni.sh
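To check whether the patch took effect, one option (assuming the chart's default linkerd-cni namespace and DaemonSet name) is to look for the extended event list in the rewritten script inside a running pod:
# The patched copy is written to /tmp/install-cni.sh by the args above; the
# inotifywait event list should now include "modify".
kubectl exec -n linkerd-cni daemonsets/linkerd-cni -- \
  grep -n 'create,delete,moved_to,modify' /tmp/install-cni.sh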
Small note: I'm not sure where to look for the current version of install-cni.sh; the one I can find differs (it doesn't include the MOVED_TO watch) and seems to be an earlier version.
@Agalin nice! Thanks for giving it a try. You can find the up-to-date script in our proxy-init repo. We're in the midst of a clean-up so both repos have the logic but the proxy-init is the one where we actively develop it.
We are also affected by this on AKS; would upgrading to the v30.11.0 chart solve the issue?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
This issue is affecting us as well. Is there a plan to release a new version with a fix? (I see this has just been marked as 'wontfix'.)
@jonathansloman, we fixed it by using the repair controller:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: linkerd-cni
  namespace: linkerd-cni
spec:
  interval: 5m
  dependsOn:
    - name: linkerd-crds
      namespace: linkerd
  chart:
    spec:
      chart: linkerd2-cni
      version: 30.12.2
      sourceRef:
        kind: HelmRepository
        name: linkerd
        namespace: flux-system
      interval: 60m
  values:
    repairController:
      enabled: true # this option
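For those not using Flux, the equivalent with plain Helm should be roughly as follows; the release name and namespace are assumptions matching the installs earlier in this thread:
helm upgrade --install linkerd2-cni linkerd/linkerd2-cni \
  -n linkerd-cni --version 30.12.2 \
  --set repairController.enabled=true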
We are affected by this issue as well. We are already using the repair-controller. While it seems to fix some race conditions, it does not fix the problem that is described in this issue. In our case, when the Calico DaemonSet is restarted while the linkerd-cni pods are still running, Calico overwrites 10-calico.conflist. But linkerd-cni does not notice it, because it's not listening to "modify" events, and carries on running, while our pods are failing to start up. The repair-controller just restarts our pods (and not the linkerd-cni pods) over and over again, and they keep failing until we manually restart the linkerd-cni DaemonSet. We applied the workaround that @Agalin suggested and it solved the issue for us. Thanks @Agalin! It would be good to have an official fix though.
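For reference, the manual mitigation mentioned above (restarting the linkerd-cni DaemonSet so it rewrites its entry into the conflist) is just the following, with the namespace and DaemonSet name assumed:
kubectl rollout restart -n linkerd-cni daemonset linkerd-cni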
Just FYI: we have been using this workaround since my last message and I'm not aware of any issues related to it in those 4 months.
What is the issue?
When using Linkerd in CNI mode with Calico, there are still race conditions. I see there are several related issues, like #4789 and #4049, that address this, and some of the problems have been solved. But there are still race conditions when linkerd-cni starts before calico-node. In my case this happens when installing Calico and Linkerd together with Helmfile (helm), and it causes all linkerd control-plane pods to get stuck in Init:CrashLoopBackOff.
The other case I have found more details on is when calico-node restarts while the linkerd-cni pods are still running: Calico will override 10-calico.conflist, but the watches in install-cni.sh do not catch it, because the file is only modified. The linkerd control-plane pods will then get stuck if restarted/redeployed.
How can it be reproduced?
Tested with kind and Kubernetes 1.26.6. Install the Calico (tigera-operator), linkerd2-cni and linkerd-control-plane Helm charts.
linkerd-control-plane values.yaml
Linkerd is not able to recover, unless you manually restart linkerd-cni
Logs, error output, etc
No changes were detected in /host/etc/cni/net.d/10-calico.conflist because the script only watches for create and delete: https://github.com/linkerd/linkerd2/blob/main/cni-plugin/deployment/scripts/install-cni.sh#L308
Output of linkerd check -o short:
linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.13.5 but the latest stable version is 2.13.6
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane and cli versions match
    control plane running stable-2.13.6 but cli running stable-2.13.5
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
× control plane proxies are healthy
    pod "linkerd-destination-854dc5d654-zc9nc" status is Pending
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-healthy for hints

Status check results are ×
Environment
Possible solution
It could be solved by adding modify, but I'm not sure what other edge cases that could cause:
inotifywait -m "${HOST_CNI_NET}" -e create,delete,modify
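As a rough illustration of what that change amounts to, here is a simplified sketch; it is not the actual install-cni.sh, and reinject_linkerd_conf is a hypothetical stand-in for the script's existing logic that merges the linkerd-cni plugin back into the conflist:
#!/usr/bin/env bash
# Watch the host CNI config dir and react to MODIFY as well as CREATE/MOVED_TO,
# so that Calico rewriting 10-calico.conflist in place is picked up.
HOST_CNI_NET="${HOST_CNI_NET:-/host/etc/cni/net.d}"

reinject_linkerd_conf() {
  # Hypothetical stand-in: the real script re-reads the conflist and re-adds
  # the linkerd-cni plugin entry if it is missing.
  echo "would re-inject linkerd-cni into $1"
}

inotifywait -m "${HOST_CNI_NET}" -e create,delete,moved_to,modify |
while read -r _dir events filename; do
  case "$events" in
    *CREATE*|*MOVED_TO*|*MODIFY*)
      case "$filename" in
        *.conflist) reinject_linkerd_conf "${HOST_CNI_NET}/${filename}" ;;
      esac
      ;;
  esac
done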
Additional context
No response
Would you like to work on fixing this bug?
None