linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Linkerd does not inject proxy containers with custom CNI on AWS #12489

Closed · gabbler97 closed this issue 6 months ago

gabbler97 commented 6 months ago

What is the issue?

Linkerd proxy injection does not work with custom CNI (cilium) on AWS EKS clusters.

How can it be reproduced?

Install cilium

 helm list -n kube-system | grep cilium
cilium                          kube-system     4               2024-04-19 12:19:50.727550183 +0000 UTC deployed        cilium-1.15.4                           1.15.4
helm get values cilium -n kube-system
USER-SUPPLIED VALUES:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cni-plugin
          operator: NotIn
          values:
          - aws
egressMasqueradeInterfaces: eth0
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
    - 10.0.0.0/8
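
For reference, the release above corresponds roughly to the following install; the values file name and the cilium/cilium repo alias (added via helm repo add cilium https://helm.cilium.io/) are assumptions:

helm repo add cilium https://helm.cilium.io/
helm upgrade --install cilium cilium/cilium \
  --version 1.15.4 \
  --namespace kube-system \
  -f cilium-values.yaml   # the USER-SUPPLIED VALUES shown above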

Install linkerd

helm list -n linkerd
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
linkerd-control-plane   linkerd         1               2024-04-22 14:00:07.29744245 +0000 UTC  deployed        linkerd-control-plane-2024.3.5  edge-24.3.5
linkerd-crds            linkerd         5               2024-04-04 11:06:36.932480898 +0000 UTC deployed        linkerd-crds-2024.3.5
helm get values linkerd-control-plane -n linkerd
USER-SUPPLIED VALUES:
disableHeartBeat: true
identity:
  issuer:
    scheme: kubernetes.io/tls
identityTrustAnchorsPEM: |-    
  -----BEGIN CERTIFICATE-----
  $CERT_CONTENT
  -----END CERTIFICATE-----
linkerdVersion: edge-24.3.5
policyController:
  image:
    name: my-artifactory/ghcr-docker-remote/linkerd/policy-controller
    version: edge-24.3.5
profileValidator:
  externalSecret: false
proxy:
  image:
    name: my-artifactory/ghcr-docker-remote/linkerd/proxy
    version: edge-24.3.5
  resources:
    cpu:
      limit: 100m
      request: 50m
    memory:
      limit: 100Mi
      request: 40Mi
proxyInit:
  image:
    name: my-artifactory/ghcr-docker-remote/linkerd/proxy-init
    version: v2.2.4
  runAsRoot: false
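
For reference, the release above corresponds roughly to the following install; the linkerd-edge repo alias (added via helm repo add linkerd-edge https://helm.linkerd.io/edge) and the values file name are assumptions:

helm upgrade --install linkerd-control-plane linkerd-edge/linkerd-control-plane \
  --version 2024.3.5 \
  --namespace linkerd \
  -f linkerd-values.yaml   # the USER-SUPPLIED VALUES shown above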

Annotate the namespace for automatic injection

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    config.linkerd.io/proxy-await: enabled
    linkerd.io/inject: enabled
...
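
The same annotations can also be applied to an existing namespace with kubectl; the namespace name (goldilocks, the one used below) is just the example:

kubectl annotate namespace goldilocks \
  linkerd.io/inject=enabled \
  config.linkerd.io/proxy-await=enabled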

Delete the pods

k get pod -n goldilocks
NAME                                     READY   STATUS    RESTARTS   AGE
goldilocks-controller-7869c48649-nqwkl   1/1     Running   0          50m
goldilocks-dashboard-75df58d594-49cj2    1/1     Running   0          50m
goldilocks-dashboard-75df58d594-zgw6v    1/1     Running   0          50m
user@ip-10-x-x-65 ~ $ k delete pod --all -n goldilocks
pod "goldilocks-controller-7869c48649-nqwkl" deleted
pod "goldilocks-dashboard-75df58d594-49cj2" deleted
pod "goldilocks-dashboard-75df58d594-zgw6v" deleted
user@ip-10-x-x-65 ~ $ k get pod -n goldilocks
NAME                                     READY   STATUS    RESTARTS   AGE
goldilocks-controller-7869c48649-vq5g2   1/1     Running   0          8s
goldilocks-dashboard-75df58d594-jdrnm    1/1     Running   0          6s
goldilocks-dashboard-75df58d594-ppxjm    1/1     Running   0          8s

The sidecar proxy should be injected, so the last output should instead be

k get pod -n goldilocks
NAME                                     READY   STATUS    RESTARTS   AGE
goldilocks-controller-7869c48649-vq5g2   2/2     Running   0          8s
goldilocks-dashboard-75df58d594-jdrnm    2/2     Running   0          6s
goldilocks-dashboard-75df58d594-ppxjm    2/2     Running   0          8s
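
A quick way to check whether the webhook fired at all is to list the container names of the new pods; an injected pod would show a linkerd-proxy container alongside the application container:

kubectl get pods -n goldilocks -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'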

Logs, error output, etc

https://gist.github.com/gabbler97/6734dc908cf7136df49a8d2ba5e67eb9

Output of linkerd check -o short

linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2024-04-25T05:51:39Z
    see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.13.4 but the latest stable version is 2.14.10
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.3.5 but the latest edge version is 24.4.4
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.3.5 but cli running stable-2.13.4
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-c6595f85b-b9tlz (edge-24.3.5)
        * linkerd-identity-6bfcf4bf97-cr8km (edge-24.3.5)
        * linkerd-proxy-injector-59d7d485b-crbgj (edge-24.3.5)
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-c6595f85b-b9tlz running edge-24.3.5 but cli running stable-2.13.4
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ linkerd-viz pods are injected
    could not find proxy container for metrics-api-5bd869c749-6vqmt pod
    see https://linkerd.io/2.13/checks/#l5d-viz-pods-injection for hints
‼ viz extension pods are running
    container "linkerd-proxy" in pod "metrics-api-5bd869c749-6vqmt" is not ready
    see https://linkerd.io/2.13/checks/#l5d-viz-pods-running for hints
‼ viz extension proxies are healthy
    no "linkerd-proxy" containers found in the "linkerd" namespace
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-healthy for hints

Status check results are √

Environment

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.11-eks-b9c9ed7

Possible solution

No response

Additional context

I have tried running the linkerd-proxy-injector with hostNetwork=true. In that case the proxy sidecar containers were injected automatically after a deployment rollout, but some nodes became NotReady because the kubelet stopped posting status (after about 10 minutes this resolved on its own). The pods that talk to the kube API server started to CrashLoopBackOff, but only on one specific node at a time (the node where the linkerd-proxy-injector pod was running):

k get pod -A -o wide | grep "("
backup                  node-agent-dh6tm                                             0/1     CrashLoopBackOff   6 (43s ago)    10m     172.24.2.7      ip-10-x-x-162   <none>           <none>
monitoring              datadog-jwgx6                                                3/4     Running            5 (73s ago)    10m     172.24.2.83     ip-10-x-x-162   <none>           <none>
storage-ebs             ebs-csi-node-sfkx6                                           1/3     CrashLoopBackOff   8 (21s ago)    4m25s   172.24.2.163    ip-10-x-x-162   <none>           <none>
storage-fsx             fsx-openzfs-csi-node-2tkw7                                   1/3     CrashLoopBackOff   12 (71s ago)   9m25s   172.24.2.251    ip-10-x-x-162   <none>           <none>
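
To confirm the correlation, the injector's node can be compared with the nodes of the crash-looping pods, using the standard control-plane label (the same label that appears in the webhook config further down):

kubectl get pod -n linkerd -l linkerd.io/control-plane-component=proxy-injector -o wide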

Inside the pod logs I found timeouts for API server requests:

k logs ebs-csi-node-775zv -n storage-ebs
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I0405 08:52:12.308665       1 driver.go:83] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.28.0"
I0405 08:52:12.308784       1 node.go:93] "regionFromSession Node service" region="eu-central-1"
I0405 08:52:12.308809       1 metadata.go:85] "retrieving instance data from ec2 metadata"
I0405 08:52:24.870306       1 metadata.go:88] "ec2 metadata is not available"
I0405 08:52:24.870333       1 metadata.go:96] "retrieving instance data from kubernetes api"
I0405 08:52:24.871040       1 metadata.go:101] "kubernetes api is available"
panic: error getting Node ip-10-x-x-77.eu-central-1.compute.internal: Get "https://172.20.0.1:443/api/v1/nodes/ip-10-x-x-77": dial tcp 172.20.0.1:443: i/o timeout

goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newNodeService(0xc00041cfc0)
        /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go:96 +0x3b1
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver({0xc000477ec0, 0xd, 0x4?})
        /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:106 +0x3e6
main.main()
        /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:64 +0x595

Would you like to work on fixing this bug?

None

alpeb commented 6 months ago

Before attempting to use host networking, can you post the events (kubectl describe) for the deployments (not the pods) after rolling them out, to see if there's any info about why they didn't get injected? The events and logs for the injector pod might also prove useful.
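
For example (namespace and workload names taken from the output above; adjust as needed):

kubectl describe deploy -n goldilocks
kubectl describe pod -n linkerd -l linkerd.io/control-plane-component=proxy-injector
kubectl logs -n linkerd deploy/linkerd-proxy-injector -c proxy-injector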

gabbler97 commented 6 months ago

Thank you for your answer @alpeb !

user@ip-10-x-x-65 ~ $ k logs linkerd-proxy-injector-55f86f4fc9-tsmgc  -n linkerd
Defaulted container "linkerd-proxy" out of: linkerd-proxy, proxy-injector, linkerd-init (init)
[     0.095648s]  INFO ThreadId(01) linkerd2_proxy: release 2.224.0 (d91421a) by linkerd on 2024-03-28T18:07:05Z
[     0.099989s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.101281s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.101298s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.101302s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.101305s]  INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[     0.101309s]  INFO ThreadId(01) linkerd2_proxy: SNI is linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.101312s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.101315s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.104250s]  INFO ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_pool_p2c: Adding endpoint addr=10.0.2.118:8090
[     0.195414s]  INFO ThreadId(01) dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_pool_p2c: Adding endpoint addr=10.0.2.118:8086
[     0.202508s]  INFO ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_pool_p2c: Adding endpoint addr=10.0.31.152:8080
[     0.315761s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=linkerd-proxy-injector.linkerd.serviceaccount.identity.linkerd.cluster.local
user@ip-10-x-x-65 ~ $ k logs linkerd-proxy-injector-55f86f4fc9-tsmgc  -n linkerd -c proxy-injector
time="2024-04-25T11:25:20Z" level=info msg="running version edge-24.3.5"
time="2024-04-25T11:25:20Z" level=info msg="starting admin server on :9995"
time="2024-04-25T11:25:20Z" level=info msg="waiting for caches to sync"
time="2024-04-25T11:25:20Z" level=info msg="listening at :8443"
time="2024-04-25T11:25:20Z" level=info msg="caches synced"
user@ip-10-x-x-65 ~ $ k logs linkerd-proxy-injector-55f86f4fc9-tsmgc  -n linkerd -c linkerd-init
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy-save -t nat"
time="2024-04-25T11:25:12Z" level=info msg="# Generated by iptables-save v1.8.10 on Thu Apr 25 11:25:12 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\nCOMMIT\n# Completed on Thu Apr 25 11:25:12 2024\n"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -N PROXY_INIT_REDIRECT"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,4567,4568/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp -j REDIRECT --to-port 4143 -m comment --comment proxy-init/redirect-all-incoming-to-proxy-port/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PREROUTING -j PROXY_INIT_REDIRECT -m comment --comment proxy-init/install-proxy-init-prerouting/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -N PROXY_INIT_OUTPUT"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -j RETURN -m comment --comment proxy-init/ignore-proxy-user-id/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -o lo -j RETURN -m comment --comment proxy-init/ignore-loopback/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp --match multiport --dports 443,6443 -j RETURN -m comment --comment proxy-init/ignore-port-443,6443/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp -j REDIRECT --to-port 4140 -m comment --comment proxy-init/redirect-all-outgoing-to-proxy-port/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy -t nat -A OUTPUT -j PROXY_INIT_OUTPUT -m comment --comment proxy-init/install-proxy-init-output/1714044312"
time="2024-04-25T11:25:12Z" level=info msg="/sbin/iptables-legacy-save -t nat"
time="2024-04-25T11:25:12Z" level=info msg="# Generated by iptables-save v1.8.10 on Thu Apr 25 11:25:12 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\n:PROXY_INIT_OUTPUT - [0:0]\n:PROXY_INIT_REDIRECT - [0:0]\n-A PREROUTING -m comment --comment \"proxy-init/install-proxy-init-prerouting/1714044312\" -j PROXY_INIT_REDIRECT\n-A OUTPUT -m comment --comment \"proxy-init/install-proxy-init-output/1714044312\" -j PROXY_INIT_OUTPUT\n-A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -m comment --comment \"proxy-init/ignore-proxy-user-id/1714044312\" -j RETURN\n-A PROXY_INIT_OUTPUT -o lo -m comment --comment \"proxy-init/ignore-loopback/1714044312\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m multiport --dports 443,6443 -m comment --comment \"proxy-init/ignore-port-443,6443/1714044312\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m comment --comment \"proxy-init/redirect-all-outgoing-to-proxy-port/1714044312\" -j REDIRECT --to-ports 4140\n-A PROXY_INIT_REDIRECT -p tcp -m multiport --dports 4190,4191,4567,4568 -m comment --comment \"proxy-init/ignore-port-4190,4191,4567,4568/1714044312\" -j RETURN\n-A PROXY_INIT_REDIRECT -p tcp -m comment --comment \"proxy-init/redirect-all-incoming-to-proxy-port/1714044312\" -j REDIRECT --to-ports 4143\nCOMMIT\n# Completed on Thu Apr 25 11:25:12 2024\n"

And the events for the deployments

user@ip-10-x-x-65 ~ $ k describe deploy -n linkerd | grep Events
Events:          <none>
Events:          <none>
Events:          <none>
user@ip-10-x-x-65 ~ $ k describe deploy -n goldilocks | grep Events
Events:          <none>
Events:          <none>
alpeb commented 6 months ago

Also, can you post what you get from kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io linkerd-proxy-injector-webhook-config -oyaml?

gabbler97 commented 6 months ago

Yes of course!

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    meta.helm.sh/release-name: linkerd-control-plane
    meta.helm.sh/release-namespace: linkerd
  labels:
    app.kubernetes.io/managed-by: Helm
    linkerd.io/control-plane-component: proxy-injector
    linkerd.io/control-plane-ns: linkerd
  name: linkerd-proxy-injector-webhook-config
webhooks:
- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    caBundle: $CABUNDLE
    service:
      name: linkerd-proxy-injector
      namespace: linkerd
      path: /
      port: 443
  failurePolicy: Ignore
  matchPolicy: Equivalent
  name: linkerd-proxy-injector.linkerd.io
  namespaceSelector:
    matchExpressions:
    - key: config.linkerd.io/admission-webhooks
      operator: NotIn
      values:
      - disabled
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kube-system
      - cert-manager
  objectSelector:
    matchExpressions:
    - key: linkerd.io/control-plane-component
      operator: DoesNotExist
    - key: linkerd.io/cni-resource
      operator: DoesNotExist
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    - services
    scope: Namespaced
  sideEffects: None
  timeoutSeconds: 10
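
As a side note, the namespaceSelector above only skips namespaces labelled config.linkerd.io/admission-webhooks=disabled (plus kube-system and cert-manager); the target namespace's labels can be double-checked with:

kubectl get ns goldilocks --show-labels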

gabbler97 commented 6 months ago

Any idea how I should continue? Thank you very much in advance!

gabbler97 commented 6 months ago

Any clue? Thank you very much in advance!

adleong commented 6 months ago

Hi @gabbler97! Based on the output from linkerd check, it seems that your control plane is not healthy. Looking more closely at the control plane logs, I do see a lot of failures from the control plane components to connect to each other. I'd suggest using Cilium's observability tools (such as Hubble) to ensure that Cilium is allowing traffic between the control plane components.
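
For example, something along these lines (assuming Hubble Relay is reachable, e.g. via cilium hubble port-forward) would surface dropped flows in the control plane namespace:

cilium hubble port-forward &
hubble observe --namespace linkerd --verdict DROPPED --last 100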

alpeb commented 6 months ago

FWIW, I've successfully tested Linkerd with Cilium chained in hybrid mode with the AWS VPC CNI, and it worked fine. Looking forward to what you find out about the control plane connectivity issues.

gabbler97 commented 6 months ago

Thank you very much for your help! In the meantime I have found another way to avoid IPv4 address exhaustion; if anybody needs it in the future, it is documented here: https://aws.github.io/aws-eks-best-practices/networking/custom-networking/
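
For anyone else following that route, EKS custom networking boils down to enabling the feature on the VPC CNI and creating one ENIConfig per availability zone. A minimal sketch, with the subnet and security group IDs as placeholders:

kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
kubectl set env daemonset aws-node -n kube-system ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

Plus one ENIConfig per availability zone, roughly:

apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: eu-central-1a          # must match the node's zone label value
spec:
  subnet: subnet-xxxxxxxx      # subnet from the secondary CIDR reserved for pods
  securityGroups:
    - sg-xxxxxxxx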