linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Linkerd CNI with EKS 1.26 and VPC CNI not starting Pods (forbidden) #11286

Closed patrickdomnick closed 6 months ago

patrickdomnick commented 10 months ago

What is the issue?

When using Linkerd in CNI mode with EKS 1.26 and the VPC CNI, the Linkerd control plane is not able to start. This issue is present for all Linkerd pods:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "20ecb2f55eed7f8f826624c5f722b879dd9a76a03862d71bc5606724b39ef36a": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Get "https://[172.20.0.1]:443/api/v1/namespaces/linkerd/pods/linkerd-identity-784744bbd9-fz4rb": Forbidden

Other similar issues hinted at problems that are already fixed or do not apply to us, such as /etc/cni/net.d/10-aws.conflist not being chained correctly; in our case linkerd-cni is chained as expected:

{
  "cniVersion": "0.4.0",
  "name": "aws-cni",
  "disableCheck": true,
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "podSGEnforcingMode": "strict",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "DEBUG"
    },
    {
      "name": "egress-cni",
      "type": "egress-cni",
      "mtu": "9001",
      "enabled": "false",
      "randomizeSNAT": "prng",
      "nodeIP": "",
      "ipam": {
        "type": "host-local",
        "ranges": [
          [
            {
              "subnet": "fd00::ac:00/118"
            }
          ]
        ],
        "routes": [
          {
            "dst": "::/0"
          }
        ],
        "dataDir": "/run/cni/v4pd/egress-v6-ipam"
      },
      "pluginLogFile": "/var/log/aws-routed-eni/egress-v6-plugin.log",
      "pluginLogLevel": "DEBUG"
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      },
      "snat": true
    },
    {
      "name": "linkerd-cni",
      "type": "linkerd-cni",
      "log_level": "info",
      "policy": {
        "type": "k8s",
        "k8s_api_root": "[https://172.20.0.1:443 ](https://172.20.0.1/)",
        "k8s_auth_token": "ey..."
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
      },
      "linkerd": {
        "incoming-proxy-port": 4143,
        "outgoing-proxy-port": 4140,
        "proxy-uid": 2102,
        "ports-to-redirect": [],
        "inbound-ports-to-ignore": [
          "4191",
          "4190"
        ],
        "simulate": false,
        "use-wait-flag": false
      }
    }
  ]
}
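
For reference, the chain order can be checked directly on the node (this assumes jq is available there); linkerd-cni is listed last, after aws-cni:

# run on the node; jq is assumed to be installed
jq -c '[.plugins[].type]' /etc/cni/net.d/10-aws.conflist
# ["aws-cni","egress-cni","portmap","linkerd-cni"]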

How can it be reproduced?

Install Linkerd with CNI mode enabled on an EKS 1.26 cluster running the VPC CNI:

linkerd install --crds | kubectl apply -f -
linkerd install-cni | kubectl apply -f -
linkerd install --linkerd-cni-enabled | kubectl apply -f -
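
Before the last step, a quick sanity check that the CNI DaemonSet itself is up (namespace assumed to be the default linkerd-cni) would be:

# the linkerd-cni DaemonSet pods should be Running on every node
kubectl -n linkerd-cni get pods -o wide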

Logs, error output, etc

VPC CNI

Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2023-08-24T12:40:40Z" level=info msg="Starting IPAM daemon... "
time="2023-08-24T12:40:40Z" level=info msg="Checking for IPAM connectivity... "
time="2023-08-24T12:40:41Z" level=info msg="Copying config file... "
time="2023-08-24T12:40:41Z" level=info msg="Successfully copied CNI plugin binary and config file."

Linkerd CNI

[2023-08-24 13:13:40] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2023-08-24 13:13:41] Installing CNI configuration for /host/etc/cni/net.d/10-aws.conflist
[2023-08-24 13:13:41] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://172.20.0.1:__KUBERNETES_SERVICE_PORT__",
[2023-08-24 13:13:41] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://172.20.0.1:443",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },

output of linkerd check -o short

kubernetes-api
-----------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
- No running pods for "linkerd-destination" 

does not finish...

Environment

Possible solution

No response

Additional context

We worked with this setup before (EKS 1.23) and are now upgrading directly to EKS 1.26. We have not changed any of the VPC CNI configuration since then, so I assume it might have something to do with the switch to containerd or some new defaults we are not aware of.

Would you like to work on fixing this bug?

None

mateiidavid commented 10 months ago

Hi @patrickdomnick! Are you able to look at the kubelet logs at all to see why the CNI plugin itself might be failing? Both VPC & Linkerd CNI logs you posted above are for the installers. It's likely that whatever error is being encountered is in the plugin executable. It might help us have a better idea of why the sandbox can't be created.
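
If it helps, on an EKS node the plugin's error is usually visible in the kubelet and containerd journals, e.g. (unit names assumed for an Amazon Linux 2 node):

# on the affected node, e.g. via SSH or SSM
journalctl -u kubelet --since "1 hour ago" | grep -i linkerd-cni
journalctl -u containerd --since "1 hour ago" | grep -i "failed to setup network"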

Centro1993 commented 9 months ago

Hello @mateiidavid, I am @patrickdomnick's co-worker, and I managed to fix the problem in his absence 🥳 The Linkerd CNI logs showed that the linkerd-cni executable on the node itself failed when calling the Kubernetes API to get the current pod's information. It failed with a "403 Forbidden", which we assumed was caused by incorrect Kubernetes credentials, but testing the kubeconfig from a pod showed that the config was valid. After a deep dive, including writing a wrapper around the CNI executable, we noticed that we were missing a NO_PROXY entry for the Kubernetes API on our nodes: the "403 Forbidden" came from our Squid proxy, not from Kubernetes.

We will fix this by adding the entry on our nodes. It would be sweet if it were possible to set environment variables in the Helmfile that the Linkerd CNI pod would pass on to the linkerd-cni executable on the node.
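
For anyone else running nodes behind an egress proxy, a minimal sketch of the fix, assuming the proxy is configured through containerd's systemd unit environment (the Squid address is a placeholder; 172.20.0.0/16 is our cluster service CIDR from the error above):

# hypothetical drop-in on the node; adjust the proxy address and CIDRs to your environment
sudo mkdir -p /etc/systemd/system/containerd.service.d
sudo tee /etc/systemd/system/containerd.service.d/http-proxy.conf <<'EOF'
[Service]
Environment="HTTP_PROXY=http://squid.internal:3128"
Environment="HTTPS_PROXY=http://squid.internal:3128"
Environment="NO_PROXY=172.20.0.0/16,10.0.0.0/8,.svc,.cluster.local,localhost,127.0.0.1,169.254.169.254"
EOF
sudo systemctl daemon-reload
sudo systemctl restart containerd

Since containerd invokes the CNI plugins, the linkerd-cni executable inherits this environment and then talks to the API server directly instead of going through the proxy.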

stale[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.