linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Requests continuously routed to terminating pods - on AWS with Calico, not Cilium #11186

Closed rufreakde closed 1 year ago

rufreakde commented 1 year ago

What is the issue?

Randomly (depending on rolling restarts, I assume), the iptables rules of the Linkerd proxies that route to other pods within the service mesh just fail with an error. The error is extremely similar to the one in the following issue: https://github.com/linkerd/linkerd2/issues/6238#issuecomment-919058743

However, we are using Calico, not Cilium.

The same setup works without Linkerd (linkerd.io/inject: disabled on the source/target pods): the connection is lost and reopened, and no request fails.
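
For reference, a minimal sketch of how injection is disabled for that comparison, assuming the standard linkerd.io/inject annotation on the workload's pod template (the deployment name and image here are illustrative, not from our cluster):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-sink            # hypothetical workload used for the no-mesh comparison
spec:
  selector:
    matchLabels:
      app: example-sink
  template:
    metadata:
      labels:
        app: example-sink
      annotations:
        linkerd.io/inject: disabled   # proxy is not injected; traffic bypasses Linkerd
    spec:
      containers:
        - name: app
          image: nginx:1.25           # placeholder image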

How can it be reproduced?

- create an AWS cluster
- install the Linkerd Helm charts (manifests below):

---
kind: Namespace
apiVersion: v1
metadata:
  name: linkerd-cni
  labels:
    linkerd.io/cni-resource: "true"
    config.linkerd.io/admission-webhooks: disabled
    pod-security.kubernetes.io/enforce: privileged
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: linkerd-cni
  namespace: linkerd-cni
  labels:
    linkerd.io/cni-resource: "true"
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: linkerd-cni
  labels:
    linkerd.io/cni-resource: "true"
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "namespaces", "services"]
    verbs: ["list", "get", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: linkerd-cni
  labels:
    linkerd.io/cni-resource: "true"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: linkerd-cni
subjects:
  - kind: ServiceAccount
    name: linkerd-cni
    namespace: linkerd-cni
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: linkerd-cni-config
  namespace: linkerd-cni
  labels:
    linkerd.io/cni-resource: "true"
data:
  dest_cni_net_dir: "/etc/cni/net.d"
  dest_cni_bin_dir: "/opt/cni/bin"
  # The CNI network configuration to install on each node. The special
  # values in this config will be automatically populated.
  cni_network_config: |-
    {
      "name": "linkerd-cni",
      "type": "linkerd-cni",
      "log_level": "info",
      "policy": {
          "type": "k8s",
          "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
          "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
      },
      "kubernetes": {
          "kubeconfig": "__KUBECONFIG_FILEPATH__"
      },
      "linkerd": {
        "incoming-proxy-port": 4143,
        "outgoing-proxy-port": 4140,
        "proxy-uid": 2102,
        "ports-to-redirect": [],
        "inbound-ports-to-ignore": ["4191","4190"],
        "simulate": false,
        "use-wait-flag": false
      }
    }
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: linkerd-cni
  namespace: linkerd-cni
  labels:
    k8s-app: linkerd-cni
    linkerd.io/cni-resource: "true"
  annotations:
    linkerd.io/created-by: linkerd/cli stable-2.13.5
    kube-score/ignore: container-security-context-user-group-id,container-resources,container-image-pull-policy,container-ephemeral-storage-request-and-limit,pod-networkpolicy
spec:
  selector:
    matchLabels:
      k8s-app: linkerd-cni
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        k8s-app: linkerd-cni
      annotations:
        linkerd.io/created-by: linkerd/cli stable-2.13.5
        linkerd.io/cni-resource: "true"
        linkerd.io/inject: disabled
    spec:
      tolerations:
        - operator: Exists
      nodeSelector:
        kubernetes.io/os: linux
      hostNetwork: true
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: linkerd-cni
      containers:
        # This container installs the linkerd CNI binaries
        # and CNI network config file on each node. The install
        # script copies the files into place and then sleeps so
        # that Kubernetes doesn't keep trying to restart it.
        - name: install-cni
          image: cr.l5d.io/linkerd/cni-plugin:v1.1.1
          imagePullPolicy:
          env:
            - name: DEST_CNI_NET_DIR
              valueFrom:
                configMapKeyRef:
                  name: linkerd-cni-config
                  key: dest_cni_net_dir
            - name: DEST_CNI_BIN_DIR
              valueFrom:
                configMapKeyRef:
                  name: linkerd-cni-config
                  key: dest_cni_bin_dir
            - name: CNI_NETWORK_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: linkerd-cni-config
                  key: cni_network_config
            - name: SLEEP
              value: "true"
          lifecycle:
            # In some edge-cases this helps ensure that cleanup() is called in the container's script
            # https://github.com/linkerd/linkerd2/issues/2355
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - kill -15 1; sleep 15s
          volumeMounts:
            - mountPath: /host/opt/cni/bin
              name: cni-bin-dir
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
            - mountPath: /tmp
              name: linkerd-tmp-dir
          securityContext:
            readOnlyRootFilesystem: true
            privileged: false
          resources:
      volumes:
        - name: cni-bin-dir
          hostPath:
            path: /opt/cni/bin
        - name: cni-net-dir
          hostPath:
            path: /etc/cni/net.d
        - name: linkerd-tmp-dir
          emptyDir: {}
---

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: linkerd-control-plane
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.teams: ms-cluster-scoped
    argocd.argoproj.io/sync-wave: "1"
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
spec:
  project: default
  source:
    chart: linkerd-control-plane
    repoURL: https://helm.linkerd.io/stable
    targetRevision: 1.12.5
    helm:
      releaseName: linkerd-control-plane
      values: |
        # -- Create PodDisruptionBudget resources for each control plane workload
        enablePodDisruptionBudget: true

        # -- Specify a deployment strategy for each control plane workload
        deploymentStrategy:
          rollingUpdate:
            maxUnavailable: 1
            maxSurge: 25%

        # -- add PodAntiAffinity to each control plane workload (only for production with at least 4 nodes)
        enablePodAntiAffinity: false

        # nodeAffinity:

        # proxy configuration
        proxy:
          resources:
            cpu:
              request: 50m
            memory:
              limit: 250Mi
              request: 20Mi

        # controller configuration
        controllerReplicas: 2
        controllerResources:
          cpu:
            limit: ""
            request: 50m
          memory:
            limit: 250Mi
            request: 50Mi
        destinationResources:
          cpu:
            limit: ""
            request: 50m
          memory:
            limit: 250Mi
            request: 50Mi

        # identity configuration
        identityResources:
          cpu:
            limit: ""
            request: 50m
          memory:
            limit: 250Mi
            request: 10Mi

        # heartbeat configuration
        heartbeatResources:
          cpu:
            limit: ""
            request: 50m
          memory:
            limit: 250Mi
            request: 10Mi

        # proxy injector configuration
        proxyInjectorResources:
          cpu:
            limit: ""
            request: 50m
          memory:
            limit: 250Mi
            request: 10Mi

        webhookFailurePolicy: Ignore

        # service profile validator configuration
        spValidatorResources:
          cpu:
            limit: ""
            request: 50m
          memory:
            limit: 250Mi
            request: 10Mi

        # Public certificate
        identityTrustAnchorsPEM: |
          -----BEGIN CERTIFICATE-----
          ...
          -----END CERTIFICATE-----

        # TLS identity
        identity:
          issuer:
            scheme: kubernetes.io/tls

        # Allow proxy startup plugin instead of init container (needed for SecurityPodAdmission)
        cniEnabled: true
  destination:
    server: https://kubernetes.default.svc
    namespace: linkerd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - Validate=true
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
      - ApplyOutOfSync=true
---
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: linkerd-crds
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.teams: ms-cluster-scoped
    argocd.argoproj.io/sync-wave: "0"
spec:
  project: default
  source:
    chart: linkerd-crds
    repoURL: https://helm.linkerd.io/stable
    targetRevision: 1.6.1
    helm:
      releaseName: linkerd-crds
  destination:
    server: https://kubernetes.default.svc
    namespace: linkerd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - Validate=false
      - Replace=true
  ignoreDifferences:
    - group: apiextensions.k8s.io
      kind: CustomResourceDefinition
      jsonPointers:
        - /spec/names/shortNames
---
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: linkerd-viz
  namespace: argocd
  generateName: hook-linkerd-viz-
  annotations:
    notifications.argoproj.io/subscribe.teams: ms-cluster-scoped
    argocd.argoproj.io/sync-wave: "5"
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
    argocd.argoproj.io/hook: PostSync
spec:
  project: default
  source:
    chart: linkerd-viz
    repoURL: https://helm.linkerd.io/stable
    targetRevision: 30.8.5
    helm:
      releaseName: linkerd-viz
      parameters:
        - name: prometheus.enabled
          value: "false"
        - name: linkerdNamespace
          value: linkerd
        - name: prometheusUrl
          value: http://up-monitoring-kube-prometh-prometheus.up-monitoring.svc.cluster.local:9090
  destination:
    server: https://kubernetes.default.svc
    namespace: linkerd-viz
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - Validate=true
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
      - ApplyOutOfSync=true

- wait until some "sink" pod gets a rolling-update restart. Requests from a pod that reaches out to this pod will then fail (not happening every time, though).

Logs, error output, etc

Upstream (callee) pod log: sometimes a 111 (connection refused) error, sometimes it recovers, sometimes an outgoing error; not very consistent. It started to happen when we migrated from the init container to the CNI plugin.

Downstream (sink) pod log (parse_sni failed):

linkerd-proxy [ 7398.699944s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_tls::server: Peeked bytes from TCP stream sz=0
linkerd-proxy [ 7398.699972s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_tls::server: Attempting to buffer TLS ClientHello after incomplete peek
linkerd-proxy [ 7398.699989s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_tls::server: Reading bytes from TCP stream buf.capacity=8192
linkerd-proxy [ 7398.700007s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_tls::server: Read bytes from TCP stream buf.len=108
linkerd-proxy [ 7398.700022s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_tls::server::client_hello: parse_sni: failed to parse up to SNI
linkerd-proxy [ 7398.700059s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_detect: Starting protocol detection capacity=1024 timeout=1s
linkerd-proxy [ 7398.700086s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_proxy_http::detect: Reading capacity=1024
linkerd-proxy [ 7398.700485s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_proxy_http::detect: Read sz=108
linkerd-proxy [ 7398.700537s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_proxy_http::detect: Checking H2 preface
linkerd-proxy [ 7398.700556s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_proxy_http::detect: Parsing HTTP/1 message
linkerd-proxy [ 7398.700574s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_proxy_http::detect: Matched HTTP/1
linkerd-proxy [ 7398.700590s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_detect: DetectResult protocol=Some(Http1) elapsed=505.827µs
linkerd-proxy [ 7398.700615s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_detect: Dispatching connection
linkerd-proxy [ 7398.700633s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_proxy_http::server: Creating HTTP service version=Http1
linkerd-proxy [ 7398.700653s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_proxy_http::server: Handling as HTTP version=Http1
linkerd-proxy [ 7398.700697s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_http_route: Finding matching route routes=2
linkerd-proxy [ 7398.700722s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_http_route: hosts=[]
linkerd-proxy [ 7398.700738s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_http_route: rules=1
linkerd-proxy [ 7398.700754s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_http_route: matches!
linkerd-proxy [ 7398.700769s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_http_route: hosts=[]
linkerd-proxy [ 7398.700784s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_http_route: rules=1
linkerd-proxy [ 7398.700799s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_http_route: matches!
linkerd-proxy [ 7398.700814s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_app_inbound::policy::http: Request authorized server.group= server.kind=default server.name=all-unauthenticated route.group= route.kind=default route.name=probe authz.group= authz.kind=default authz.name=probe client.tls=None(NoClientHello) client.ip=10.250.6.92
linkerd-proxy [ 7398.700905s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
linkerd-proxy [ 7398.700954s] TRACE ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_detect: Connection completed
linkerd-proxy [ 7398.700975s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=10.250.6.92:56518}: linkerd_app_core::serve: Connection closed

output of linkerd check -o short

linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-08-06T03:16:29Z
    see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

Status check results are √

Environment

Possible solution

We saw this comment: https://github.com/linkerd/linkerd2/issues/6238#issuecomment-919058743. But we are really not sure how it is meant, whether the configuration there is complete, or even whether this is the same issue. The problem with the outgoing 504 errors does appear consistently, though.

Any idea what we need to check to understand the problem in more detail? The flakiness is very problematic.

Additional context

No response

Would you like to work on fixing this bug?

no

mateiidavid commented 1 year ago

Hey @rufreakde, thanks for filing this! I'll try to dissect the problem first.

Randomly (depending on rolling restarts, I assume), the iptables rules of the Linkerd proxies that route to other pods within the service mesh just fail with an error. The error is extremely similar to the following issue

Hm. Our iptables stack doesn't really handle routing to other pods. Unless iptables errors out, the assumption is that it does what it is supposed to do: redirect packets through the proxy. The issue you linked (https://github.com/linkerd/linkerd2/issues/6238#issuecomment-919058743) describes a different problem that some CNI implementations tend to have.

A CNI implementation that uses eBPF can sometimes take control of your routing at a much lower level; load balancing in these cases is done at TCP accept() time (i.e. socket-level load balancing). Unfortunately, there isn't really a good way to solve this: there's no way to interoperate with it at a higher level -- you're stuck with the decision the eBPF-based implementation made on your behalf. However, there is an easy way out: you can simply disable this functionality and let Linkerd do the load balancing higher up the stack.

I wrote about this on a docs page. The instructions apply to Cilium, but I imagine Calico has a similar setting. It's important to note that CNI implementations that do this short-circuit a routing decision that is normally kube-proxy's job, so this shouldn't have an impact on other CNI features working well (i.e. you can safely disable it, afaik).
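
For Cilium specifically, the relevant knob is the scope of the socket-level load balancer. A hedged sketch of the Helm values involved (assuming Cilium >= 1.12; verify the option against the Cilium docs for your version):

# values.yaml excerpt for the Cilium Helm chart (sketch, not part of this cluster's config)
socketLB:
  # Restrict socket-level load balancing to the host network namespace,
  # so connections made inside pods are left for kube-proxy/Linkerd to balance.
  hostNamespaceOnly: true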

I'm not sure how to interpret your logs. What's the relationship between the connection failures and CNIs doing socket-level load balancing? Perhaps that's another avenue we can go down to understand what the problem is.

Edit: sorry for the close/re-open, I/O difficulties on my part :)

bjoernw commented 1 year ago

Seems like the default dataplane setting in Calico is Iptables, which afaik means Calico isn't getting into the middle of load-balancing decisions: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.CalicoNetworkSpec

This describes what Calico would do if you wanted it to make load-balancing decisions: https://docs.tigera.io/calico/latest/about/kubernetes-training/about-kubernetes-services#calico-ebpf-native-service-handling
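
One way to confirm which dataplane is active is to inspect the tigera-operator Installation resource; a sketch of the relevant field (the resource name default follows the operator's convention -- adjust if your cluster differs):

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    # "Iptables" (the default) leaves service load balancing to kube-proxy;
    # "BPF" enables Calico's eBPF dataplane with native (socket-level) service handling.
    linuxDataplane: Iptables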

mateiidavid commented 1 year ago

Seems like the default dataplane setting in Calico is Iptables, which afaik means Calico isn't getting into the middle of load-balancing decisions: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.CalicoNetworkSpec

Unless BPF is used :) Which would be great to confirm here.

This describes what Calico would do if you wanted it to make load-balancing decisions: https://docs.tigera.io/calico/latest/about/kubernetes-training/about-kubernetes-services#calico-ebpf-native-service-handling

Nice! I also like this blog post they've written: Calico eBPF data plane deep dive. I think it's a bit more in-depth and helps to demystify what is happening behind the scenes (although it is a longer read).

Anyway, eBPF is a bit of a digression. Perhaps you can help me out by letting me know why you think this looks similar to https://github.com/linkerd/linkerd2/issues/6238#issuecomment-919058743 -- is it just the error that looks similar? That issue in particular was about BPF, hence why I'm making the association.

Proxy logs from your client/server would also be helpful here.

rufreakde commented 1 year ago

Hi all, sorry for my late reply; I will need some time to read through the sources shared here. I will also need to check with our central infrastructure team regarding the Calico eBPF configuration. On our side we only use the network policies, and they work fine, so I need to recheck that.

I will also check out the doc, but I think the other information about Calico is more important.

For the logs: these were the logs for the proxy "server"; the proxy "client" logs I would need to try to find again. I will try to get everything ready after reading up. Thanks for all the help!

rufreakde commented 1 year ago

Okay, an update from our side: we were able to identify the issue as being within the CNI plugin. So we switched to init containers and got a security exception for privileged containers. It seems the CNI plugin (at least on AWS) is not stable enough yet.

I was not able to get "other" logs. Most logs just show seemingly arbitrary 111 (connection refused) errors, nothing more.

mateiidavid commented 1 year ago

@rufreakde thanks for coming back with an answer! So, as I understand, your issue has been solved?

We were able to identify the issue within the CNI plugin

Is there anything in particular that pointed to the CNI plugin being at fault? And for my understanding, by CNI plugin you mean linkerd's CNI plugin?

rufreakde commented 1 year ago

@rufreakde thanks for coming back with an answer! So, as I understand, your issue has been solved?

We were able to identify the issue within the CNI plugin

Is there anything in particular that pointed to the CNI plugin being at fault? And for my understanding, by CNI plugin you mean linkerd's CNI plugin?

Yes, after disabling CNI (removing it from the cluster) and using init containers, the issues do not appear anymore (so far, so good).

Yes, we used the Linkerd CNI plugin before. So now in our Helm chart we have cniEnabled=false.
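
In terms of the linkerd-control-plane values shown earlier, the switch back to the init-container approach is roughly this change (a sketch; with the CNI plugin disabled, the injected linkerd-init container sets up iptables itself and needs NET_ADMIN/NET_RAW capabilities, which is why the privileged-container exception was required):

# linkerd-control-plane Helm values (excerpt)
# Disable the CNI plugin; the injected linkerd-init container configures iptables instead.
cniEnabled: false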

@mateiidavid just in general, how we came to suspect the CNI plugin:

mateiidavid commented 1 year ago

Thanks for the information! Since you managed to find a workaround, I'll be closing this issue. Let us know if you want it re-opened.