linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Policy controller fails to watch resources. #12755

Open xolott opened 1 week ago

xolott commented 1 week ago

What is the issue?

The policy controller fails when it tries to watch some Kubernetes resources (I think all of them). According to Cilium, not a single packet is dropped (I used Hubble to check this; a sketch of that check follows the curl output below), yet the controller reports that the connection was dropped. Using curl inside the container, I can make the same GET request to the API server and get a response, so the CNI is not dropping this connection.

$ curl -v --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" "https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531"

* Request completely sent off
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
< HTTP/2 200 
< audit-id: 568e2098-a3a2-4216-9503-3a8b7cdb1050
< cache-control: no-cache, private
< content-type: application/json
< x-kubernetes-pf-flowschema-uid: f2596b47-c5e3-439f-a6e7-a84503cb1560
< x-kubernetes-pf-prioritylevel-uid: c65e2c9b-924d-4ec0-848a-49025cdd9c73
< content-length: 165
< date: Wed, 19 Jun 2024 22:33:07 GMT
< 
{
  "apiVersion": "policy.linkerd.io/v1beta3",
  "items": [],
  "kind": "HTTPRouteList",
  "metadata": {
    "continue": "",
    "resourceVersion": "226670468"
  }
* Connection #0 to host 10.0.0.1 left intact
}
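
For reference, the Hubble check was roughly the following; treat the exact filter flags as an assumption rather than the precise command I ran:

# Watch for dropped flows toward the API server address (10.0.0.1);
# nothing showed up here while the watch streams were failing.
hubble observe --follow --verdict DROPPED --to-ip 10.0.0.1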

How can it be reproduced?

Using Terraform:

resource "helm_release" "cilium" {
  name       = "cilium"
  repository = "https://helm.cilium.io/"
  chart      = "cilium"
  version    = var.cilium_version
  namespace  = "kube-system"

  set {
    name  = "aksbyocni.enabled"
    value = true
  }

  set {
    name  = "nodeinit.enabled"
    value = true
  }

  set {
    name  = "hubble.relay.enabled"
    value = true
  }

  set {
    name  = "hubble.ui.enabled"
    value = true
  }
}

resource "kubernetes_namespace" "linkerd_namespace" {
  metadata {
    name = "linkerd"
  }
}
resource "kubernetes_namespace" "linkerd_viz_namespace" {
  metadata {
    name = "linkerd-viz"
    labels = {
      "linkerd.io/extension" = "viz"
    }
  }
}

resource "kubernetes_namespace" "linkerd_cni_namespace" {
  metadata {
    name = "linkerd-cni"
    labels = {
      "config.linkerd.io/admission-webhooks" = "disabled"
      "linkerd.io/cni-resource"              = "true"
    }
  }
}

resource "helm_release" "linkerd_cni" {
  depends_on = [kubernetes_namespace.linkerd_cni_namespace]
  name       = "linkerd-cni"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd2-cni"
  namespace  = "linkerd-cni"
  version    = "2024.3.1"
  wait       = true
}

resource "helm_release" "linkerd_crds" {
  depends_on = [kubernetes_namespace.linkerd_namespace, helm_release.linkerd_cni]
  name       = "linkerd-crds"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd-crds"
  namespace  = "linkerd"
  version    = "2024.3.1"
  wait       = true

  set {
    name  = "cniEnabled"
    value = true
  }
}

resource "helm_release" "linkerd" {
  depends_on = [helm_release.linkerd_crds]
  name       = "linkerd-control-plane"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd-control-plane"
  namespace  = "linkerd"
  version    = "2024.3.1"
  wait       = true

  values = [
    "${file("${path.module}/values-ha.yaml")}"
  ]

  set {
    name  = "controllerReplicas"
    value = var.replicas
  }

  set {
    name  = "cniEnabled"
    value = true
  }

  set {
    name  = "identity.externalCA"
    value = true
  }

  set {
    name  = "identity.issuer.scheme"
    value = "kubernetes.io/tls"
  }

  set {
    name  = "disableHeartBeat"
    value = true
  }

  set {
    name  = "webhookFailurePolicy"
    value = "Fail"
  }

  set {
    name  = "networkValidator.connectAddr"
    value = "0.0.0.0:4140"
  }
}

resource "helm_release" "linker-viz" {
  depends_on = [helm_release.linkerd_crds, module.linkerd_viz_policy]
  name       = "linkerd-viz"
  repository = "https://helm.linkerd.io/edge"
  chart      = "linkerd-viz"
  namespace  = "linkerd-viz"
  version    = "2024.6.1"
  wait       = true
}

values-ha.yaml

# This values.yaml file contains the values needed to enable HA mode.
# Usage:
#   helm install -f values-ha.yaml

# -- Create PodDisruptionBudget resources for each control plane workload
enablePodDisruptionBudget: true

controller:
  # -- sets pod disruption budget parameter for all deployments
  podDisruptionBudget:
    # -- Maximum number of pods that can be unavailable during disruption
    maxUnavailable: 1

# -- Specify a deployment strategy for each control plane workload
deploymentStrategy:
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 25%

# -- add PodAntiAffinity to each control plane workload
enablePodAntiAffinity: true

# nodeAffinity:

# proxy configuration
proxy:
  resources:
    cpu:
      request: 100m
    memory:
      limit: 250Mi
      request: 20Mi

# controller configuration
controllerReplicas: 3
controllerResources: &controller_resources
  cpu: &controller_resources_cpu
    limit: ""
    request: 100m
  memory:
    limit: 250Mi
    request: 50Mi
destinationResources: *controller_resources

# identity configuration
identityResources:
  cpu: *controller_resources_cpu
  memory:
    limit: 250Mi
    request: 10Mi

# heartbeat configuration
heartbeatResources: *controller_resources

# proxy injector configuration
proxyInjectorResources: *controller_resources
webhookFailurePolicy: Fail

# service profile validator configuration
spValidatorResources: *controller_resources

# flag for linkerd check
highAvailability: true

Logs, error output, etc

Every few minutes, the policy controller logs:

2024-06-19T22:29:04.143485Z  INFO networkauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:04.367419Z  INFO meshtlsauthentications: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:04.586154Z  INFO authorizationpolicies: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:05.138944Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:05.401101Z  INFO httproutes.gateway.networking.k8s.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:06.185944Z  INFO httproutes.policy.linkerd.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-19T22:29:06.443435Z  INFO services: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
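
These lines come from the policy container in the linkerd-destination pods; assuming the default container name, they can be pulled with something like:

kubectl logs -n linkerd deploy/linkerd-destination -c policy -f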

Running the controller with debug-level logging, I can see this:

2024-06-14T23:01:36.098797Z DEBUG Connection{peer=Server}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(37), flags: (0x4: END_HEADERS) }
2024-06-14T23:01:36.098872Z DEBUG Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(37) }
2024-06-14T23:01:36.098892Z DEBUG Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(37), flags: (0x1: END_STREAM) }
2024-06-14T23:01:36.099070Z DEBUG Connection{peer=Server}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(37), flags: (0x5: END_HEADERS | END_STREAM) }
2024-06-14T23:01:36.840186Z DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-bcdf65946-6spv7
2024-06-14T23:01:39.458677Z DEBUG hyper::proto::h1::conn: incoming body decode error: Connection reset by peer (os error 104)
2024-06-14T23:01:39.458754Z DEBUG rustls::common_state: Sending warning alert CloseNotify    
2024-06-14T23:01:39.458781Z DEBUG hyper::proto::h1::conn: error shutting down IO: Broken pipe (os error 32)
2024-06-14T23:01:39.458903Z DEBUG hyper::client::client: client connection error: error shutting down connection: Broken pipe (os error 32)
2024-06-14T23:01:39.458914Z DEBUG servers: kube_runtime::watcher: watcher error: ReadEvents(Custom { kind: Other, error: hyper::Error(Body, hyper::Error(Body, Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })) })
2024-06-14T23:01:39.458950Z  INFO servers: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-14T23:01:39.459034Z DEBUG servers: tower::buffer::worker: service.ready=true processing request
2024-06-14T23:01:39.459097Z DEBUG servers:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta2/servers?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: kube_client::client::builder: requesting
2024-06-14T23:01:39.459151Z DEBUG servers:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta2/servers?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: hyper::client::pool: reuse idle connection for ("https", 10.0.0.1)
2024-06-14T23:01:39.459450Z DEBUG hyper::proto::h1::io: flushed 1953 bytes
2024-06-14T23:01:39.460952Z DEBUG hyper::proto::h1::io: parsed 7 headers
2024-06-14T23:01:39.460968Z DEBUG hyper::proto::h1::conn: incoming body is chunked encoding
2024-06-14T23:01:39.825269Z DEBUG hyper::proto::h1::conn: incoming body decode error: Connection reset by peer (os error 104)
2024-06-14T23:01:39.825333Z DEBUG rustls::common_state: Sending warning alert CloseNotify    
2024-06-14T23:01:39.825365Z DEBUG hyper::proto::h1::conn: error shutting down IO: Broken pipe (os error 32)
2024-06-14T23:01:39.825475Z DEBUG hyper::client::client: client connection error: error shutting down connection: Broken pipe (os error 32)
2024-06-14T23:01:39.827989Z DEBUG httproutes.policy.linkerd.io: kube_runtime::watcher: watcher error: ReadEvents(Custom { kind: Other, error: hyper::Error(Body, hyper::Error(Body, Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })) })
2024-06-14T23:01:39.828036Z  INFO httproutes.policy.linkerd.io: kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2024-06-14T23:01:39.828120Z DEBUG httproutes.policy.linkerd.io: tower::buffer::worker: service.ready=true processing request
2024-06-14T23:01:39.828184Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: kube_client::client::builder: requesting
2024-06-14T23:01:39.828270Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: hyper::client::connect::http: connecting to 10.0.0.1:443
2024-06-14T23:01:39.829996Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: hyper::client::connect::http: connected to 10.0.0.1:443
2024-06-14T23:01:39.830113Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::hs: Resuming session    
2024-06-14T23:01:39.845683Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::hs: Using ciphersuite TLS13_AES_128_GCM_SHA256    
2024-06-14T23:01:39.845723Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::tls13: Not resuming    
2024-06-14T23:01:39.845844Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::tls13: TLS1.3 encrypted extensions: []    
2024-06-14T23:01:39.845855Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::hs: ALPN protocol is None    
2024-06-14T23:01:39.845877Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::tls13: Got CertificateRequest CertificateRequestPayloadTLS13 { context: , extensions: [Unknown(UnknownExtension { typ: StatusRequest, payload:  }), Unknown(UnknownExtension { typ: SCT, payload:  }), SignatureAlgorithms([RSA_PSS_SHA256, ECDSA_NISTP256_SHA256, ED25519, RSA_PSS_SHA384, RSA_PSS_SHA512, RSA_PKCS1_SHA256, RSA_PKCS1_SHA384, RSA_PKCS1_SHA512, ECDSA_NISTP384_SHA384, ECDSA_NISTP521_SHA512, RSA_PKCS1_SHA1, ECDSA_SHA1_Legacy]), AuthorityNames([DistinguishedName(300d310b3009060355040313026361), DistinguishedName(3011310f300d060355040313066167672d6361)])] }    
2024-06-14T23:01:39.845932Z DEBUG httproutes.policy.linkerd.io:HTTP{http.method=GET http.url=https://10.0.0.1/apis/policy.linkerd.io/v1beta3/httproutes?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=228616531 otel.name="watch" otel.kind="client"}: rustls::client::common: Client auth requested but no cert/sigscheme available    
2024-06-14T23:01:39.846540Z DEBUG hyper::proto::h1::io: flushed 1956 bytes
2024-06-14T23:01:39.848708Z DEBUG hyper::proto::h1::io: parsed 7 headers
2024-06-14T23:01:39.848724Z DEBUG hyper::proto::h1::conn: incoming body is chunked encoding
2024-06-14T23:01:40.045298Z DEBUG hyper::proto::h1::conn: read eof
2024-06-14T23:01:41.039244Z DEBUG hyper::proto::h1::role: setting h1 header read timeout timer
2024-06-14T23:01:41.039253Z DEBUG hyper::proto::h1::role: setting h1 header read timeout timer
2024-06-14T23:01:41.039320Z DEBUG hyper::proto::h1::io: parsed 3 headers
2024-06-14T23:01:41.039329Z DEBUG hyper::proto::h1::conn: incoming body is empty
2024-06-14T23:01:41.039357Z DEBUG hyper::proto::h1::io: parsed 3 headers
2024-06-14T23:01:41.039372Z DEBUG hyper::proto::h1::conn: incoming body is empty
2024-06-14T23:01:41.039442Z DEBUG hyper::proto::h1::io: flushed 107 bytes
2024-06-14T23:01:41.040019Z DEBUG hyper::proto::h1::io: flushed 107 bytes
2024-06-14T23:01:41.752835Z DEBUG hyper::proto::h1::conn: incoming body decode error: Connection reset by peer (os error 104)
2024-06-14T23:01:41.752898Z DEBUG rustls::common_state: Sending warning alert CloseNotify    
2024-06-14T23:01:41.752927Z DEBUG hyper::proto::h1::conn: error shutting down IO: Broken pipe (os error 32)
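
Debug logging was turned on through the chart's policyController.logLevel value; assuming a plain helm upgrade (the linkerd-edge repo alias here is hypothetical and stands for https://helm.linkerd.io/edge), roughly:

helm upgrade linkerd-control-plane linkerd-edge/linkerd-control-plane \
  -n linkerd --reuse-values \
  --set policyController.logLevel=debug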

output of linkerd check -o short

----------------
‼ trust anchors are valid for at least 60 days
    Anchors expiring soon:
    * 329674952538096284210311201227367316359 root.linkerd.cluster.local will expire on 2024-06-19T23:29:52Z
    see https://linkerd.io/2.13/checks/#l5d-identity-trustAnchors-not-expiring-soon for hints
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2024-06-19T23:16:07Z
    see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    unsupported version channel: stable-2.13.6
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.3.1 but the latest edge version is 24.6.2
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.3.1 but cli running stable-2.13.6
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-669ccf4f8-rx85b (edge-24.3.1)
    * linkerd-destination-669ccf4f8-ssqd2 (edge-24.3.1)
    * linkerd-identity-7f6f95b45c-sndb2 (edge-24.3.1)
    * linkerd-identity-7f6f95b45c-zhcck (edge-24.3.1)
    * linkerd-proxy-injector-6466cbdf79-9ftwx (edge-24.3.1)
    * linkerd-proxy-injector-6466cbdf79-n5gcs (edge-24.3.1)
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-669ccf4f8-rx85b running edge-24.3.1 but cli running stable-2.13.6
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.13/checks/#l5d-injection-disabled for hints

linkerd-viz
-----------
‼ viz extension proxies are healthy
    Some pods do not have the current trust bundle and must be restarted:
    * metrics-api-5f994ffcd4-gt62w
    * prometheus-54f9848dd8-pllbj
    * tap-c745745f5-bptpg
    * tap-injector-5b5898c87-j5tl7
    * web-7c948c6b48-hzsz4
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-healthy for hints
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
    * metrics-api-5f994ffcd4-gt62w (edge-24.3.1)
    * prometheus-54f9848dd8-pllbj (edge-24.3.1)
    * tap-c745745f5-bptpg (edge-24.3.1)
    * tap-injector-5b5898c87-j5tl7 (edge-24.3.1)
    * web-7c948c6b48-hzsz4 (edge-24.3.1)
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-5f994ffcd4-gt62w running edge-24.3.1 but cli running stable-2.13.6
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cli-version for hints

Environment

Possible solution

No response

Additional context

I modified the policy controller container image by adding ls, wget, sh, and curl.
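
An alternative that avoids rebuilding the image, assuming ephemeral containers are enabled on the cluster, would be to attach a debug container to one of the destination pods, e.g.:

# Attach a throwaway curl container that shares the policy container's
# process namespace (the image choice is illustrative).
POD=$(kubectl get pod -n linkerd -l linkerd.io/control-plane-component=destination -o name | head -n 1)
kubectl debug -it -n linkerd "$POD" --image=curlimages/curl --target=policy -- sh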

Would you like to work on fixing this bug?

None

alpeb commented 1 week ago

It seems the stream is getting torn down every few minutes. Can you tell how often that happens, and whether the interval is consistent? Beyond the log entries, is this causing changes to policy resources (Server, HTTPRoute, AuthorizationPolicy, etc.) to go undetected?

xolott commented 1 week ago

The errors are logged every 5 or 10 minutes (one or the other, not a range) for about an hour. Sometimes all the resource watchers fail around the same time; sometimes they take turns: HTTPRoutes fail for an hour, then MeshTLSAuthentications fail for the next.

The last log entry was six hours ago, though.

I don't directly use those resources; we only have the default resources created during installation. Is there any way to test this?
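
The best test I could come up with is applying a throwaway Server and watching whether the controller notices it, along these lines (all names here are made up):

# Hypothetical smoke test: create a Server and check, e.g. at debug log
# level, whether the policy controller's watch picks it up.
cat <<EOF | kubectl apply -f -
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  name: watch-smoke-test
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: watch-smoke-test
  port: 8080
EOF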