linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

linkerd-destination seemingly serving incorrect Pod IPs causing connection errors #8956

Closed: dwilliams782 closed this issue 2 years ago

dwilliams782 commented 2 years ago

What is the issue?

Slack thread here.

The proxy for one of our applications is full of failed to connect errors. Examples:

[   584.018861s]  WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.17.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s

 [  2104.773832s]  WARN ThreadId(01) linkerd_reconnect: Failed to connect error=connect timed out after 1s

 [  1785.088002s]  WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.19.9:4317}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

Debug logs aren’t much use:

DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Recovering

offer-price-index-service-ie-66bbbfcff-f8kbg [  1851.149239s] DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Disconnected backoff=true

The service otel-agent.open-telemetry.svc.cluster.local:4317 points to a DaemonSet, so its pods are constantly being replaced as our preemptible nodes cycle.

I caught a failed request in Tap, where all other requests are ~1ms:

image

I then found this log entry:

2022-07-22 12:48:54
WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.4:4317}: linkerd_reconnect: Service failed error=channel closed

10.225.13.4 belonged to pod otel-agent-6kg56 until yesterday (21/07) at 13:52. That IP was then reassigned to pod tap-injector-7c4b6dc5f9-7wsxn. For some reason, the linkerd-proxy is still sending requests to IPs that no longer belong to that DaemonSet.

I run three replicas of linkerd-destination, which at the time of writing had:

linkerd-destination-8f9d689fc-lxl5f       4/4     Running   0          3d22h
linkerd-destination-8f9d689fc-mddv5       4/4     Running   0          21h
linkerd-destination-8f9d689fc-rn8dn       4/4     Running   0          21h

I searched the last two days of linkerd-destination logs for any entries relating to the stale otel-agent pod and found two:

2022-07-21 14:51:44
 linkerd-destination-8f9d689fc-lxl5f time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
2022-07-21 14:51:44
 linkerd-destination-8f9d689fc-zqxhr time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent

zqxhr has since been deleted but lxl5f is still alive. My theory is that this linkerd-destination pod has entered an unhealthy state but still reports ready, and is serving "bad information" to the linkerd-proxy containers that request destination information.

I then killed the linkerd-destination pod showing these errors at 13:10 and have not seen the error come back since.

How can it be reproduced?

Not confirmed, but possibly by removing pods from a DaemonSet and checking whether the destination controller updates?

Logs, error output, etc

See section above.

output of linkerd check -o short

I have a mismatched CLI version, so ignore the version warnings:

Linkerd core checks
===================

linkerd-version
---------------
‼ cli is up-to-date
    is running version 22.6.2 but the latest edge version is 22.7.2
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 22.7.1 but the latest edge version is 22.7.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
    * alpine-curl-6cd54fb65d-j6z7p (edge-22.7.1)
    * linkerd-destination-8f9d689fc-jrkr2 (edge-22.7.1)
    * linkerd-destination-8f9d689fc-mddv5 (edge-22.7.1)
    * linkerd-destination-8f9d689fc-rn8dn (edge-22.7.1)
    * linkerd-identity-55c8d467bf-hr479 (edge-22.7.1)
    * linkerd-identity-55c8d467bf-r4bd4 (edge-22.7.1)
    * linkerd-identity-55c8d467bf-zvnvl (edge-22.7.1)
    * linkerd-proxy-injector-84c59f7bf9-5rnj4 (edge-22.7.1)
    * linkerd-proxy-injector-84c59f7bf9-r692g (edge-22.7.1)
    * linkerd-proxy-injector-84c59f7bf9-wjh44 (edge-22.7.1)
    * nginx-67888f898b-x9mz4 (edge-22.7.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    alpine-curl-6cd54fb65d-j6z7p running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

Linkerd extensions checks
=========================

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
    * metrics-api-794fbbf94b-z7b5r (edge-22.7.1)
    * prometheus-6cd858c97b-rh9xr (edge-22.7.1)
    * tap-68b97d8d8b-szrc9 (edge-22.7.1)
    * tap-68b97d8d8b-tvj55 (edge-22.7.1)
    * tap-68b97d8d8b-xhs69 (edge-22.7.1)
    * tap-injector-7c4b6dc5f9-7wsxn (edge-22.7.1)
    * web-6576dcfdb-g4mhd (edge-22.7.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-794fbbf94b-z7b5r running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Kubernetes version: 1.22
Host: GKE
Linkerd version: edge-22.7.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

olix0r commented 2 years ago

You could try running linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 to query the destination controller directly--this would help us know whether the destination controller hasn't seen those changes or if the proxy has stale service discovery. In the latter case, full proxy debug logs from the client would probably be a helpful starting point.
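
For example, something along these lines lets you compare the controller's view against what Kubernetes itself reports (rough sketch; adjust the namespace/service to yours):

:; linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317
:; kubectl get endpoints otel-agent -n open-telemetry -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\t"}{.targetRef.name}{"\n"}{end}'

If the two disagree, the stale state is in the destination controller; if they agree but the proxy still dials the old IP, the proxy has stale discovery.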

olix0r commented 2 years ago

I tried this locally with a trivial setup and wasn't able to encounter this error easily:

Set up a local cluster with multiple nodes:

:; k3d cluster create l5d-test --kubeconfig-update-default --kubeconfig-switch-context=false --image=+latest --no-lb --k3s-arg "--no-deploy=local-storage,traefik,servicelb,metrics-server@server:*" --agents=2

After installing Linkerd, install a DaemonSet application:

:; linkerd inject https://run.linkerd.io/emojivoto-daemonset.yml | k apply -f -

Delete one instance of one of the servers:

:; k delete po -n emojivoto emoji-vzlnd

Check each client pod for errors:

:; k logs po -n emojivoto web-... linkerd-proxy

Unfortunately, there were no errors.

dwilliams782 commented 2 years ago

You could try running linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 to query the destination controller directly--this would help us know whether the destination controller hasn't seen those changes or if the proxy has stale service discovery. In the latter case, full proxy debug logs from the client would probably be a helpful starting point.

I wasn't aware of the diagnostics endpoints command. This returns the correct list of IPs for the DaemonSet when compared against the pods, so the problem doesn't seem to be here. Do you know if this command picks one destination pod at random or queries all of them?

I can send full client debug logs but this will be a sizeable file, how would you like me to share them?

olix0r commented 2 years ago

Do you know if this command picks one destination pod at random or queries all pods?

Yeah, it's going to port forward to a random destination controller

I can send full client debug logs but this will be a sizeable file, how would you like me to share them?

Easiest would probably be to try to attach a gzip to the issue? Or you could try to isolate the ~minute around the target pod being deleted

dwilliams782 commented 2 years ago

OK, so the most recent example of this is the pod otel-agent-rlr8q, which, according to the metrics, died at approximately 14:56:

image

The IP for this pod was 10.225.4.3. The client proxy first errors at 14:59 (three minutes later!) and then continues in a connect / disconnect / recovering loop:

image

I've attached the logs from the client proxy showing connections to this IP - offer-price-index-service-ie-66bbbfcff-f8kbg-debug-logs.txt

Around the time the pod dies, at 14:56, I get errors from two out of three destination pods:


2022-07-22 14:56:18 linkerd-destination-8f9d689fc-s88zj time="2022-07-22T13:56:18Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-rlr8q: pod \"otel-agent-rlr8q\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
2022-07-22 14:56:18 linkerd-destination-8f9d689fc-sw9j9 time="2022-07-22T13:56:18Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-rlr8q: pod \"otel-agent-rlr8q\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent

Around this time, it also seems we have some shuffling around of linkerd-destination pods:

image

The two destination pods that restarted here were running on preemptible nodes; I assume the nodes were preempted at this time, so two new replicas came up. The otel-agent pod that died at 14:56 would also have been a casualty of the preemptible nodes disappearing.

I'm not sure if this helps or makes things more confusing 🤯 One of the destination pods has outlived the other two during this period because it was not on a preemptible node. I don't understand why the client is still, hours later, trying to connect to this dead IP.

dwilliams782 commented 2 years ago

I'm going to move the destination pods to non-preemptible nodes to hopefully get more stability, but I wouldn't expect this to be causing issues (unless the client proxy is caching this address for whatever reason, given that diagnostics gives the correct output).

dwilliams782 commented 2 years ago

Hi,

I wanted to follow up on this as we're still seeing this behaviour, and I've now caught it via diagnostics for the first time.

Log entry from our workload:

WARN ThreadId(01) outbound:accept{client.addr=10.225.0.10:39606}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.25.2:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s

Diagnostics thinks this pod IP is still valid:

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep 10.225.25.2
open-telemetry   10.225.25.2    4317   otel-agent-6ccl5   otel-agent.open-telemetry

However that pod doesn't exist:

>kubectl get pod otel-agent-6ccl5 -n open-telemetry -o wide
Error from server (NotFound): pods "otel-agent-6ccl5" not found

Looking at metrics for this, I can see the pod above stopped around 7am, but Linkerd is still sending traffic to it:

image

dwilliams782 commented 2 years ago

Just digging into destination logs, and I have a number of the following:

level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-v5md6: pod \"otel-agent-v5md6\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent

In the last 12 hours, this log entry has appeared for the following pods:

otel-agent-bzs9t
otel-agent-6qlt1 
otel-agent-6ccl5
otel-agent-ctrqf
otel-agent-sjw8b
otel-agent-v5md6

otel-agent is a DaemonSet, and the above pods all have one thing in common: they all ran on preemptible GKE nodes.

I wonder if there is a signal not being received when these nodes are preempted and removed?

dwilliams782 commented 2 years ago

I'm not sure if it's of any help but Google lists the preemption termination process here.


I've been digging further into this with the help of @mateiidavid this morning via a Slack thread here.

To start with, I'm just comparing the output of diagnostics to kubectl to prove there is a problem:

> k get pods -n open-telemetry --selector=component=otel-agent -o json | jq -r '.items[].metadata.name' | wc -l
28
> linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent | wc -l
29

Matei advised that I could use the destination-client script to retrieve the endpoints for otel-agent from each destination pod individually. The diagnostics command above simply port-forwards to the deployment, so you get a "random" response from one of your replicas. I wanted to confirm each destination pod's view of the world, so I ran:

k port-forward pod/<linkerd-destination-pod> 8086
go run controller/script/destination-client/main.go -method get -path otel-agent.open-telemetry.svc.cluster.local:4317

I ran this for each destination pod (I have three replicas in HA mode) and got wildly different results. I took the Get stream, exported it to a sheet, and used some regexes to get counts:

image

Note: I repeated this 3/4 times for each pod to make sure it wasn't a one-off.
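
For anyone wanting to repeat this, something like the following loop should capture each replica's view (untested sketch; assumes a local linkerd2 checkout and the standard linkerd.io/control-plane-component=destination label on the destination pods):

for pod in $(kubectl get pods -n linkerd -l linkerd.io/control-plane-component=destination -o jsonpath='{.items[*].metadata.name}'); do
  kubectl port-forward -n linkerd "pod/${pod}" 8086:8086 &
  pf=$!; sleep 2
  echo "=== ${pod} ==="
  # the client streams updates; capture a few seconds of output, then stop it
  timeout 10 go run controller/script/destination-client/main.go -method get -path otel-agent.open-telemetry.svc.cluster.local:4317 | tee "${pod}.endpoints.txt"
  kill "${pf}" 2>/dev/null
done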

In a nutshell, all current otel-agent pods exist in each destination pod's view, but so do a large number of pods that were deleted many hours ago. Each red-highlighted pod no longer exists but is still registered as a target.

The pods I've checked so far have had a corresponding log entry, which may or may not be relevant:

level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-2f6vm: pod \"otel-agent-2f6vm\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent

Interestingly, this pod, otel-agent-2f6vm, appears in the columns for two out of the three destination pods.

I have no idea how to proceed with this other than disabling HA mode and seeing if it improves. This is our major blocker to rolling out Linkerd to our production clusters now, and I'm reluctant to go to production without HA mode.

dwilliams782 commented 2 years ago

This seems to be quite easily reproducible by deleting a GKE node:

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-j79jl
open-telemetry   10.225.1.3      4317   otel-agent-j79jl   otel-agent.open-telemetry

>k get pod -n open-telemetry otel-agent-j79jl -o yaml | grep nodeName
nodeName: gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7

>k delete node gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7
node "gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7" deleted

> k get pod -n open-telemetry otel-agent-j79jl -w         
NAME               READY   STATUS    RESTARTS   AGE
otel-agent-j79jl   1/1     Running   0          6h50m
otel-agent-j79jl   1/1     Terminating   0          6h50m
otel-agent-j79jl   1/1     Terminating   0          6h50m

>k get pod -n open-telemetry otel-agent-j79jl   
Error from server (NotFound): pods "otel-agent-j79jl" not found

>k get node gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7
Error from server (NotFound): nodes "gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7" not found

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-j79jl
open-telemetry   10.225.1.3      4317   otel-agent-j79jl   otel-agent.open-telemetry

I have tried manually deleting two nodes with the above commands (one preemptible, one with standard provisioning) and both pods still appear in the diagnostics output. I've left it a good 30+ minutes and the pods are still appearing in the diag endpoints. Deleting otel-agent pods manually works: they disappear from the list immediately, as expected.
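
For anyone repeating this, something like the following makes the polling easier (just a sketch; substitute your own pod name):

watch -n 5 'linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-j79jl'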


Now, interestingly: deleting a node and monitoring a StatefulSet or Deployment pod does not cause this issue. The StatefulSet/Deployment pods disappear from the diagnostics immediately and reappear when the new pod reaches Running status.

I then tried repeating this exercise for a different daemonset in the same cluster, deleting a node and watching diagnostics. The pod disappeared immediately.

The difference between the two DaemonSets is that one uses a ClusterIP Service and the other uses a headless ClusterIP Service. The Service for otel-agent:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: opentelemetry
    component: otel-agent
  name: otel-agent
  namespace: open-telemetry
spec:
  clusterIP: 10.224.28.33
  clusterIPs:
  - 10.224.28.33
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: otlp
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  - name: metrics
    port: 8888
    protocol: TCP
    targetPort: 8888
  selector:
    component: otel-agent
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

So, I created a headless service to sit alongside this:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-08-10T12:06:46Z"
  labels:
    app: opentelemetry
    component: otel-agent
  name: otel-agent-headless
  namespace: open-telemetry
  resourceVersion: "1216759460"
  uid: 9da722c7-4c58-4080-ad48-1c4603808384
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: otlp
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  - name: metrics
    port: 8888
    protocol: TCP
    targetPort: 8888
  selector:
    component: otel-agent
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

and then repeated the same test:

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-pv58f
open-telemetry   10.225.3.2      4317   otel-agent-pv58f   otel-agent.open-telemetry

>linkerd-edge diagnostics endpoints otel-agent-headless.open-telemetry.svc.cluster.local:4317 | grep otel-agent-pv58f
open-telemetry   10.225.3.2      4317   otel-agent-pv58f   otel-agent-headless.open-telemetry

>k get pod otel-agent-pv58f -n open-telemetry -o yaml | grep nodeName
gke-gke-web-gb-dev-web-wi-n2-01-4b906099-pqg8

>k delete node gke-gke-web-gb-dev-web-wi-n2-01-4b906099-pqg8
node "gke-gke-web-gb-dev-web-wi-n2-01-4b906099-pqg8" deleted

>k get pod otel-agent-pv58f -n open-telemetry -w                     
NAME               READY   STATUS    RESTARTS   AGE
otel-agent-pv58f   1/1     Running   0          7m48s
otel-agent-pv58f   1/1     Terminating   0          8m34s
otel-agent-pv58f   1/1     Terminating   0          8m34s

>k get pod otel-agent-pv58f -n open-telemetry -w
Error from server (NotFound): pods "otel-agent-pv58f" not found

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-pv58f
open-telemetry   10.225.3.2      4317   otel-agent-pv58f   otel-agent.open-telemetry

>linkerd-edge diagnostics endpoints otel-agent-headless.open-telemetry.svc.cluster.local:4317 | grep otel-agent-pv58f

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | wc -l
34

>linkerd-edge diagnostics endpoints otel-agent-headless.open-telemetry.svc.cluster.local:4317 | wc -l
24

As you can see, the pod disappeared immediately from the diagnostics for the headless service, but it remains in those for the ClusterIP service.


I'm going to keep testing this headless vs non-headless theory with Deployments and StatefulSets, as I can't remember whether the Deployment and STS I tried earlier used headless or non-headless Services, but this is promising.
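
One more thing worth checking (sketch; these are standard Kubernetes fields) is whether the stale entry also shows up in the Kubernetes objects themselves, or only in the destination controller's view:

>kubectl get endpointslices -n open-telemetry -l kubernetes.io/service-name=otel-agent -o jsonpath='{range .items[*].endpoints[*]}{.targetRef.name}{"\n"}{end}' | sort

>kubectl get endpoints otel-agent -n open-telemetry -o jsonpath='{range .subsets[*].addresses[*]}{.targetRef.name}{"\n"}{end}' | sort

If neither of these still lists the dead pod but diagnostics does, that points at the destination controller's cache rather than at Kubernetes.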

dwilliams782 commented 2 years ago

I checked our development cluster this morning, which runs only one replica of the destination controller, and can confirm the same issue is present, so this is not HA-related.

kleimkuhler commented 2 years ago

Hey @dwilliams782 thanks for the investigation and updates on this. I attempted to reproduce this today without luck. I'm going to read through things again to make sure I didn't miss anything, but I'll leave my reproduction steps below.

It'd be helpful if you can test this on GKE, as I'm using k3d and not seeing any issues.


Setup

I'm using the following nginx DaemonSet with a ClusterIP service

apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
spec:
  selector:
    app: nginx-ds
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-ds
spec:
  selector:
    matchLabels:
      app: nginx-ds
  template:
    metadata:
      labels:
        app: nginx-ds
      annotations:
        linkerd.io/inject: enabled
    spec:
      containers:
        - name: nginx
          image: nginx

I have Linkerd installed all on the same node

$ kubectl get -n linkerd pods -o wide
NAME                                      READY   STATUS    RESTARTS   AGE   IP           NODE                       NOMINATED NODE   READINESS GATES
linkerd-proxy-injector-77f864cf57-8nfd9   2/2     Running   0          56s   10.42.1.16   k3d-k3s-default-server-0   <none>           <none>
linkerd-identity-856d7bd7c-fj52r          2/2     Running   0          56s   10.42.1.15   k3d-k3s-default-server-0   <none>           <none>
linkerd-destination-8cbf85b67-2jgf6       4/4     Running   0          56s   10.42.1.14   k3d-k3s-default-server-0   <none>           <none>

I create a second node and then begin the test

$ k3d node create test
...

$ kubectl get nodes
NAME                       STATUS     ROLES                  AGE   VERSION
k3d-k3s-default-server-0   Ready      control-plane,master   15m   v1.22.6+k3s1
k3d-test-0                 NotReady   <none>                 3s    v1.22.6+k3s1

Attempted Reproduction

Create the nginx Daemonset and Service and check Linkerd's view

$ kubectl apply -f ~/Projects/tmp/nginx-daemonset.yaml
service/nginx-svc created
daemonset.apps/nginx-ds created

$ kubectl get pods -o wide
NAME             READY   STATUS    RESTARTS   AGE   IP           NODE                       NOMINATED NODE   READINESS GATES
nginx-ds-x5w55   2/2     Running   0          99s   10.42.1.17   k3d-k3s-default-server-0   <none>           <none>
nginx-ds-jfh9b   2/2     Running   0          97s   10.42.2.2    k3d-test-0                 <none>           <none>

$ linkerd diagnostics endpoints nginx-svc.default.svc.cluster.local:80
NAMESPACE   IP           PORT   POD              SERVICE
default     10.42.1.17   80     nginx-ds-x5w55   nginx-svc.default
default     10.42.2.2    80     nginx-ds-jfh9b   nginx-svc.default

Delete k3d-test-0 and wait for it to be reflected with kubectl

$ kubectl delete node k3d-test-0 
node "k3d-test-0" deleted

$ kubectl get pods
NAME             READY   STATUS    RESTARTS   AGE
nginx-ds-x5w55   2/2     Running   0          111s

Instantly reflected through Linkerd

$ linkerd diagnostics endpoints nginx-svc.default.svc.cluster.local:80
NAMESPACE   IP           PORT   POD              SERVICE
default     10.42.1.17   80     nginx-ds-x5w55   nginx-svc.default

Let me know if you think I'm missing anything obvious, but from your more recent comments I think everything is there and this is not running Linkerd in HA mode. nginx-svc is a ClusterIP Service and I'm deleting a Node that is hosting a DaemonSet replica.

kleimkuhler commented 2 years ago

To be clear, I do think there is a bug here in the destination service given that you can create two Services that target the same DaemonSet, and one stays up to date while the other returns stale endpoints; the issue now is reducing this down to a more minimal example so that the bug is easier to track down.

As I stated above, it'd be great if you could try the same reproduction as mine on GKE and let me know if you see anything different. I'd be surprised if GKE were the culprit here, but it could be something different about how Pods are de-scheduled from Nodes that are shutting down compared to k3d (or something else).

The log line that you are seeing, `unable to fetch pod ...: pod ... not found`, is definitely relevant. This occurs in the destination controller when it tries to update its view of endpoints. It tries to look up a Pod with the k8s API and, expecting to find it, fails and returns this error. It seems like this should lead to the address being cleared, but that doesn't appear to be happening. If you're unable to see any more logs than that, it may be worth adding some additional log lines and testing with that.
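
In the meantime, pulling that error out of all of the destination replicas at once might make the pattern clearer (sketch; assumes the standard linkerd.io/control-plane-component=destination label and that the container is named destination):

$ kubectl logs -n linkerd -l linkerd.io/control-plane-component=destination -c destination --prefix --tail=1000 | grep "Unable to create new address"

That should show whether every replica hits the failed lookup for the same pod, or only some of them.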

dwilliams782 commented 2 years ago

Hi Kevin,

Thanks for attempting to recreate this. Like you, I have been unable to reproduce it with any other resource. I've tried Deployments, StatefulSets, even additional DaemonSets with both headless and ClusterIP Services, and cannot reproduce this outside of the affected otel-agent DaemonSet.

A colleague this morning pointed out that there are multiple ports exposed on the service and daemonset, so I tried diagnostics for all ports:

>k get pods -n open-telemetry --selector=component=otel-agent -o json | jq -r '.items[].metadata.name' | wc -l
24

>linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent | wc -l
31

>linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:9411 | grep otel-agent | wc -l
24

>linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:8888 | grep otel-agent | wc -l
24

>linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local | grep otel-agent | wc -l
24

You can see from this that every other service:port combination is correct, and even omitting the port gives a correct result, yet the output for 4317 is wrong.

eugenepaniot commented 2 years ago

A quick update from me. I have tried to fix it, but my knowledge of Linkerd internals isn't sufficient; the code is quite complex, with caching and multiple threads involved.

So, Linkerd syncs addresses from endpointslices.discovery.k8s.io (with -enable-endpoint-slices=true), which has a record like the following (let's say this is our initial state):

...
  targetRef:
    kind: Pod
    name: otel-agent-9j8n5
    namespace: open-telemetry
...

Once the DaemonSet is restarted with kubectl -n open-telemetry rollout restart daemonset.apps/otel-agent (so the EndpointSlices change), Linkerd reconciles it but apparently doesn't clean up the previous record otel-agent-9j8n5 (this may be related to the diffAddresses function, a skipped continue somewhere, or a swallowed error; it seems the code simply never reaches the delete path):

...
linkerd-destination-5687bc5cdb-bv9l4 destination time="2022-08-15T14:15:34Z" level=debug msg="Updating EndpointSlice for open-telemetry/otel-agent" addr=":8086" component=service-publisher ns=open-telemetry svc=otel-agent
...
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice port.id: /
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice port.targetPort: otlp
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice port.srcPort: 4317
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice oldSlice otel-agent-w4mq9
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice newSlice otel-agent-w4mq9

...

linkerd-destination-5687bc5cdb-bv9l4 destination E0815 14:15:37.068327       1 runtime.go:77] Observed a panic: pod "otel-agent-9j8n5" not found
linkerd-destination-5687bc5cdb-bv9l4 destination goroutine 142 [running]:
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1a11c20?, 0xc00443ec80})
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/runtime/runtime.go:75 +0x99
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x0?})
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/runtime/runtime.go:49 +0x75
linkerd-destination-5687bc5cdb-bv9l4 destination panic({0x1a11c20, 0xc00443ec80})
linkerd-destination-5687bc5cdb-bv9l4 destination    /usr/local/go/src/runtime/panic.go:838 +0x207
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).newPodRefAddress(0xc001c555e0, 0x10dd, {0xc001e888a0, 0x0?}, {0xc001e888d0, 0x10}, {0xc001e888c0, 0xe})
linkerd-destination-5687bc5cdb-bv9l4 destination    /linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:858 +0x2bc
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).endpointSliceToAddresses(0xc001c555e0, 0xc0033abe40)
linkerd-destination-5687bc5cdb-bv9l4 destination    /linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:756 +0x89c
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).updateEndpointSlice(0xc001c555e0, 0xc0033abe40, 0xc001de66e0)
linkerd-destination-5687bc5cdb-bv9l4 destination    /linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:645 +0x269
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*servicePublisher).updateEndpointSlice(0xc00095df40, 0xc002c2da90?, 0x43ae16?)
linkerd-destination-5687bc5cdb-bv9l4 destination    /linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:452 +0x2cb
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateEndpointSlice(0xc0006b5800, {0x1d83c60?, 0xc0033abe40?}, {0x1d83c60?, 0xc001de66e0?})
linkerd-destination-5687bc5cdb-bv9l4 destination    /linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:332 +0x10b
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/controller.go:239
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/client-go/tools/cache.(*processorListener).run.func1()
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/shared_informer.go:816 +0x137
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:155 +0x3e
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00006df38?, {0x207ba20, 0xc0006ecf90}, 0x1, 0xc0007885a0)
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:156 +0xb6
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc00006df88?)
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:133 +0x89
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.Until(...)
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:90
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/client-go/tools/cache.(*processorListener).run(0xc0001dbb80?)
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/shared_informer.go:812 +0x6b
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:73 +0x5a
linkerd-destination-5687bc5cdb-bv9l4 destination created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
linkerd-destination-5687bc5cdb-bv9l4 destination    /go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:71 +0x85
...

How to reproduce:

  1. Make sure enable-endpoint-slices=true.
  2. Continuously restart an existing DaemonSet/Deployment/StatefulSet until the unable to fetch pod error is thrown, waiting for each restart to complete (a rough loop for this is sketched below).
  3. Check for differences with linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317.
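
A rough loop for step 2 (untested sketch; assumes the standard linkerd.io/control-plane-component=destination label and that the container is named destination):

while true; do
  kubectl -n open-telemetry rollout restart daemonset.apps/otel-agent
  kubectl -n open-telemetry rollout status daemonset.apps/otel-agent
  # stop once a destination replica logs the failed pod lookup
  kubectl logs -n linkerd -l linkerd.io/control-plane-component=destination -c destination --tail=200 \
    | grep -q "unable to fetch pod" && break
done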

Open questions:

When and why does the pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: forcing resync path execute? And why has it never been executed here?

dwilliams782 commented 2 years ago

For context, Eugene is my (far more experienced!) colleague and has been looking into this today. We have disabled endpointSlices via enableEndpointSlices: false and will continue to monitor.

Edit: I thought I'd add that I just noticed the stable 2.11 release sets endpoint slices to false by default, while the edge release sets them to true.
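
For reference, the change should be roughly equivalent to something like this (sketch; we actually set it through our own values files, and I'm assuming --set passes straight through to the enableEndpointSlices chart value):

>linkerd-edge upgrade --set enableEndpointSlices=false | kubectl apply -f -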

kleimkuhler commented 2 years ago

@eugenepaniot In the destination logs that you posted above, had you made any code changes, or are they from an edge release version? I see the original issue was using edge-22.7.1... is that still the case? I'm trying to make sense of the panic you posted above and the line numbers don't seem to match what I'd expect.

In regards to the reproduction, I attempted this again with the nginx-ds DaemonSet that I posted above here and haven't managed to get the destination component into a bad state. I've verified -enable-endpoint-slices=true and I've restarted the DaemonSet multiple times without issue.

Have you still only seen this occur with the otel-agent Pods? It'd be great if you could at least reproduce this with the DaemonSet I posted above; otherwise, can you provide a little more information about what might make these Pods unique from others? If you can provide manifests for a reproduction, that would be greatly helpful.

eugenepaniot commented 2 years ago

@kleimkuhler yes, I added some debugging (the panic, for example) to check arguments and catch the issue, and so on. I used the main branch for it. There is nothing unique about the otel-agent application, except perhaps the quantity of pods (~25). I will try to reproduce it with nginx-ds, though.

dwilliams782 commented 2 years ago

@kleimkuhler Thanks for the new --destination-pod flag feature in #9200. This was something I suggested in Slack and Matei raised for me. Awesome.

We've been unable to reproduce this with anything other than this one particular daemonset, service and port combination. Every other port for that service reported correctly.

Since we last posted, this daemonset has undergone a big version bump, and we also removed another deployment/service in the same namespace which also used the same port (4317) but had different selectors. This should make zero difference, but since these changes, I've been unable to reproduce.

We've now replaced this DaemonSet with a Deployment (massively reducing the replica count) and re-enabled endpoint slices, and it looks to be stable so far.

There definitely is a bug here somewhere, but I have no idea what more I can do to try and reproduce it. I may well spin up a new GKE cluster with only Linkerd and the same setup of otel we had when we encountered this bug, but that will need to wait until I have some time.

kleimkuhler commented 2 years ago

Ok thanks for the update on this. Sounds like a lot has a changed on your end; glad to hear it sounds like things are stable for you now but also very interested to hear if you end up being able to reproduce this! I'll keep this open for now and let me know if anything changes on your end.