linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Proxy trying to connect to no-longer available endpoints #12781

Open peterhuberit opened 5 days ago

peterhuberit commented 5 days ago

What is the issue?

Occasionally, some requests from our pods fail with a 504 error, which indicates that the request is trying to reach an unavailable (or no longer available) IP address. The issue does not occur when the Linkerd mesh is not in use. Restarting the affected pods resolves it. The problem looks similar to the one linked below, but since we are using a newer version of Linkerd, it could be something else: https://github.com/linkerd/linkerd2/issues/6842

How can it be reproduced?

We have not been able to reproduce it yet. It happens from time to time, but we don't know what causes it.

Logs, error output, etc

A request to the pods of the config-service service failed with a 504 error because the destination IP (in this case 100.66.27.231) is no longer available. Output of the linkerd tap command on the source pod:

req id=103:3 proxy=in  src=100.66.26.82:50966 dst=100.66.29.176:8080 tls=true :method=GET :authority=172.20.183.211 :path=/v1/configurations
req id=103:4 proxy=out src=100.66.29.176:43828 dst=100.66.27.231:8080 tls=not_provided_by_service_discovery :method=GET :authority=100.66.27.231:8080 :path=/config
rsp id=103:4 proxy=out src=100.66.29.176:43828 dst=100.66.27.231:8080 tls=not_provided_by_service_discovery :status=504 latency=1001350µs
end id=103:4 proxy=out src=100.66.29.176:43828 dst=100.66.27.231:8080 tls=not_provided_by_service_discovery duration=15µs response-length=0B
rsp id=103:3 proxy=in  src=100.66.26.82:50966 dst=100.66.29.176:8080 tls=true :status=500 latency=1014335µs
end id=103:3 proxy=in  src=100.66.26.82:50966 dst=100.66.29.176:8080 tls=true duration=3105µs response-length=191B
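
For anyone trying to catch this pattern in a longer capture, the following is a minimal sketch, not part of the original report, of how tap lines in the format shown above could be filtered for failing outbound responses. The function names `parse_tap_line` and `failed_outbound_destinations` are hypothetical, and the parser assumes every field is a whitespace-separated `key=value` token, as in the sample lines.

```python
import re

# Assumption: each tap line starts with an event kind (req, rsp or end)
# followed by whitespace-separated key=value tokens, as in the output above.
TAP_LINE = re.compile(r"^(?P<kind>req|rsp|end)\s+(?P<fields>.*)$")


def parse_tap_line(line):
    """Split one tap line into its event kind and key=value fields."""
    match = TAP_LINE.match(line.strip())
    if match is None:
        return None
    event = {"kind": match.group("kind")}
    for token in match.group("fields").split():
        key, sep, value = token.partition("=")
        if sep:
            event[key] = value
    return event


def failed_outbound_destinations(lines, min_status=500):
    """Return (dst, status) for outbound responses with status >= min_status."""
    failures = []
    for line in lines:
        event = parse_tap_line(line)
        if not event or event["kind"] != "rsp" or event.get("proxy") != "out":
            continue
        status = int(event.get(":status", 0))
        if status >= min_status:
            failures.append((event["dst"], status))
    return failures


# Two response lines copied from the tap output in this report.
sample = [
    "rsp id=103:4 proxy=out src=100.66.29.176:43828 dst=100.66.27.231:8080 "
    "tls=not_provided_by_service_discovery :status=504 latency=1001350µs",
    "rsp id=103:3 proxy=in  src=100.66.26.82:50966 dst=100.66.29.176:8080 "
    "tls=true :status=500 latency=1014335µs",
]

print(failed_outbound_destinations(sample))
```

Only the outbound 504 is reported; the inbound 500 is the same failure surfacing to the caller, so it is skipped.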

The IP 100.66.27.231 does not exist anywhere in the cluster, not just in config-service or its namespace. We checked all pod, service, and node IPs, and none of them matched this address at the time of the error.

Kubernetes endpoints for config-service:

kubectl get endpoints config-service -o json | jq ".subsets[0].addresses[] | .ip"
"100.66.24.216"
"100.66.27.242"
"100.66.28.41"

config-service linkerd endpoints:

NAMESPACE   IP              PORT   POD                               SERVICE
uat01       100.66.27.242   8080   config-service-64f874ff57-wv9lb   config-service.uat01
uat01       100.66.28.41    8080   config-service-64f874ff57-gfbl2   config-service.uat01
uat01       100.66.24.216   8080   config-service-64f874ff57-p444r   config-service.uat01
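
The comparison above can also be scripted. The sketch below is an illustration rather than the reporter's procedure: it reads an Endpoints object in the JSON shape returned by `kubectl get endpoints config-service -o json`, collects the IPs from every subset (the jq filter above inspects only `subsets[0]`), and checks whether a tap destination is among them. The helper names `endpoint_ips` and `is_stale` are hypothetical, and the sample JSON is hand-built from the three IPs listed in this report, not real cluster output.

```python
import json


def endpoint_ips(endpoints_json):
    """Collect ready and not-ready IPs from every subset of an Endpoints object."""
    doc = json.loads(endpoints_json)
    ips = set()
    for subset in doc.get("subsets") or []:
        for key in ("addresses", "notReadyAddresses"):
            for addr in subset.get(key) or []:
                ips.add(addr["ip"])
    return ips


def is_stale(dst, known_ips):
    """True if the host part of an ip:port destination is not a known endpoint."""
    host = dst.rsplit(":", 1)[0]
    return host not in known_ips


# Assumption: a reduced Endpoints object containing only the fields used here.
sample_endpoints = json.dumps({
    "subsets": [{
        "addresses": [
            {"ip": "100.66.24.216"},
            {"ip": "100.66.27.242"},
            {"ip": "100.66.28.41"},
        ]
    }]
})

known = endpoint_ips(sample_endpoints)
print(is_stale("100.66.27.231:8080", known))  # destination seen in tap
print(is_stale("100.66.27.242:8080", known))  # a current endpoint
```

With the data from this report, the tap destination is flagged as stale while the current endpoint is not, which matches the manual check.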

output of linkerd check -o short

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2024-06-28T04:01:32Z
    see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2024-06-27T03:17:53Z
    see https://linkerd.io/2/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2024-06-27T03:17:10Z
    see https://linkerd.io/2/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2024-06-27T03:17:53Z
    see https://linkerd.io/2/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ can determine the latest version
    Get "https://versioncheck.linkerd.io/version.json?version=edge-24.3.4&uuid=0b1baa44-cadd-4e23-a446-35219f6b800c&source=cli": stream error: stream ID 1; NO_ERROR; received from peer
    see https://linkerd.io/2/checks/#l5d-version-latest for hints
‼ cli is up-to-date
    unable to determine version channel
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    unable to determine version channel
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.3.2 but cli running edge-24.3.4
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    unable to determine version channel
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-6ffdcb5dc7-xpsgj running edge-24.3.2 but cli running edge-24.3.4
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
    certificate will expire on 2024-06-27T03:28:02Z
    see https://linkerd.io/2/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ viz extension proxies are up-to-date
    Get "https://versioncheck.linkerd.io/version.json?version=edge-24.3.4&uuid=unknown&source=cli": stream error: stream ID 1; NO_ERROR; received from peer
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-76499b55cc-5p47g running edge-24.3.2 but cli running edge-24.3.4
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Kubernetes Version: v1.29.4-eks-036c24b
Cluster Environment: AWS
Linkerd version: edge-24.3.2

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

andrewdinunzio commented 3 days ago

I think we are seeing this issue as well in 2024.5.5.