linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

SR inconsistency in destination metrics #11066

Closed someone-stole-my-name closed 10 months ago

someone-stole-my-name commented 1 year ago

What is the issue?

While debugging https://github.com/linkerd/linkerd2/issues/11065 I found an inconsistency. With v2.13, these failed requests do not count towards the success rate of the destination controller, whereas in earlier versions (e.g. v2.11) they did.

How can it be reproduced?

Same as https://github.com/linkerd/linkerd2/issues/11065: apply the manifests from the gist and compare the results on 2.13 and 2.11.

Logs, error output, etc

With 2.11:

[Images: 2.11-prom, 2.11-viz]

With 2.13:

[Images: 2.13-prom, 2.13-viz]

output of linkerd check -o short

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.13.4 but the latest stable version is 2.13.5
    see https://linkerd.io/2.13/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.13.4 but the latest stable version is 2.13.5
    see https://linkerd.io/2.13/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-6b566cf687-mzs2w (stable-2.13.4)
        * linkerd-identity-77bbfc58bb-mgrwh (stable-2.13.4)
        * linkerd-proxy-injector-6f5b6c8798-nw9f5 (stable-2.13.4)
    see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
        * metrics-api-59c76c4d75-vqkw5 (stable-2.13.4)
        * prometheus-b7b44d965-dpflx (stable-2.13.4)
        * tap-7c8fb95758-tn5zb (stable-2.13.4)
        * tap-injector-586d58cf8f-x8t9r (stable-2.13.4)
        * web-7cf5484879-9g888 (stable-2.13.4)
    see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cp-version for hints

Status check results are √


stale[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

adleong commented 11 months ago

Thanks for the detailed info @someone-stole-my-name. I've confirmed this issue with the repro steps you provided. Digging a little deeper, there appears to be an inconsistency between the proxy metrics (the data source for the success rate in the octopus graph) and the tap events (the data source for the live calls table).

Looking at an individual tap from the destination controller's proxy we have:

req id=4:1 proxy=in  src=10.244.0.94:58162 dst=10.244.0.96:8086 tls=true :method=POST :authority=linkerd-dst-headless.linkerd.svc.cluster.local:8086 :path=/io.linkerd.proxy.destination.Destination/GetProfile
rsp id=4:1 proxy=in  src=10.244.0.94:58162 dst=10.244.0.96:8086 tls=true :status=200 latency=589µs
end id=4:1 proxy=in  src=10.244.0.94:58162 dst=10.244.0.96:8086 tls=true grpc-status=Unknown duration=5µs response-length=0B

We can see that the HTTP status is 200 and the grpc-status is Unknown (gRPC status code 2). The live calls table interprets this as a failure because of the grpc-status.
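As an illustration of the classification described above, here is a minimal Python sketch of a gRPC-aware response classifier. It is not Linkerd's implementation: the function name `classify_response`, its parameters, and the fallback rule (treat HTTP 5xx as failure when no `grpc-status` is present) are assumptions made for the example. It looks for `grpc-status` in both trailers and response headers because the JSON tap in this thread shows the status arriving in the response headers with `"trailers": null`.

```python
from typing import Mapping, Optional

GRPC_OK = 0  # gRPC status code for OK; 2 is Unknown


def classify_response(
    http_status: int,
    headers: Mapping[str, str],
    trailers: Optional[Mapping[str, str]] = None,
) -> str:
    """Return "success" or "failure" for a single response (hypothetical helper)."""
    grpc_status = None
    # grpc-status normally arrives in trailers, but a response that ends
    # immediately may carry it in the response headers instead.
    for source in (trailers or {}, headers):
        if "grpc-status" in source:
            grpc_status = int(source["grpc-status"])
            break

    if grpc_status is not None:
        return "success" if grpc_status == GRPC_OK else "failure"

    # Assumed fallback for non-gRPC responses: HTTP 5xx counts as a failure.
    return "failure" if http_status >= 500 else "success"
```

Called with the values from the tap above, `classify_response(200, {"grpc-status": "2"}, None)` returns `"failure"` even though the HTTP status is 200.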

However, if we look at the proxy metrics from the destination controller's proxy we see:

response_total{direction="inbound",authority="linkerd-dst-headless.linkerd.svc.cluster.local:8086",target_addr="10.244.0.96:8086",target_ip="10.244.0.96",target_port="8086",tls="true",client_id="default.default.serviceaccount.identity.linkerd.cluster.local",srv_group="",srv_kind="default",srv_name="all-unauthenticated",route_group="",route_kind="default",route_name="default",authz_group="",authz_kind="default",authz_name="all-unauthenticated",status_code="200",classification="success",grpc_status="",error=""} 4133

In the response_total metric, the classification is success and no grpc_status is recorded.
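To make the effect on the success rate concrete, the following Python sketch computes a success rate from `response_total` samples in Prometheus text format by dividing the count labeled `classification="success"` by the total. The helper name `success_rate` and the regex-based parsing are assumptions for illustration only; this is not the query the viz extension runs.

```python
import re
from collections import defaultdict

# Matches one sample line of the form: name{label="value",...} 123
_SAMPLE = re.compile(r'^(?P<name>\w+)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def success_rate(metrics_text, metric="response_total"):
    """Return success / total for the given counter, or None if there are no samples."""
    counts = defaultdict(float)
    for line in metrics_text.splitlines():
        m = _SAMPLE.match(line.strip())
        if not m or m.group("name") != metric:
            continue  # skips comments, blank lines, and other metrics
        labels = dict(_LABEL.findall(m.group("labels")))
        counts[labels.get("classification", "")] += float(m.group("value"))

    total = sum(counts.values())
    return counts["success"] / total if total else None
```

Fed only the sample shown above (4133 responses, all labeled `classification="success"`), this returns `1.0`, which is consistent with the failed GetProfile calls not lowering the reported success rate.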

I plan to investigate further to figure out why the classification and grpc_status on this metric don't reflect what the tap output shows.

Full tap in JSON format with headers:

```
{
  "source": {
    "ip": "10.244.0.94",
    "port": 58162,
    "metadata": { "client_id": "default.default.serviceaccount.identity.linkerd.cluster.local", "control_plane_ns": "linkerd", "deployment": "server", "namespace": "default", "pod": "server-77975c9645-zt6sr", "pod_template_hash": "77975c9645", "serviceaccount": "default", "tls": "true" }
  },
  "destination": {
    "ip": "10.244.0.96",
    "port": 8086,
    "metadata": { "authz_group": "", "authz_kind": "default", "authz_name": "all-unauthenticated", "control_plane_ns": "linkerd", "deployment": "linkerd-destination", "namespace": "linkerd", "pod": "linkerd-destination-5fd9dc8fd8-8d695", "pod_template_hash": "5fd9dc8fd8", "route_group": "", "route_kind": "default", "route_name": "default", "serviceaccount": "linkerd-destination", "srv_group": "", "srv_kind": "default", "srv_name": "all-unauthenticated", "tls": "loopback" }
  },
  "routeMeta": null,
  "proxyDirection": "INBOUND",
  "requestInitEvent": {
    "id": { "base": 5, "stream": 0 },
    "method": "POST",
    "scheme": "HTTP",
    "authority": "linkerd-dst-headless.linkerd.svc.cluster.local:8086",
    "path": "/io.linkerd.proxy.destination.Destination/GetProfile",
    "headers": [
      { "name": ":method", "valueStr": "POST" },
      { "name": ":scheme", "valueStr": "http" },
      { "name": ":authority", "valueStr": "linkerd-dst-headless.linkerd.svc.cluster.local:8086" },
      { "name": ":path", "valueStr": "/io.linkerd.proxy.destination.Destination/GetProfile" },
      { "name": "te", "valueStr": "trailers" },
      { "name": "content-type", "valueStr": "application/grpc" },
      { "name": "l5d-client-id", "valueStr": "default.default.serviceaccount.identity.linkerd.cluster.local" }
    ]
  }
}
{
  "source": {
    "ip": "10.244.0.94",
    "port": 58162,
    "metadata": { "client_id": "default.default.serviceaccount.identity.linkerd.cluster.local", "control_plane_ns": "linkerd", "deployment": "server", "namespace": "default", "pod": "server-77975c9645-zt6sr", "pod_template_hash": "77975c9645", "serviceaccount": "default", "tls": "true" }
  },
  "destination": {
    "ip": "10.244.0.96",
    "port": 8086,
    "metadata": { "authz_group": "", "authz_kind": "default", "authz_name": "all-unauthenticated", "control_plane_ns": "linkerd", "deployment": "linkerd-destination", "namespace": "linkerd", "pod": "linkerd-destination-5fd9dc8fd8-8d695", "pod_template_hash": "5fd9dc8fd8", "route_group": "", "route_kind": "default", "route_name": "default", "serviceaccount": "linkerd-destination", "srv_group": "", "srv_kind": "default", "srv_name": "all-unauthenticated", "tls": "loopback" }
  },
  "routeMeta": null,
  "proxyDirection": "INBOUND",
  "responseInitEvent": {
    "id": { "base": 5, "stream": 0 },
    "sinceRequestInit": { "nanos": 587923 },
    "httpStatus": 200,
    "headers": [
      { "name": ":status", "valueStr": "200" },
      { "name": "content-type", "valueStr": "application/grpc" },
      { "name": "grpc-status", "valueStr": "2" },
      { "name": "grpc-message", "valueStr": "failed to get pod for hostname 10-244-0-94: no pod found in Endpoints default/server for hostname 10-244-0-94" }
    ]
  }
}
{
  "source": {
    "ip": "10.244.0.94",
    "port": 58162,
    "metadata": { "client_id": "default.default.serviceaccount.identity.linkerd.cluster.local", "control_plane_ns": "linkerd", "deployment": "server", "namespace": "default", "pod": "server-77975c9645-zt6sr", "pod_template_hash": "77975c9645", "serviceaccount": "default", "tls": "true" }
  },
  "destination": {
    "ip": "10.244.0.96",
    "port": 8086,
    "metadata": { "authz_group": "", "authz_kind": "default", "authz_name": "all-unauthenticated", "control_plane_ns": "linkerd", "deployment": "linkerd-destination", "namespace": "linkerd", "pod": "linkerd-destination-5fd9dc8fd8-8d695", "pod_template_hash": "5fd9dc8fd8", "route_group": "", "route_kind": "default", "route_name": "default", "serviceaccount": "linkerd-destination", "srv_group": "", "srv_kind": "default", "srv_name": "all-unauthenticated", "tls": "loopback" }
  },
  "routeMeta": null,
  "proxyDirection": "INBOUND",
  "responseEndEvent": {
    "id": { "base": 5, "stream": 0 },
    "sinceRequestInit": { "nanos": 598724 },
    "sinceResponseInit": { "nanos": 10801 },
    "responseBytes": 0,
    "trailers": null,
    "grpcStatusCode": 2
  }
}
```

olix0r commented 10 months ago

This was fixed in edge-23.11.1.