Closed someone-stole-my-name closed 10 months ago
Thanks for the detailed info @someone-stole-my-name. I've confirmed this issue with the repro steps you provided. Digging a little deeper, there seems to be an inconsistency between the proxy metrics (the data source for the success rate in the octopus graph) and the tap events (the data source for the live calls table).
Looking at an individual tap from the destination controller's proxy we have:
req id=4:1 proxy=in src=10.244.0.94:58162 dst=10.244.0.96:8086 tls=true :method=POST :authority=linkerd-dst-headless.linkerd.svc.cluster.local:8086 :path=/io.linkerd.proxy.destination.Destination/GetProfile
rsp id=4:1 proxy=in src=10.244.0.94:58162 dst=10.244.0.96:8086 tls=true :status=200 latency=589µs
end id=4:1 proxy=in src=10.244.0.94:58162 dst=10.244.0.96:8086 tls=true grpc-status=Unknown duration=5µs response-length=0B
We can see that the HTTP status is 200 and the grpc-status is Unknown (gRPC status code 2). The live calls table interprets this as a failure because of the grpc-status.
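To make the tap-side behavior concrete, here is a minimal Python sketch of that kind of classification. It is not the dashboard's actual code: `parse_tap_line` and `is_failure` are hypothetical helpers, and the rule (fail on HTTP 5xx or on any grpc-status other than `OK`) is an assumption based on the behavior described above.

```python
import re

# Key/value fields on a tap line, e.g. ":status=200" or "grpc-status=Unknown".
FIELD_RE = re.compile(r"(\S+?)=(\S+)")

def parse_tap_line(line):
    """Split a tap line into its event kind (req/rsp/end) and its fields."""
    kind, _, rest = line.partition(" ")
    return kind, dict(FIELD_RE.findall(rest))

def is_failure(rsp_fields, end_fields):
    """Assumed rule: fail on HTTP 5xx or on any grpc-status other than OK."""
    http_status = int(rsp_fields.get(":status", "0"))
    grpc_status = end_fields.get("grpc-status")
    if http_status >= 500:
        return True
    return grpc_status is not None and grpc_status != "OK"

# The rsp and end events from the tap output above.
RSP_LINE = ("rsp id=4:1 proxy=in src=10.244.0.94:58162 dst=10.244.0.96:8086 "
            "tls=true :status=200 latency=589µs")
END_LINE = ("end id=4:1 proxy=in src=10.244.0.94:58162 dst=10.244.0.96:8086 "
            "tls=true grpc-status=Unknown duration=5µs response-length=0B")

_, rsp_fields = parse_tap_line(RSP_LINE)
_, end_fields = parse_tap_line(END_LINE)
print(is_failure(rsp_fields, end_fields))
```

Under this assumed rule the request counts as a failure even though `:status` is 200, because the `end` event carries a non-OK grpc-status.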
However, if we look at the proxy metrics from the destination controller's proxy we see:
response_total{direction="inbound",authority="linkerd-dst-headless.linkerd.svc.cluster.local:8086",target_addr="10.244.0.96:8086",target_ip="10.244.0.96",target_port="8086",tls="true",client_id="default.default.serviceaccount.identity.linkerd.cluster.local",srv_group="",srv_kind="default",srv_name="all-unauthenticated",route_group="",route_kind="default",route_name="default",authz_group="",authz_kind="default",authz_name="all-unauthenticated",status_code="200",classification="success",grpc_status="",error=""} 4133
In the response_total metric, we see that the classification is success and no grpc_status is recorded.
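For anyone who wants to check the same labels on their own proxy metrics, a small Python sketch like the following can pull them out of a single sample line. `parse_sample` is a hypothetical helper, the sample is an abbreviated copy of the line above (most labels omitted), and the regex assumes label values contain no escaped quotes.

```python
import re

# Matches label pairs such as classification="success".
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_sample(line):
    """Return (metric name, labels dict, value) for one metrics line."""
    name, _, rest = line.partition("{")
    labels_part, _, value = rest.rpartition("}")
    return name, dict(LABEL_RE.findall(labels_part)), float(value)

# Abbreviated copy of the response_total sample above.
SAMPLE = ('response_total{direction="inbound",status_code="200",'
          'classification="success",grpc_status="",error=""} 4133')

name, labels, value = parse_sample(SAMPLE)
print(name, labels["classification"], repr(labels["grpc_status"]), value)
```

An empty `grpc_status` label together with `classification="success"` is the mismatch with the tap output: the same responses that tap reports as grpc-status Unknown are counted as successes here.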
I plan to investigate further to figure out why the classification and grpc_status on this metric aren't consistent with the tap output.
This was fixed in edge-23.11.1.
What is the issue?
While debugging https://github.com/linkerd/linkerd2/issues/11065 I found something weird. When using v2.13 these failed requests do not count towards the success rate of destination, while in previous versions e.g. v2.11 they did.
How can it be reproduced?
Same as https://github.com/linkerd/linkerd2/issues/11065, add the manifests from the gist and see the difference when using 2.13 vs 2.11
Logs, error output, etc
With 2.11:
With 2.13:
Output of linkerd check -o short
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None