linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Outbound HTTP endpoint metrics seem to miscount "ready" endpoints with circuit breaking #12961

Open kflynn opened 2 months ago

kflynn commented 2 months ago

What is the issue?

I was trying to set up a Grafana dashboard to show circuit-breaking behavior with the Faces demo: the Faces GUI calls through Emissary to the face workload, which is the entry point of the demo. I intentionally break things by adding a face2 Deployment that always fails, and setting things up so that the face Service spans Pods created by both the face and face2 Deployments.
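For concreteness, one way to make a Service span two Deployments looks roughly like this; the label name is hypothetical, since the actual Faces demo may wire this up differently:

# Hypothetical label, for illustration only; the real Faces demo selectors may
# differ. The idea: give face2's Pods the same label the face Service selects
# on, so the Service's endpoints span both Deployments.
kubectl -n faces patch deploy face2 \
  -p '{"spec":{"template":{"metadata":{"labels":{"service":"face"}}}}}'

# Confirm the face Service now has endpoints from both Deployments.
kubectl -n faces get endpointslices -l kubernetes.io/service-name=face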

At this point, you can do PromQL queries and see

2024-08-14T18:29:26.957000: emissary.emissary -> face.faces (pending): 0
2024-08-14T18:29:26.957000: emissary.emissary -> face.faces (ready): 2

This is correct: both endpoints are active and circuit breaking isn't involved. One would expect that when circuit breaking is turned on, the breaker opening would result in 1 pending and 1 ready. Unfortunately, what you actually get is

2024-08-14T18:29:36.993000: emissary.emissary -> face.faces (pending): 1
2024-08-14T18:29:36.993000: emissary.emissary -> face.faces (ready): 3

which is a bit surprising! Then, when the breaker is turned off, you get

2024-08-14T18:30:37.132000: emissary.emissary -> face.faces (pending): 0
2024-08-14T18:30:37.132000: emissary.emissary -> face.faces (ready): 4

So pending seems to work fine, but the ready endpoints seem to be miscounted.

How can it be reproduced?

Enable circuit breaking and force the breaker to open. Watch pending and ready endpoints as you go.
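For reference, here's roughly what that looks like using Linkerd's consecutive-failure accrual annotations on the Service; the max-failures threshold below is just an example value:

# Enable circuit breaking (consecutive-failure accrual) for the face Service;
# the threshold here is only an example.
kubectl -n faces annotate svc/face \
  balancer.linkerd.io/failure-accrual=consecutive \
  balancer.linkerd.io/failure-accrual-consecutive-max-failures=7

# Watch the balancer gauges from the client (Emissary) proxy while the breaker
# opens and closes.
linkerd diagnostics proxy-metrics -n emissary deploy/emissary \
  | grep outbound_http_balancer_endpoints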

Logs, error output, etc

See above. 🙂

output of linkerd check -o short

:; linkerd check -o short
Status check results are √

Environment

I'm using a kind cluster at the moment, K8s 1.30.3, Linkerd version edge-24.8.2.

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

kflynn commented 2 months ago

Whoops, I should've added that those lines of output are from running this PromQL query

outbound_http_balancer_endpoints{deployment="emissary", namespace="emissary", backend_name="face", backend_namespace="faces"}

and then formatting the values that come back for each endpoint_state; the same data shows up in Grafana (or whatever other frontend) as well.
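In case it helps anyone reproduce this, here's roughly how to pull those numbers straight from Prometheus; this assumes the linkerd-viz Prometheus at its default service name and port, so adjust for your own setup:

# Port-forward the viz extension's Prometheus (adjust if you scrape the
# proxies with your own Prometheus).
kubectl -n linkerd-viz port-forward svc/prometheus 9090:9090 &

# Run the query and print one line per endpoint_state.
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=outbound_http_balancer_endpoints{deployment="emissary", namespace="emissary", backend_name="face", backend_namespace="faces"}' \
  | jq -r '.data.result[] | "\(.metric.endpoint_state): \(.value[1])"'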

adleong commented 2 months ago

This looks like it might be similar to https://github.com/linkerd/linkerd2-proxy/pull/2928

olix0r commented 2 months ago

Are you able to provide the output of linkerd diagnostics proxy-metrics and kubectl logs against a client in this state? This should help shine light on the nature of the issue.
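Something along these lines should do it; the resource names are taken from the metric labels above (deployment="emissary", namespace="emissary"):

# Snapshot the client proxy's metrics (includes outbound_http_balancer_endpoints).
linkerd diagnostics proxy-metrics -n emissary deploy/emissary > emissary-proxy-metrics.txt

# Capture the client's proxy logs from the same period.
kubectl -n emissary logs deploy/emissary -c linkerd-proxy > emissary-linkerd-proxy.log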