Hey @aaguilartablada, thanks for the information! Can you clarify -- do you mean that whenever Linkerd probes, the probe gets a failure that's visible to the client? or do you mean that whenever Linkerd probes, then that proxy gets stuck failing for some time?
The former is expected behavior: Linkerd's circuit breaking uses actual client requests for the probes. The latter, though, is very much not expected. 😐
Hi @kflynn! Very nice to discuss this with you. I've watched a lot of your academy sessions.
I'm afraid that what happens is the latter. I have repeated the tests in order to show you what I see. To launch the new tests I've configured a more aggressive circuit breaker for my 'example-api' service:
balancer.linkerd.io/failure-accrual: consecutive
balancer.linkerd.io/failure-accrual-consecutive-jitter-ratio: "0"
balancer.linkerd.io/failure-accrual-consecutive-max-failures: "5"
balancer.linkerd.io/failure-accrual-consecutive-max-penalty: 10s
balancer.linkerd.io/failure-accrual-consecutive-min-penalty: 10s
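For context, those are Service annotations; roughly, they sit on the example-api Service like this (the namespace, selector and ports below are assumptions for illustration, not my real manifest):

apiVersion: v1
kind: Service
metadata:
  name: example-api
  namespace: default              # assumed namespace
  annotations:
    balancer.linkerd.io/failure-accrual: consecutive
    balancer.linkerd.io/failure-accrual-consecutive-jitter-ratio: "0"
    balancer.linkerd.io/failure-accrual-consecutive-max-failures: "5"
    balancer.linkerd.io/failure-accrual-consecutive-max-penalty: 10s
    balancer.linkerd.io/failure-accrual-consecutive-min-penalty: 10s
spec:
  selector:
    app: example-api              # assumed selector
  ports:
    - port: 80                    # assumed service port
      targetPort: 8080            # assumed container port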
The K6 JavaScript I use for testing is written as an 'open model', so virtual users are created or destroyed as needed to maintain 100 req/sec:
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    openmodel: {
      executor: 'ramping-arrival-rate',
      startRate: 0,
      preAllocatedVUs: 1,
      maxVUs: 100,
      stages: [
        { target: 100, duration: '20s' },
        { target: 100, duration: '1200s' },
        { target: 0, duration: '20s' },
      ],
    },
  },
};

export default function () {
  const result = http.get(`https://${__ENV.ENDPOINT}/k6`);
  check(result, {
    'http response status code is 200': (r) => r.status === 200,
    'WEST EUROPE': (r) => r.body.includes('WEST'),
    'NORTH EUROPE': (r) => r.body.includes('NORTH'),
  });
}
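In case it helps to reproduce: assuming the script is saved as test.js, I run it with something like k6 run -e ENDPOINT=<hostname> test.js, where <hostname> is the endpoint in front of the HTTPRoute (the file name and hostname are placeholders, not my real values).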
The scenario is:
The results:
The biggest problem is that the client sees this, because K6 receives HTTP 500 codes (expected) and HTTP 504 codes (unexpected). This is the summary of requests from K6's point of view:
Let's repeat the test with a small change: two example-api replicas in each cluster. The result is very interesting. The process is: after a few minutes I force the circuit open for one replica in North Europe, a few minutes later I open the circuit for the other North Europe replica, and a moment after that I close the circuit for one of the endpoints:
Let me know if I can give you more information or launch some other kind of test.
Hi again @kflynn
Maybe this information is also useful. I have repeated the test with 2 different deployments and services in the same cluster, in order to check whether the problem lies in flat-network replication.
The results are the same: if one of the two services behind the HTTPRoute has circuit breaking applied to all of its endpoints, the issue is reproduced.
@aaguilartablada So it does not seem to depend on whether multicluster is involved -- you get the same behavior with an HTTPRoute in a single cluster?
Exactly, @kflynn. I repeated my tests with an HTTPRoute pointing to two services located in the same cluster and it behaved the same way.
Hey @aaguilartablada, sorry for the delay on this one! I had to learn a lot more about details of circuit breaking and such to be able to answer this. 😂
This behavior that you're seeing is the intersection of Linkerd's circuit breaking and load balancing implementations. The critical bit is that when the Linkerd proxy sees an incoming request, it figures out the destination workload, then hands the request off to the Linkerd load balancer for that workload. "Load balancer" here refers to the bit in Linkerd that distributes requests to individual backends, and there are some subtleties in it.
In normal behavior, the request gets put on a queue for the load balancer. The load balancer spends its life pulling things off its queue and dispatching them to the "best" of the available pool of backends. So far so good. When circuit breaking decides to open the circuit to a particular backend, what happens is basically that that backend is removed from the pool of available backends.
This is fine until, of course, there are no available backends in the pool. When that state persists for more than a few seconds, the load balancer enters failfast:
Note the different responses there! The 504 happens only at the point that the load balancer enters failfast; technically, the 503 is the load balancer implementing load shedding, which can happen if the load balancer is in failfast or, for example, if the queue gets full because the load balancer suddenly gets slammed with a flood of incoming requests.
To get out of failfast, some backend has to become available again. If you got into failfast due to circuit breaking marking all your backends unavailable, the most likely way you'll get out is that circuit breaking will allow a probe request to go through, it'll succeed, and then the backend will be marked available again. At that point the load balancer will come out of failfast and things will start being handled normally again.
So, ultimately, I think you're seeing expected behavior, where the circuit opens on your only backend and the 504s are being sent to requests that were already in the load balancer queue at the moment the load balancer entered failfast. Later, the circuit is allowed to close, and things start working again -- and, of course, if you have multiple replicas, you have multiple backends, so you shouldn't see 504s unless you force the circuit breaker for all of them simultaneously.
And, again, sorry it took so long to get the answer here!
I think the summary here is that Linkerd is working as intended? I'm going to close this as there's been no further discussion here.
OH! So sorry. I was very busy the last few weeks and didn't realize @kflynn had replied. I don't think it is working as expected; I think there was a misunderstanding, or I didn't explain it correctly.
Let's suppose I have 2 services: service A and service B. Behind service A I have 3 pods: A1, A2 and A3. Behind service B I have 3 pods: B1, B2 and B3.
We configure an HTTPRoute to route 50/50 to service A and service B. Pods A1, A2 and A3 run healthy forever; they are perfect. The problem appears when pods B1, B2 and B3 all start failing, so the circuit opens for every endpoint behind service B.
In this situation, what I expect is for Linkerd to keep the circuit open for B1, B2 and B3 while sending 100% of the traffic to pods A1, A2 and A3, with occasional probe requests sent to B1, B2 and B3 to test them. What I see instead is the situation described above: every time Linkerd probes one of the pods behind service B, the Linkerd proxy gets stuck.
The point is that it doesn't matter how many pods have the circuit open in service B: as long as at least one pod behind service B is healthy, everything works as expected. As soon as all of service B's pods have their circuits open, the proxy misbehaves even though service A is perfectly healthy.
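To make the scenario concrete, this is roughly the kind of HTTPRoute I mean, written against the Gateway API HTTPRoute resource that Linkerd supports for traffic splitting; every name, namespace and port here is an assumption for illustration, not my real manifest:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: split-a-b
  namespace: default
spec:
  parentRefs:
    - group: core
      kind: Service
      name: service-a            # the Service that clients actually call
      port: 80
  rules:
    - backendRefs:
        - name: service-a        # pods A1, A2, A3
          port: 80
          weight: 50
        - name: service-b        # pods B1, B2, B3
          port: 80
          weight: 50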
Again, so sorry @kflynn for not noticing your answer.
@olix0r, can we reopen the issue?
What is the issue?
We have 2 clusters in two different regions with virtual network peering. This allows us to use Linkerd multicluster in flat network mode. This is working as expected. After doing some experiments we see some strange behavior.
We have a very simple API (example-api) deployed in both clusters with one replica. We can configure the API any time to respond 200 or 500.
We have replicated the Service, applying circuit breaking.
We have created an HTTPRoute to split traffic 50/50.
Using K6 we start a test that launches 300 requests per second towards our example-api. At the beginning everything works as expected: low-latency responses and 300 req/sec.
When we set the remote example-api to respond with 500 errors, we can see the circuit breaker act and open the circuit as expected. The problem (or strange behavior) is that every time Linkerd sends a request to check the status of the remote endpoint, the proxy gets stuck and makes the test fail with 504 errors from the Linkerd proxy. You can see the proxy logs below.
The effect on the K6 test is as follows. You can see that everything is OK at the beginning, until we set the remote API to respond 500:
IMPORTANT NOTE: when we have multiple replicas in the remote cluster, the behavior is NOT reproduced. Instead, the Linkerd proxy applies circuit breaking to the "bad" endpoint and continues working as expected.
How can it be reproduced?
Create a multicluster environment in flat-network mode and replicate services with one and only one endpoint.
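(If it helps anyone reproducing this: as far as I remember the 2.14 flat-network setup, the clusters are linked with linkerd multicluster link --cluster-name <name> --gateway=false and the Service is exported by labeling it mirror.linkerd.io/exported=remote-discovery, but please treat the exact flag and label value as assumptions and double-check the multicluster docs.)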
Logs, error output, etc
Output of linkerd check -o short:
IMPORTANT: the linkerd-multicluster check fails because the Linkerd control planes communicate through private virtual network integration, as you can see in the Kubernetes API endpoint.
Environment
kubernetes version: 1.27.4
cluster environment: AKS
host os: linux
linkerd version: 2.14.0
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None