linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

FailFast state for services in HttpRoute without endpoints #11493

Closed · aaguilartablada closed this issue 8 months ago

aaguilartablada commented 11 months ago

What is the issue?

We have two clusters in two different regions with virtual network peering, which allows us to use Linkerd multicluster in flat network mode. This works as expected, but after some experiments we have noticed some strange behavior.

We have a very simple API (example-api) deployed in both clusters with one replica each. We can configure the API at any time to respond with 200 or 500.
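
Purely as an illustration (this is not our actual service), a toggleable responder of that kind can be as small as this Node sketch; the /toggle path and the port are made up for the example:

const http = require('http');

let fail = false; // flipped at runtime to switch between 200 and 500 responses

http.createServer((req, res) => {
  if (req.url === '/toggle') {            // hypothetical toggle endpoint
    fail = !fail;
    res.end(`fail=${fail}\n`);
    return;
  }
  res.statusCode = fail ? 500 : 200;
  res.end('Hello from WEST EUROPE\n');    // response body identifies the cluster/region
}).listen(8080);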

kubectl --context devops-test-we-002 -n example-dev get pod
NAME                          READY   STATUS    RESTARTS   AGE
example-api-f67c85df4-7jk4l   2/2     Running   0          21h

kubectl --context devops-test-ne-002 -n example-dev get pod
NAME                           READY   STATUS    RESTARTS   AGE
example-api-849f4d987d-tkvtq   2/2     Running   0          19h

We have replicated the Service, applying circuit breaking through the failure-accrual annotation.

apiVersion: v1
kind: Service
metadata:
  annotations:
    balancer.linkerd.io/failure-accrual: consecutive
  labels:
    mirror.linkerd.io/exported: remote-discovery
  name: example-api
  namespace: example-dev
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  selector:
    app: example-api
  type: ClusterIP

kubectl --context devops-test-we-002 -n example-dev get svc
NAME                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
example-api               ClusterIP   10.0.100.7     <none>        80/TCP    15h
example-api-northeurope   ClusterIP   10.0.106.190   <none>        80/TCP    3d20h

We have created an HTTPRoute to split traffic 50/50.

apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: example-api
  namespace: example-dev
spec:
  parentRefs:
  - group: core
    kind: Service
    name: example-api
    port: 80
  rules:
  - backendRefs:
    - name: example-api
      port: 80
      weight: 1
    - name: example-api-northeurope
      port: 80
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /

Using K6, we start a test that sends 300 requests per second towards our example-api. At the beginning everything works as expected: low-latency responses and 300 req/sec.

When we set the remote example-api to respond with 500 errors, we can see the circuit breaker act and open the circuit as expected. The problem (or strange behavior) is that every time Linkerd sends a request to check the status of the remote endpoint, the proxy gets stuck and the test fails with 504 errors from the Linkerd proxy. You can see the proxy logs below.

The effect on the K6 tests is as follows. You can see that everything is OK at the beginning, until we set the remote API to respond with 500:

[Screenshot: K6 results, 2023-10-17 09:37:57]

IMPORTANT NOTE: when we have multiple replicas in the remote cluster, the behavior is NOT reproduced. Instead, the Linkerd proxy applies circuit breaking to the "bad" endpoint and continues working as expected.
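
For what it's worth, switching between the two behaviors only requires scaling the remote Deployment (assuming it is named example-api):

kubectl --context devops-test-ne-002 -n example-dev scale deployment example-api --replicas=2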

How can it be reproduced?

Create a multicluster environment in flat network mode and replicate services that have one and only one endpoint.
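
Roughly, the export step looks like this, assuming the clusters are already linked with linkerd multicluster link and that flat-network (remote discovery) replication is in use, as in the Service shown above:

kubectl --context devops-test-ne-002 -n example-dev \
  label svc example-api mirror.linkerd.io/exported=remote-discovery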

Logs, error output, etc

[  1234.073345s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http:service{ns=example-dev name=example-api-northeurope port=80}: tower::balance::p2c::service: updating from discover
[  1234.073427s]  WARN ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http: linkerd_stack::failfast: Service entering failfast after 3s
[  1234.073439s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http: linkerd_stack::gate: Gate shut
[  1234.073446s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http: linkerd_stack::failfast: Service in failfast
[  1234.073450s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http: tower::buffer::worker: service.ready=true processing request
[  1234.073469s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http:service{ns=example-dev name=example-api-northeurope port=80}: tower::balance::p2c::service: updating from discover
[  1234.073518s]  INFO ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http:rescue{client.addr=10.2.0.118:50244}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.0.100.7:80: route HTTPRoute.example-dev.example-api: backend Service.example-dev.example-api-northeurope:80: service in fail-fast error.sources=[route HTTPRoute.example-dev.example-api: backend Service.example-dev.example-api-northeurope:80: service in fail-fast, backend Service.example-dev.example-api-northeurope:80: service in fail-fast, service in fail-fast]
[  1234.073541s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http: linkerd_app_core::errors::respond: Handling error on HTTP connection status=504 Gateway Timeout version=HTTP/1.1 close=true
[  1234.073624s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http: hyper::proto::h1::io: flushed 303 bytes
[  1234.073656s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
[  1234.073697s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}: linkerd_app_core::serve: Connection closed
[  1234.073714s] DEBUG ThreadId(01) evict{key=Http(HttpSidecar { orig_dst: OrigDstAddr(10.0.100.7:80), version: HTTP/1, routes: Receiver { shared: Shared { value: RwLock(PhantomData<std::sync::rwlock::RwLock<linkerd_app_outbound::http::logical::Routes>>, RwLock { data: Policy(Http(Params { addr: Socket(10.0.100.7:80), meta: ParentRef(Resource { group: "core", kind: "Service", name: "example-api", namespace: "example-dev", section: None, port: Some(80) }), routes: [Route { hosts: [], rules: [Rule { matches: [MatchRequest { path: Some(Prefix("/")), headers: [], query_params: [], method: None }], policy: RoutePolicy { meta: Resource { group: "policy.linkerd.io", kind: "HTTPRoute", name: "example-api", namespace: "example-dev", section: None, port: None }, filters: [], distribution: RandomAvailable([(RouteBackend { filters: [], backend: Backend { meta: Resource { group: "core", kind: "Service", name: "example-api", namespace: "example-dev", section: None, port: Some(80) }, queue: Queue { capacity: 100, failfast_timeout: 3s }, dispatcher: BalanceP2c(PeakEwma(PeakEwma { decay: 10s, default_rtt: 30ms }), DestinationGet { path: "example-api.example-dev.svc.cluster.local:80" }) }, request_timeout: None }, 1), (RouteBackend { filters: [], backend: Backend { meta: Resource { group: "core", kind: "Service", name: "example-api-northeurope", namespace: "example-dev", section: None, port: Some(80) }, queue: Queue { capacity: 100, failfast_timeout: 3s }, dispatcher: BalanceP2c(PeakEwma(PeakEwma { decay: 10s, default_rtt: 30ms }), DestinationGet { path: "example-api-northeurope.example-dev.svc.cluster.local:80" }) }, request_timeout: None }, 1)]), request_timeout: None, failure_policy: StatusRanges([500..=599]) } }] }], backends: [Backend { meta: Default { name: "service" }, queue: Queue { capacity: 100, failfast_timeout: 3s }, dispatcher: BalanceP2c(PeakEwma(PeakEwma { decay: 10s, default_rtt: 30ms }), DestinationGet { path: "example-api.example-dev.svc.cluster.local:80" }) }, Backend { meta: Resource { group: "core", kind: "Service", name: "example-api-northeurope", namespace: "example-dev", section: None, port: Some(80) }, queue: Queue { capacity: 100, failfast_timeout: 3s }, dispatcher: BalanceP2c(PeakEwma(PeakEwma { decay: 10s, default_rtt: 30ms }), DestinationGet { path: "example-api-northeurope.example-dev.svc.cluster.local:80" }) }, Backend { meta: Resource { group: "core", kind: "Service", name: "example-api", namespace: "example-dev", section: None, port: Some(80) }, queue: Queue { capacity: 100, failfast_timeout: 3s }, dispatcher: BalanceP2c(PeakEwma(PeakEwma { decay: 10s, default_rtt: 30ms }), DestinationGet { path: "example-api.example-dev.svc.cluster.local:80" }) }], failure_accrual: ConsecutiveFailures { max_failures: 7, backoff: ExponentialBackoff { min: 1s, max: 60s, jitter: 0.5 } } })) }), version: Version(0), is_closed: false, ref_count_rx: 44 }, version: Version(0) } })}: linkerd_idle_cache: Awaiting idleness
[  1234.130435s] DEBUG ThreadId(01) inbound:accept{client.addr=188.86.156.54:51579}: linkerd_app_core::serve: Connection closed
[  1234.407927s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:58704}:proxy{addr=10.0.100.7:80}:service{ns=example-dev name=example-api-northeurope port=80}:endpoint{addr=10.3.0.238:80}: linkerd_app_outbound::http::breaker::consecutive_failures: Probation
[  1234.407979s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:58704}:proxy{addr=10.0.100.7:80}:service{ns=example-dev name=example-api-northeurope port=80}:endpoint{addr=10.3.0.238:80}: linkerd_stack::gate: Gate limited
[  1234.407989s] DEBUG ThreadId(01) outbound:accept{client.addr=10.2.0.118:50244}:proxy{addr=10.0.100.7:80}:http:service{ns=example-dev name=example-api-northeurope port=80}: tower::balance::p2c::service: updating from discover
[  1234.408009s] DEBUG ThreadId(01) linkerd_stack::gate: Gate opened
[  1234.408012s]  INFO ThreadId(01) linkerd_stack::failfast: Service has recovered

output of linkerd check -o short

IMPORTANT: the linkerd-multicluster check fails because the Linkerd control planes communicate over a private virtual network integration, as you can see from the Kubernetes API endpoint.

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-10-19T07:39:00Z
    see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.14.0 but the latest stable version is 2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.14.0 but the latest stable version is 2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-7bd5bf99fc-97g4h (stable-2.14.0)
    * linkerd-destination-7bd5bf99fc-cgwmh (stable-2.14.0)
    * linkerd-destination-7bd5bf99fc-h8vm2 (stable-2.14.0)
    * linkerd-identity-6f8985f4c8-2dv9l (stable-2.14.0)
    * linkerd-identity-6f8985f4c8-hrrnc (stable-2.14.0)
    * linkerd-identity-6f8985f4c8-knhvs (stable-2.14.0)
    * linkerd-proxy-injector-6c9ccfc67b-2rqjd (stable-2.14.0)
    * linkerd-proxy-injector-6c9ccfc67b-8r2sk (stable-2.14.0)
    * linkerd-proxy-injector-6c9ccfc67b-l9z78 (stable-2.14.0)
    * metrics-api-6d968fc75b-lvmlx (stable-2.14.0)
    * prometheus-75c8d67658-dtdp6 (stable-2.14.0)
    * tap-57648b554-lnqv9 (stable-2.14.0)
    * tap-injector-6d55948885-cgfd9 (stable-2.14.0)
    * web-6fff6c7dd9-rqcl4 (stable-2.14.0)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints

linkerd-multicluster
--------------------
× remote cluster access credentials are valid
            * failed to connect to API for cluster: [northeurope]: Get "https://10.3.128.4/version?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    see https://linkerd.io/2.14/checks/#l5d-smc-target-clusters-access for hints
× clusters share trust anchors
    Problematic clusters:
    * northeurope: unable to fetch anchors: Get "https://10.3.128.4/api/v1/namespaces/linkerd/configmaps/linkerd-config?timeout=30s": context deadline exceeded
    see https://linkerd.io/2.14/checks/#l5d-multicluster-clusters-share-anchors for hints
× probe services able to communicate with all gateway mirrors
        wrong number (0) of probe gateways for target cluster northeurope
    see https://linkerd.io/2.14/checks/#l5d-multicluster-gateways-endpoints for hints
‼ all mirror services are part of a Link
        mirror service costs-calculator-ne.kube-system is not part of any Link
    mirror service costs-calculator-platform-gcc-devops-aks-test-ne-002.kube-system is not part of any Link
    see https://linkerd.io/2.14/checks/#l5d-multicluster-orphaned-services for hints
‼ multicluster extension proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-service-mirror-northeurope-59d74cf46-b4fps (stable-2.14.0)
    see https://linkerd.io/2.14/checks/#l5d-multicluster-proxy-cp-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-7bd5bf99fc-97g4h (stable-2.14.0)
    * linkerd-destination-7bd5bf99fc-cgwmh (stable-2.14.0)
    * linkerd-destination-7bd5bf99fc-h8vm2 (stable-2.14.0)
    * linkerd-identity-6f8985f4c8-2dv9l (stable-2.14.0)
    * linkerd-identity-6f8985f4c8-hrrnc (stable-2.14.0)
    * linkerd-identity-6f8985f4c8-knhvs (stable-2.14.0)
    * linkerd-proxy-injector-6c9ccfc67b-2rqjd (stable-2.14.0)
    * linkerd-proxy-injector-6c9ccfc67b-8r2sk (stable-2.14.0)
    * linkerd-proxy-injector-6c9ccfc67b-l9z78 (stable-2.14.0)
    * metrics-api-6d968fc75b-lvmlx (stable-2.14.0)
    * prometheus-75c8d67658-dtdp6 (stable-2.14.0)
    * tap-57648b554-lnqv9 (stable-2.14.0)
    * tap-injector-6d55948885-cgfd9 (stable-2.14.0)
    * web-6fff6c7dd9-rqcl4 (stable-2.14.0)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints

linkerd-failover
----------------
‼ Linkerd extension command linkerd-failover exists
    exec: "linkerd-failover": executable file not found in $PATH
    see https://linkerd.io/2.14/checks/#extensions for hints

linkerd-smi
-----------
‼ Linkerd extension command linkerd-smi exists
    exec: "linkerd-smi": executable file not found in $PATH
    see https://linkerd.io/2.14/checks/#extensions for hints

Environment

Kubernetes version: 1.27.4
Cluster environment: AKS
Host OS: linux
Linkerd version: 2.14.0

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

kflynn commented 11 months ago

Hey @aaguilartablada, thanks for the information! Can you clarify -- do you mean that whenever Linkerd probes, the probe gets a failure that's visible to the client? Or do you mean that whenever Linkerd probes, the proxy gets stuck failing for some time?

The former is expected behavior: Linkerd's circuit breaking uses actual client requests for the probes. The latter, though, is very much not expected. 😐

aaguilartablada commented 11 months ago

Hi @kflynn! It's very nice to discuss this with you; I've watched a lot of your Academy sessions.

I'm afraid that what happens is the latter. I have repeated the tests in order to show you what I see. For the new tests I've configured a more aggressive circuit breaker for my 'example-api' service:

balancer.linkerd.io/failure-accrual: consecutive
balancer.linkerd.io/failure-accrual-consecutive-jitter-ratio: "0"
balancer.linkerd.io/failure-accrual-consecutive-max-failures: "5"
balancer.linkerd.io/failure-accrual-consecutive-max-penalty: 10s
balancer.linkerd.io/failure-accrual-consecutive-min-penalty: 10s
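
For context, these sit alongside the original annotation in the example-api Service's metadata, roughly:

apiVersion: v1
kind: Service
metadata:
  name: example-api
  namespace: example-dev
  annotations:
    balancer.linkerd.io/failure-accrual: consecutive
    balancer.linkerd.io/failure-accrual-consecutive-jitter-ratio: "0"
    balancer.linkerd.io/failure-accrual-consecutive-max-failures: "5"
    balancer.linkerd.io/failure-accrual-consecutive-max-penalty: 10s
    balancer.linkerd.io/failure-accrual-consecutive-min-penalty: 10s
# spec unchanged from the manifest shown earlier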

The K6 JavaScript that I use for the test is written as an 'open model', able to create or destroy virtual users as needed to maintain 100 req/sec depending on the situation:

import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    openmodel: {
      executor: 'ramping-arrival-rate',
      startRate: 0,
      preAllocatedVUs: 1,
      maxVUs: 100,
      stages: [
        { target: 100, duration: '20s' },
        { target: 100, duration: '1200s' },
        { target: 0, duration: '20s' },
      ],
    },
  },
};

export default function () {
  const result = http.get(`https://${__ENV.ENDPOINT}/k6`);
  check(result, {
    'http response status code is 200': (r) => r.status === 200,
    'WEST EUROPE': (r) => r.body.includes('WEST'),
    'NORTH EUROPE': (r) => r.body.includes('NORTH'),
  });
}
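
In case it helps with reproducing: I run it with k6's -e flag to pass the endpoint, something like this (hostname and filename are placeholders):

k6 run -e ENDPOINT=<api-hostname> k6-test.js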

The scenario is:

The results:

[Screenshot: Test 1 results]

  1. The throughput is affected: it can't maintain 100 req/sec while the circuit is open for the North Europe endpoint.
  2. I expected about 0.1 req/sec of unsuccessful requests, but it's over 10 req/sec.
  3. The virtual users graph shows that something is stuck, as virtual users grow dramatically exactly every 10 seconds (the max and min penalty of the circuit breaker).

The biggest problem is that the client sees this, because K6 receives HTTP 500s (expected) and HTTP 504s (unexpected). This is the summary of requests from K6's point of view:

[Screenshot: K6 request summary, 2023-10-19 09:11:23]

Let's repeat the test with a small change: two example-api replicas in each cluster. The result is very interesting. The process: after some minutes I'll force the circuit breaker for one replica in North Europe, some minutes later I'll open the circuit for the other North Europe replica, and after a moment I'll close one endpoint's circuit:

[Screenshot: Test 2 results]

  1. While there is at least one healthy endpoint in North Europe, everything works as expected.
  2. (Note that the 'Unsuccessful HTTP Requests' graph is now on a logarithmic scale, in order to compare 0.1 with 10.) While only one endpoint's circuit is open, unsuccessful requests stay around 0.1 req/sec, as expected.
  3. As soon as I open the circuit for the second endpoint, the scenario described in the first test occurs again.
  4. As soon as an endpoint comes back to a healthy state, everything recovers to the expected situation.

Let me know if I can give you more information or launch some other kind of test.

aaguilartablada commented 11 months ago

Hi again @kflynn

Maybe this information is also useful. I have repeated the test with two different deployments and services in the same cluster, in order to check whether the problem lies in the flat-network replication.

The results are the same: if one of the two services behind the HTTPRoute has circuit breaking applied to all of its endpoints, the issue is reproduced.
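
For reference, the single-cluster variant looked roughly like this; example-api-b is a hypothetical name for the second local Service, which replaces the mirrored example-api-northeurope backend:

apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: example-api
  namespace: example-dev
spec:
  parentRefs:
  - group: core
    kind: Service
    name: example-api
    port: 80
  rules:
  - backendRefs:
    - name: example-api
      port: 80
      weight: 1
    - name: example-api-b
      port: 80
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /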

kflynn commented 11 months ago

@aaguilartablada So it does not seem to depend on whether multicluster is involved -- you get the same behavior with an HTTPRoute in a single cluster?

aaguilartablada commented 11 months ago

Exactly, @kflynn. I repeated my tests with an HTTPRoute pointing to two services located in the same cluster and it behaved the same way.

kflynn commented 9 months ago

Hey @aaguilartablada, sorry for the delay on this one! I had to learn a lot more about details of circuit breaking and such to be able to answer this. 😂

This behavior that you're seeing is the intersection of Linkerd's circuit breaking and load balancing implementations. The critical bit is that when the Linkerd proxy sees an incoming request, it figures out the destination workload, then hands the request off to the Linkerd load balancer for that workload. "Load balancer" here refers to the bit in Linkerd that distributes requests to individual backends, and there are some subtleties in it.

In normal behavior, the request gets put on a queue for the load balancer. The load balancer spends its life pulling things off its queue and dispatching them to the "best" of the available pool of backends. So far so good. When circuit breaking decides to open the circuit to a particular backend, what happens is basically that that backend is removed from the pool of available backends.

This is fine until, of course, there are no available backends in the pool. When that state persists for more than a few seconds, the load balancer enters failfast:

  1. All requests remaining in the queue, waiting for backends that are no longer considered to be present, immediately get a 504 response.
  2. Any new request arriving while the load balancer is still in failfast gets an immediate 503 response.

Note the different responses there! The 504 happens only at the point that the load balancer enters failfast; technically, the 503 is the load balancer implementing load shedding, which can happen if the load balancer is in failfast or, for example, if the queue gets full because the load balancer suddenly gets slammed with a flood of incoming requests.
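
A toy model of that distinction, purely for illustration -- this is not Linkerd's code; the constants simply mirror the Queue { capacity: 100, failfast_timeout: 3s } values visible in the proxy log above:

const FAILFAST_TIMEOUT_MS = 3000; // mirrors failfast_timeout: 3s from the log
const QUEUE_CAPACITY = 100;       // mirrors capacity: 100 from the log

// state: { availableBackends: [], queue: [], unavailableSince: null, inFailfast: false }
function dispatch(state, request, nowMs) {
  if (state.availableBackends.length > 0) {
    // A backend is available again: leave failfast and serve normally.
    state.unavailableSince = null;
    state.inFailfast = false;
    return { status: 200, backend: state.availableBackends[0] };
  }
  if (state.unavailableSince === null) state.unavailableSince = nowMs;
  if (!state.inFailfast && nowMs - state.unavailableSince >= FAILFAST_TIMEOUT_MS) {
    state.inFailfast = true;
    // Entering failfast: everything already waiting in the queue fails with 504.
    for (const queued of state.queue) queued.status = 504;
    state.queue.length = 0;
  }
  if (state.inFailfast || state.queue.length >= QUEUE_CAPACITY) {
    return { status: 503 }; // load shedding for new arrivals
  }
  state.queue.push(request); // wait for a backend; may still become a 504 later
  return { status: 'queued' };
}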

To get out of failfast, some backend has to become available again. If you got into failfast due to circuit breaking marking all your backends unavailable, the most likely way you'll get out is that circuit breaking will allow a probe request to go through, it'll succeed, and then the backend will be marked available again. At that point the load balancer will come out of failfast and things will start being handled normally again.

So, ultimately, I think you're seeing expected behavior, where the circuit opens on your only backend and the 504s are being sent to requests that were already in the load balancer queue at the moment the load balancer entered failfast. Later, the circuit is allowed to close, and things start working again -- and, of course, if you have multiple replicas, you have multiple backends, so you shouldn't see 504s unless you force the circuit breaker for all of them simultaneously.

And, again, sorry it took so long to get the answer here!

olix0r commented 8 months ago

I think the summary here is that Linkerd is working as intended? I'm going to close this as there's been no further discussion here.

aaguilartablada commented 8 months ago

Oh! So sorry. I was very busy over the last few weeks and I didn't realize @kflynn had replied. I don't think it is working as expected; I think there was a misunderstanding, or I didn't explain it correctly.

Let's suppose I have 2 services: service A and service B. Behind service A I have 3 pods: A1, A2 and A3. Behind service B I have 3 pods: B1, B2 and B3.

We configure an HTTPRoute to route 50/50 to service A and service B. Pods A1, A2 and A3 stay healthy the whole time; they are perfect. The problem is as follows:

  1. I force B3 to respond with HTTP 500 errors. Linkerd opens the circuit for B3 and traffic is not affected.
  2. I force B2 to respond with HTTP 500 errors. Linkerd opens the circuit for B2 and traffic is not affected.
  3. I force B1 (now the only pod behind service B whose circuit is still closed) to respond with HTTP 500 errors. Now the problem appears.

In this situation, what I expect is for Linkerd to keep the circuit open for B1, B2 and B3 while sending 100% of the traffic to pods A1, A2 and A3, with only the occasional probe request sent to B1, B2 and B3 to test them. What I see instead is the situation described above: every time Linkerd probes one of the pods behind service B, the Linkerd proxy gets stuck.

The point is that it doesn't matter how many pods in service B have their circuit open: while there is at least one healthy pod in service B, everything works as expected. As soon as all of service B's pods have their circuits open, the proxy misbehaves even though service A is completely healthy.

Again, so sorry @kflynn for not noticing your answer.

aaguilartablada commented 8 months ago

@olix0r , can we reopen the issue?