istio / old_issues_repo

Deprecated issue-tracking repo, please post new issues or feature requests to istio/istio instead.

Circuit breaker does not appear to work #140

Open georgeharley opened 6 years ago

georgeharley commented 6 years ago

Is this a BUG or FEATURE REQUEST?: Bug

Did you review existing epics or issues to identify if this is already being worked on? (please try to add the correct labels and epics):

Bug: Y

What Version of Istio and Kubernetes are you using, where did you get Istio from, Installation details

istioctl version 0.3.0
kubectl version 1.8.4

The cluster is running on GKE and was installed using the Istio GKE Deployment Manager.

Is Istio Auth enabled or not? In the Deployment Manager the "Enable mutualTLS authentication" box was left checked, so I assume this means Istio Auth is enabled.

What happened: There are two services running in the cluster: serviceA (which is externally accessible) and serviceB. serviceA calls serviceB as part of its request handling logic. A simple route rule for serviceB was created and, later on, a destination policy was created declaring a simple circuit breaker for serviceB with an httpConsecutiveErrors value of 1. After checking that external requests to serviceA were successful and verifying that each response contained data from serviceB (confirming that serviceB was being called), serviceB was made to return an HTTP 500 internal server error response.

Route Rule for ServiceB

apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: bookstore-default
spec:
  destination:
    name: bookstore
  route:
  - labels:
      version: v1
    weight: 100

Destination Policy for ServiceB

apiVersion: config.istio.io/v1alpha2
kind: DestinationPolicy
metadata:
  name: bookstore-cb-policy
  namespace: default
spec:
  destination:
    name: bookstore
    labels:
      version: v1
  circuitBreaker:
    simpleCb:
      httpConsecutiveErrors: 1
      httpDetectionInterval: 1s
      httpMaxEjectionPercent: 100
      httpMaxPendingRequests: 1
      maxConnections: 1
      sleepWindow: 10m

What you expected to happen: I expected to see one of serviceB's pods ejected from the load balancing pool for a period of time no shorter than the value of sleepWindow. What actually happened was that the pods associated with serviceB remained unchanged (no ejection was observed). I am not aware of any way of verifying the correctness of the destination policy containing the circuit breaker other than sending traffic to the destination service and arranging for an error to be raised.
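As a basic sanity check that the policy was at least accepted (only a sketch; it assumes the config.istio.io kinds are registered as Kubernetes CRDs in this installation), the stored object can be listed directly:

# Assumption: DestinationPolicy objects are stored as config.istio.io CRDs and
# are therefore visible to kubectl. This only confirms the policy exists, not
# that Envoy is enforcing it.
kubectl get destinationpolicies bookstore-cb-policy -o yaml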

How to reproduce it: Including a cookie "fail=yes" in any external request to the exposed serviceA will result in the serviceB call for that request returning an HTTP 500 response. The serviceA implementation just forwards the cookie value to serviceB. The serviceB implementation checks for the existence of the "fail=yes" cookie and, if found, immediately returns the 500 error; otherwise it completes normally and returns a 200 response code.
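For example, a request along these lines should exercise the failure path (the external address and path here are placeholders, not values from the actual deployment):

# Hypothetical ingress address and path; substitute the real values for
# serviceA. The cookie makes the downstream serviceB call return HTTP 500.
curl -v --cookie "fail=yes" http://<serviceA-external-address>/<serviceA-path>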

Feature Request: N

Describe the feature: N/A

georgeharley commented 6 years ago

After making a change to the destination policy so that httpDetectionInterval is 10 seconds rather than 1 second, I believe there is evidence that circuit breaking is going on. The evidence comes from analysis of a serviceB pod's Envoy proxy log, which shows no further requests being handled after the pod was made to return a 500 error (via the cookie mechanism described in the previous comment). Once the sleepWindow value expires, the proxy log does indeed show that the pod starts to get requests routed to it again according to the current load balancing algorithm.
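For reference, the proxy log can be inspected along these lines (a sketch; it assumes the Istio-injected sidecar container is named istio-proxy and that the real pod name is substituted):

# Tail the Envoy sidecar's access log for one of serviceB's pods.
kubectl logs <serviceB-pod-name> -c istio-proxy --since=10m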

Apart from peeking in the pod proxy logs, is there any other way that developers or administrators can check whether a running pod is currently ejected from its load balancing pool? I can't readily see any tell-tale signs in the Kubernetes dashboard, so perhaps there's a CLI way of finding this out? As things stand it seems a bit subtle :-)

One other observation: if the load balancer pool contains just two pods and both of them cause the configured circuit breaker to trip (in the previously described scenario this could be done by using the request cookie to make both pods return an HTTP 500 error), then it seems that the sleep window for both pods immediately expires and they both get automatically added back into the load balancing pool, where they can start handling requests again. That's what I've observed, anyway. Does it sound like it is behaving as designed? Is there any documentation available that describes what is supposed to happen in this scenario?

cjremshaw commented 6 years ago

I encountered this as well with httpMaxEjectionPercent set to 100 and using istio-0.4.0:

# set route rule to v2 reviews only

cat <<EOF | kubectl create -f -
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-only-v2
spec:
  destination:
    name: reviews
  route:
  - labels:
      version: v2
EOF

# Now add the ability to fail when user=failme

cat <<EOF | kubectl create -f -
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: ratings-abort
spec:
  destination:
    name: ratings
  match:
    request:
      headers:
        cookie:
          regex: "^(.*?;)?(user=failme)(;.*)?$"
  route:
  - labels:
      version: v1
  httpFault:
    abort:
      percent: 100
      httpStatus: 500
EOF

# Add circuitbreaker

cat <<EOF | kubectl create -f -
apiVersion: config.istio.io/v1alpha2
kind: DestinationPolicy
metadata:
  name: ratings-cb-policy
  namespace: default
spec:
  destination:
    name: ratings
    labels:
      version: v1
  circuitBreaker:
    simpleCb:
      maxConnections: 100
      httpConsecutiveErrors: 5
      sleepWindow: 5m
      httpDetectionInterval: 10s
      httpMaxEjectionPercent: 100
EOF

Reloading as user "failme" multiple times (which shows no ratings and should trigger the circuit breaker) has no effect: once logged out, ratings continue to show. I expected that ratings would NOT show for a logged-out user for 5m.

Update with logs showing 8 consecutive errors (which should have opened the circuit for 5m), followed by a 200 response less than 2 seconds later and another 200 more than 2 minutes later:

[2018-01-15T19:52:14.843Z] "GET /ratings/0 HTTP/1.1" 500 FI 0 18 3 - "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "-"
[2018-01-15T19:52:21.110Z] "GET /ratings/0 HTTP/1.1" 500 FI 0 18 12 - "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "-"
[2018-01-15T19:52:22.120Z] "GET /ratings/0 HTTP/1.1" 500 FI 0 18 1 - "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "-"
[2018-01-15T19:52:22.892Z] "GET /ratings/0 HTTP/1.1" 500 FI 0 18 2 - "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "-"
[2018-01-15T19:52:23.665Z] "GET /ratings/0 HTTP/1.1" 500 FI 0 18 1 - "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "-"
[2018-01-15T19:52:24.615Z] "GET /ratings/0 HTTP/1.1" 500 FI 0 18 0 - "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "-"
[2018-01-15T19:52:25.913Z] "GET /ratings/0 HTTP/1.1" 500 FI 0 18 1 - "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "-"
[2018-01-15T19:52:28.732Z] "GET /ratings/0 HTTP/1.1" 500 FI 0 18 14 - "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "-"
[2018-01-15T19:52:30.283Z] "GET /ratings/0 HTTP/1.1" 200 - 0 48 44 39 "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "[REDACTED]:9080"
[2018-01-15T19:54:49.845Z] "GET /ratings/0 HTTP/1.1" 200 - 0 48 29 24 "-" "Apache-CXF/3.1.11" "[REDACTED]" "ratings:9080" "[REDACTED]:9080"

georgeharley commented 6 years ago

@cjremshaw I wonder if what we have separately seen relates to Envoy's panic threshold?
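A hedged way to check for that is to look at the load balancer's panic counter in the calling pod's sidecar stats (a sketch; it assumes the sidecar admin interface listens on port 15000, as in the stats command used later in this thread, and the exact cluster name in the output will vary):

# If Envoy fell back to panic routing because too few hosts were considered
# healthy, the lb_healthy_panic counter for the affected upstream cluster
# should be non-zero.
kubectl exec -ti $CALLER_POD -- curl -s localhost:15000/stats | grep lb_healthy_panic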

GnawNom commented 6 years ago

I'm also encountering problems with circuit breaking. I have a similar setup, but A continually sends GET requests to B (which always returns 5xx). I'd expect "no healthy upstreams" to be returned once httpConsecutiveErrors is exceeded, but the 5xx responses still come through. I suspected it might be because I only have one pod in B's deployment, so I set httpMaxEjectionPercent = 100, but that doesn't seem to help.

I dug through Envoy's documentation to see what kind of circuit breaking parameters are available, and nothing seems to match httpConsecutiveErrors:

priority
(RoutingPriority) The RoutingPriority the specified CircuitBreaker settings apply to.
max_connections
(UInt32Value) The maximum number of connections that Envoy will make to the upstream cluster. If not specified, the default is 1024.
max_pending_requests
(UInt32Value) The maximum number of pending requests that Envoy will allow to the upstream cluster. If not specified, the default is 1024.
max_requests
(UInt32Value) The maximum number of parallel requests that Envoy will make to the upstream cluster. If not specified, the default is 1024.
max_retries
(UInt32Value) The maximum number of parallel retries that Envoy will allow to the upstream cluster. If not specified, the default is 3.

(https://www.envoyproxy.io/docs/envoy/latest/api-v2/cds.proto.html#envoy-api-msg-circuitbreakers)

Unless the ejection for httpConsecutiveErrors is implemented in Istio itself, I don't think Envoy supports circuit breaking on HTTP 5xx responses. But if that's the case, it's confusing why the documentation would list that as a configurable field in SimpleCircuitBreakerPolicy.

GnawNom commented 6 years ago

Hmm I was mistaken. It seems like the functionality is located under Outlier Detection: https://www.envoyproxy.io/docs/envoy/latest/api-v1/cluster_manager/cluster_outlier_detection.html?highlight=outlier

GnawNom commented 6 years ago

@georgeharley I was able to find some stats related to outlier detection, which makes debugging less of a mystery.

cluster.out.pinger-b.istio-poc-dev.svc.cluster.local|http|version=broken.membership_change: 6
cluster.out.pinger-b.istio-poc-dev.svc.cluster.local|http|version=broken.membership_healthy: 4
cluster.out.pinger-b.istio-poc-dev.svc.cluster.local|http|version=broken.membership_total: 5
cluster.out.pinger-b.istio-poc-dev.svc.cluster.local|http|version=broken.outlier_detection.ejections_active: 1
cluster.out.pinger-b.istio-poc-dev.svc.cluster.local|http|version=broken.outlier_detection.ejections_consecutive_5xx: 68
cluster.out.pinger-b.istio-poc-dev.svc.cluster.local|http|version=broken.outlier_detection.ejections_overflow: 55
cluster.out.pinger-b.istio-poc-dev.svc.cluster.local|http|version=broken.outlier_detection.ejections_success_rate: 0
cluster.out.pinger-b.istio-poc-dev.svc.cluster.local|http|version=broken.outlier_detection.ejections_total: 13

kubectl exec -ti --namespace=$NAMESPACE $POD_NAME -- curl localhost:15000/stats | grep outlier_detection

where POD_NAME is a pod that is sending requests to a service with circuit breaking enabled.
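Another place that may show the same information, assuming this Envoy build exposes per-host health flags on the /clusters admin endpoint (not verified against this exact version), is:

# Hosts currently ejected by outlier detection should be flagged (e.g. with
# failed_outlier_check) in the per-host health_flags lines, if this Envoy
# version reports them.
kubectl exec -ti --namespace=$NAMESPACE $POD_NAME -- curl -s localhost:15000/clusters | grep health_flags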

jeson114 commented 6 years ago

Facing the same issue. I set up an HTTP abort on a service so it always returns 500 and set consecutiveErrors: 1, but the circuit breaker never trips. Any solution to this @georgeharley?

UnwashedMeme commented 6 years ago

I'm seeing this as well with an external service defined with a ServiceEntry. When I set up a loop I can see the host being ejected, but then the ejection goes away almost immediately. Looking at some stats, I think I can reject some hypotheses:

(all stats are prefixed with cluster.outbound|80||enrollments-...)

If I set up something to pound the endpoint that fails, I can see a member being ejected, but when the 5s membership change gets updated, all of them go back to "healthy".

My hypothesis is that Istio is updating the Envoy config with information about the service every 5s or so, and that doing so resets Envoy's outlier detection such that it sees everything as healthy again.
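One way to test that hypothesis, sketched under the same assumptions as the earlier stats command (sidecar admin interface on port 15000 of the calling pod), is to sample the relevant counters every second and see whether ejections_active drops back to zero each time membership_change increments:

# Sample the outlier-detection and membership stats once per second. If every
# membership_change increment coincides with ejections_active returning to 0,
# that supports the idea that config pushes reset outlier detection state.
while true; do
  date
  kubectl exec --namespace=$NAMESPACE $POD_NAME -- curl -s localhost:15000/stats \
    | grep -E 'outlier_detection\.ejections_active|membership_change'
  sleep 1
done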