istio / api

API definitions for the Istio project

OutlierDetection documentation wrong for error types #909

Open · howardjohn opened this issue 5 years ago

howardjohn commented 5 years ago

https://istio.io/docs/reference/config/networking/v1alpha3/destination-rule/#OutlierDetection

Under consecutiveErrors, the field docs say: 502, 503, and 504 only; 500 codes will not trigger the outlier detection.

But in the docs above:

For HTTP services, hosts that continually return 5xx errors for API calls are ejected from the pool for a pre-defined period of time.

So it sounds like the docs are incorrect here. From looking at the code and https://github.com/istio/api/pull/617, it will only work for 502/503/504.
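
For context, the field under discussion sits in a DestinationRule under trafficPolicy.outlierDetection. A minimal sketch (host and threshold values are illustrative, not from this thread):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews-outlier
spec:
  host: reviews.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5    # despite the generic name, only 502/503/504 count
      interval: 10s           # how often hosts are analyzed for ejection
      baseEjectionTime: 30s   # minimum time an ejected host stays out of the pool
      maxEjectionPercent: 50  # cap on how much of the pool may be ejected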

codependent commented 5 years ago

@howardjohn Thanks for clearing it up.

Just one question: why isn't the 500 status code included in the circuit breaking? In my view, a service that continuously returns Internal Server Errors should also be ejected from the pool. Otherwise we keep sending traffic to a service that is failing, for whatever reason.

howardjohn commented 5 years ago

Some context from the change #617 that introduced it:

We try not to expose the plethora of sometimes-confusing Envoy options to end users in the routing API.

Within a mesh, gateway errors (502/503/504) will be more common, while most sensible external services will return a 503 to shed load.

Secondly, we just made outlier detection generic to both TCP and HTTP. The consecutive gateway error applies only to HTTP and makes no sense in a TCP context.

I also feel that a 500 error code is not indicative of overload. The whole idea behind outlier detection is to remove overloaded servers from the load-balancing pool.

We don’t have very many users relying on this behavior, I think. We kept it intentionally generic so that we could switch to a more specific error code in the future (which happens to be now).

codependent commented 5 years ago

I understand the motivation now; however, I think there's another valid scenario that is not supported.

In our microservice infrastructure we were relying on Hystrix as an application circuit breaker. When downstream services are crashing (500 errors) we don't want to keep sending traffic to them, and Hystrix removed them temporarily from the pool.

Right now with Istio, these failing services are still being invoked despite actually being down. That is to say, we are protected from latency/overload but not from failure. We can always translate 500 errors into 502/503/504 at the microservices, but that seems like a workaround.

Does this make sense?

codependent commented 5 years ago

Envoy explicitly distinguishes between consecutive gateway failures (502, 503, 504) and consecutive errors (any 5xx), and supports both of them.

https://www.envoyproxy.io/docs/envoy/v1.5.0/intro/arch_overview/outlier

Istio’s consecutiveErrors field is therefore misleading; shouldn’t it be aligned with Envoy’s terminology? Also, why not allow both kinds of configuration?
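
For comparison, a rough sketch of the two corresponding Envoy cluster settings (Envoy field names; thresholds illustrative):

outlier_detection:
  consecutive_gateway_failure: 5   # eject after 5 consecutive 502/503/504 responses
  consecutive_5xx: 5               # eject after 5 consecutive responses with any 5xx code
  interval: 10s
  base_ejection_time: 30s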

venosov commented 5 years ago

I think the 500 status code should be included in the circuit breaking; resilience4j and Lagom support this too.

howardjohn commented 5 years ago

@rshriram any thoughts ^?

frankbu commented 5 years ago

So there are really 2 issues here.

  1. The documentation is currently wrong. The consecutiveErrors field is documented correctly, but the overview/example is wrong and needs to be fixed.

  2. Should Istio support both kinds of outlier detection that Envoy supports (5xx and gateway failures)? If so, would it be sufficient to add a boolean (e.g., allFailures) to choose whether consecutiveErrors maps to all 5xx errors or just the gateway ones?

@rshriram ^^^^
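
A sketch of how that boolean might look in a DestinationRule (the allFailures field is hypothetical, taken from the suggestion above; it does not exist in the API):

outlierDetection:
  consecutiveErrors: 5   # existing threshold
  allFailures: true      # hypothetical: true = any 5xx counts, false = 502/503/504 only
  interval: 10s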

rshriram commented 5 years ago

wow.. this has come full circle. So the options here are to change the semantics of consecutiveErrors once again [but then people like @arnecls will want to set this up only for 502/503, etc.], or, as @frankbu says, we could add a boolean like `treat500ErrorsAsFailure` so that people can decide whether to treat an application error (HTTP 500) as a sign of failure.

irubnich commented 4 years ago

Just wanted to bump this and give a +1 to adding a boolean to support all 5XX-level errors instead of just 502-504 gateway errors.

My reasoning for wanting this: when dealing with third parties, you really can't control what kinds of failure codes they return. If a third party returns 500s when it's down instead of 503s, then yes, that's semantically incorrect, but it's not something we control as integrators of the API.

Also, I think it's generally valid to want to treat multiple consecutive 500s as a good sign of "this host isn't healthy and we shouldn't route traffic to it".

Would love to see this issue get some more love!

yingzhuivy commented 4 years ago

+1 on supporting consecutive 5xx errors.

Another option for supporting this, instead of adding a boolean: we could add another field, so that consecutiveErrors = consecutive gateway errors and consecutive5xxErrors = consecutive 5xx errors. The benefit of this is that we could set separate thresholds for the two, just like Envoy.
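
In other words, something like this (both field names and thresholds hypothetical at this point in the discussion):

outlierDetection:
  consecutiveErrors: 5        # consecutive gateway errors (502/503/504)
  consecutive5xxErrors: 10    # consecutive errors of any 5xx type, with its own threshold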

yingzhuivy commented 4 years ago

We have two implementation options to support outlier detection for consecutive 5xx errors (500-599) instead of just gateway errors (502/503/504):

  1. Add a boolean, all_server_errors. By default, this is false, which means only 502, 503 and 504 return codes are considered errors. When this is set to true, all 5xx return codes are considered errors.
  2. Add a new int32, consecutive_server_errors that configures the number of consecutive server errors (all 5xx) before an endpoint is ejected. If the value is not set (default), endpoints are not ejected for non-gateway errors (e.g. 500, 505). Ejection could still be performed on gateway errors (502, 503, 504) using consecutive_errors.

Option 1 is clear. Users can configure exactly one way to eject endpoints based on upstream errors.

Option 2 provides a more flexible API. Users could configure more stringent outlier detection for gateway errors than for server errors. This comes at the cost of a more complex API: consecutive_errors and consecutive_server_errors would be confusing for new users to configure, especially because consecutive_errors actually means consecutive gateway errors.

I’m not sure how useful it is to be able to set the two thresholds separately. Which option do you prefer? @rshriram @frankbu @howardjohn

frankbu commented 4 years ago

@yingzhuivy I think Option 2 (the more flexible API) would be my choice. I think we can avoid confusing users by picking a better name for the new field. My suggestion for the 2 fields:

  1. consecutive_errors: consecutive gateway errors (502, 503, 504) as currently defined
  2. consecutive_5xx: consecutive errors of any 5xx type

howardjohn commented 4 years ago

@frankbu sounds good to me, although I would think consecutive_errors means ANY errors. But it's not worth changing the API to make that clearer; instead we should just document it clearly.

yingzhuivy commented 4 years ago

The current documentation for consecutiveErrors says nothing about how a zero value is interpreted. Looking at the implementation, we do not distinguish zero from unset, so the consecutiveErrors field effectively has these semantics: 0 or unset means use the default threshold, and any positive value is used as the threshold.

For the new consecutive5xxErrors field we could follow the same pattern (0 or unset means the default threshold).

The problem with this design is that there is no way to turn off consecutive gateway errors.

Some options to consider:

Option 1: Distinguish between 0 and unset. 0 means disable the feature, while unset means use the default.

consecutiveErrors: 0 → gateway-error ejection disabled; unset → default threshold

consecutive5xxErrors: 0 → 5xx ejection disabled; unset → default threshold

This is the most straightforward and ideal approach, but it requires changing the existing types to google.protobuf.UInt32Value. Alternatively, we could detect unset values during YAML parsing. If neither is feasible, we could consider:

Option 2: Allow negative values, which represent turning the feature off; 0 still means use the default.

consecutiveErrors: negative → gateway-error ejection disabled; 0 or unset → default threshold

consecutive5xxErrors: negative → 5xx ejection disabled; 0 or unset → default threshold

Option 3: Keep the existing API the same, and disable consecutive-gateway-error outlier detection when either of the following is true:

(If a user wants to enable consecutive 5xx errors only, they need to set consecutiveErrors equal to consecutive5xxErrors.) This option is too complex and hard to understand from the user's perspective, and there is still no way to disable both at the same time.

Thoughts? @frankbu @howardjohn

Also, with any of the options above, we need to have clear documentation, including the interpretation of zero/negative values.
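
Under Option 1, for example, a user who wanted 5xx-based ejection only could write something like this (a sketch of the proposed semantics, which are not implemented at this point):

outlierDetection:
  consecutiveErrors: 0      # explicit 0 disables gateway-error (502/503/504) ejection
  consecutive5xxErrors: 7   # eject after 7 consecutive responses with any 5xx code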

frankbu commented 4 years ago

Maybe this is a good excuse to deprecate the consecutive_errors field and add two new fields, consecutive_gateway_errors and consecutive_5xx_errors, with Option 1 semantics.

  // $hide_from_docs
  int32 consecutive_errors = 1 [deprecated=true];
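
If that lands, the user-facing configuration would presumably look like this (field names taken from the proposal above; values illustrative):

outlierDetection:
  consecutiveGatewayErrors: 5   # replaces consecutiveErrors; counts only 502/503/504
  consecutive5xxErrors: 10      # counts responses with any 5xx code
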
yingzhuivy commented 4 years ago

OK, what's the next step? Should I submit a PR deprecating consecutive_errors and adding the two new fields? Is there a deprecation process I should follow?

frankbu commented 4 years ago

@yingzhuivy Yes, just update the consecutive_errors field as I've shown above (mark it as deprecated and hide it from the docs), then add the two new fields with proper docs for how they work. Then add the new code, but make sure to leave the old field working as before (the validator should complain if both fields are set).

rshriram commented 4 years ago

How about this:

monitorErrors:
  gatewayErrors: ..
  all5xxErrors: ..
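
Embedded in an outlierDetection block, that nested variant would look something like this (a hypothetical sketch of the struct suggestion above):

outlierDetection:
  monitorErrors:
    gatewayErrors: 5   # consecutive 502/503/504 threshold
    all5xxErrors: 10   # consecutive any-5xx threshold
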
yingzhuivy commented 4 years ago

@rshriram I think we are not expecting any other types of monitored errors, right? If that's the case, wouldn't it be simpler to put the fields at the top level instead of combining them into a struct?

cc @jhahn21 @brianwolfe

MaximeBernard commented 4 years ago

From someone new to Istio (and I have no idea about the complexity involved):

outlierDetection:
  httpStatusCodes: [500, 501, 502, 503, 504] # Default: [502, 503, 504]
  maxEjectionPercent: 100
  consecutiveErrors: 1
  interval: 5s

would be both simple to understand and easy to use (and would therefore cover all scenarios).

If, for some reason, people want to use 403 or 401 (an auth service failing?) or any other scenario, they could configure it as they please.

But I guess this would require a change in Envoy first to open up that possibility.

[Edit] After thinking about it some more, I realize this is not helpful while Envoy does not implement it. Sorry for wasting your time. I'll leave it here in case any Envoy contributor comes by.