Open howardjohn opened 5 years ago
@howardjohn Thanks for clearing it up.
Just one question: why isn't the 500 status code included in the circuit breaking? In my view, a service that continuously returns Internal Server Errors should also be ejected from the pool. Otherwise we keep sending traffic to a service that is failing for whatever reason.
Some context from the change #617 that introduced it:
We try not to expose the plethora of sometimes confusing Envoy options to end users, in the routing api.
Within a mesh, gateway errors will be more common (502/503/504) while most sensible external services will return a 503 to shed load.
Secondly we just made the outlier detection generic to both tcp and http. The consecutive gateway error applies only to http and will make no sense in tcp context.
I also feel that 500 error code is not something indicative of overload. The whole idea behind outliers is to remove overloaded servers from the lb pool.
We don’t have very many users relying on this behavior I think. We kept it intentionally generic so that we can switch to a more specific error code in future (which happens to be now).
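For reference, a minimal sketch of how outlier detection is configured on a DestinationRule today (the service name `reviews` and all threshold values are illustrative, not prescriptive):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5   # despite the generic name, only 502/503/504 are counted
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 10
```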
I understand the motivation now however I think there's another valid scenario that is not supported.
In our microservice infrastructure we were relying on Hystrix as an application-level circuit breaker. When downstream services are crashing (500 errors) we don't want to keep sending traffic to them, and Hystrix removed them temporarily from the pool.
Right now with Istio, these failing services are being invoked despite being effectively down. That is to say, we are protected from latency/overloading but not from failure. We could always translate 500 errors to 502/503/504 at the microservices, but that seems like a workaround.
Does this make sense?
Envoy explicitly distinguishes between consecutive gateway failures (502, 503, 504) and consecutive errors (any 5xx), and supports both of them.
https://www.envoyproxy.io/docs/envoy/v1.5.0/intro/arch_overview/outlier
Istio’s consecutiveErrors field is therefore misleading; shouldn’t it be aligned with Envoy’s terminology? Also, why not allow both kinds of configuration?
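For comparison, Envoy's cluster config exposes the two counters as separate fields; a rough sketch based on the linked docs (values illustrative):

```yaml
outlier_detection:
  consecutive_5xx: 5               # consecutive responses of any 5xx type
  consecutive_gateway_failure: 5   # consecutive 502/503/504 responses only
  interval: 10s
  base_ejection_time: 30s
```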
I think the 500 status code should be included in the circuit breaking; resilience4j and Lagom support this too.
@rshriram any thoughts ^?
So there are really 2 issues here.

1. The documentation is currently wrong. The `consecutiveErrors` field is documented correctly, but the overview/example is wrong and needs to be fixed.
2. Should Istio support both kinds of outlier detection that Envoy supports (5xx and gateway failures)? If so, would it be sufficient to add a boolean (e.g., `allFailures`) to choose whether `consecutiveErrors` maps to all 5xx errors or just the gateway ones?
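As a sketch, the boolean approach might surface in a DestinationRule like this (`allFailures` is only the hypothetical name from the comment above; values illustrative):

```yaml
outlierDetection:
  consecutiveErrors: 5
  allFailures: true   # hypothetical toggle: count all 5xx, not just 502/503/504
  interval: 10s
```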
@rshriram ^^^^
wow.. this has come full circle. So the options here are to change the semantics of consecutiveErrors once again [but then people like @arnecls will want to set this up only for 502/503, etc.], or, as @frankbu says, we could add a boolean like `treat500ErrorsAsFailure` so that people can decide whether to treat application errors (HTTP 500) as a sign of failure.
Just wanted to bump this and give a +1 to adding a boolean to support all 5XX-level errors instead of just 502-504 gateway errors.
My reasoning for wanting this is when dealing with third parties, you really can't control what kinds of failure codes they return. If a third party returns 500s when it's down and not 503s, yes that's semantically incorrect, but that's not something we control as integrators of an API.
Also, I think it's generally valid to want to treat multiple consecutive 500s as a good sign of "this host isn't healthy and we shouldn't route traffic to it".
Would love to see this issue get some more love!
+1 on supporting consecutive 5xx errors.
another option for supporting this instead of adding a boolean: we could add another field, so that

- `consecutiveErrors` = consecutive gateway errors
- `consecutive5xxErrors` = consecutive 5xx errors

the benefit of this is that we could set separate thresholds for the two, just like Envoy
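A sketch of what the two-field version might look like in a DestinationRule, assuming the field names proposed above (values illustrative):

```yaml
outlierDetection:
  consecutiveErrors: 3      # tighter threshold for gateway errors (502/503/504)
  consecutive5xxErrors: 10  # looser, independent threshold for any 5xx
  interval: 10s
```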
We have two implementation options to support outlier detection for consecutive 5xx errors (500-599) instead of just gateway errors (502/503/504):

Option 1: Add a boolean field `all_server_errors`. By default this is false, which means only 502, 503 and 504 return codes are considered errors. When it is set to true, all 5xx return codes are considered errors.

Option 2: Add a new field `consecutive_server_errors` that configures the number of consecutive server errors (any 5xx) before an endpoint is ejected. If the value is not set (the default), endpoints are not ejected for non-gateway errors (e.g. 500, 505). Ejection could still be performed on gateway errors (502, 503, 504) using `consecutive_errors`.

Option 1 is clear: users can configure exactly one way to eject endpoints based on upstream errors.

Option 2 provides a more flexible API: users could configure more stringent outlier detection for gateway errors than for other server errors. This comes at the cost of a more complex API: `consecutive_errors` and `consecutive_server_errors` will be confusing for new users to configure, especially because `consecutive_errors` actually means `consecutive_gateway_errors`.
I’m not sure how useful it is to be able to set the two thresholds separately. Which option do you prefer? @rshriram @frankbu @howardjohn
@yingzhuivy I think Option 2 (the more flexible API) would be my choice. I think we can avoid confusing users by picking a better name for the new field. My suggestion for the 2 fields:

- `consecutive_errors`: consecutive gateway errors (502, 503, 504), as currently defined
- `consecutive_5xx`: consecutive errors of any 5xx type

@frankbu sounds good to me, although I would think `consecutive_errors` means ANY errors. But it's not worth changing the API to make that clearer; instead we should just document it clearly.
The current documentation for consecutiveErrors has no description of how a zero value is interpreted. Looking at the implementation, we do not distinguish zero from unset, and the consecutiveErrors field has the following semantics:
For the new consecutive5xxErrors field we could follow the same pattern:
The problem with this design is that there is no way to turn off consecutive gateway errors.
Some options to consider:

Option 1: Distinguish between 0 and unset. 0 means disable the feature, while unset means use the default.
consecutiveErrors:
consecutive5xxErrors:
This is the most straightforward and ideal approach, but it requires changing the existing types to google.protobuf.UInt32Value. Alternatively, we could handle detecting unset values during yaml parsing. If this is not feasible, we could consider:
Option 2: Allow negative values, which represent turning the feature off; 0 still means use the default.
consecutiveErrors:
consecutive5xxErrors:
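To make the two semantics concrete, a hedged sketch (field names per the discussion above, values illustrative):

```yaml
# Option 1: explicit 0 disables the check; an absent field means "use the default"
outlierDetection:
  consecutiveErrors: 0      # gateway-error ejection off
  consecutive5xxErrors: 7   # eject after 7 consecutive 5xx
---
# Option 2: a negative value disables the check; 0 still means "use the default"
outlierDetection:
  consecutiveErrors: -1     # gateway-error ejection off
  consecutive5xxErrors: 7
```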
Option 3: Keep the existing API the same, and disable consecutive gateway error outlier detection when either of the following is true:
(If a user wants to enable consecutive 5xx only, they would need to set consecutiveErrors equal to consecutive5xxErrors.) This option is too complex and hard to understand from the user’s perspective, and there is still no way to disable both at the same time.
Thoughts? @frankbu @howardjohn
Also, with any of the options above, we need to have clear documentation, including the interpretation of zero/negative values.
Maybe this is a good excuse to deprecate the `consecutive_errors` field, and add two new fields `consecutive_gateway_errors` and `consecutive_5xx_errors` with Option 1 semantics.

```proto
// $hide_from_docs
int32 consecutive_errors = 1 [deprecated=true];
```
ok, what's the next step? Should I submit a PR deprecating consecutive_errors and adding the two new fields? Is there a deprecation process I should follow?
@yingzhuivy Yes, just update the consecutive_errors field as I've shown above (mark as deprecated and hide from docs) and then add the two new fields with proper doc for how they work. Then you need to add the new code, but make sure to leave the old field working as before (the validator should complain if both fields are set).
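Assuming the deprecation route and the field names proposed above, the end state might look like this in a DestinationRule (camelCase is the yaml form of the proto names; values illustrative):

```yaml
trafficPolicy:
  outlierDetection:
    consecutiveGatewayErrors: 0   # explicit 0 disables gateway-error ejection
    consecutive5xxErrors: 7       # eject after 7 consecutive 5xx responses
    interval: 10s
    baseEjectionTime: 30s
```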
How about this:

```yaml
monitorErrors:
  gatewayErrors: ..
  all5xxErrors: ..
```
@rshriram I think we are not expecting any other types of monitored errors, right? If that's the case, wouldn’t it be simpler to put the fields directly at the top level instead of combining them into a struct?
cc @jhahn21 @brianwolfe
From someone new to Istio (and I have no idea about the complexity involved),

```yaml
outlierDetection:
  httpStatusCodes: [500, 501, 502, 503, 504] # Default: [502, 503, 504]
  maxEjectionPercent: 100
  consecutiveErrors: 1
  interval: 5s
```

would be both simple to understand and easy to use (and would therefore cover all scenarios).
If, for some reason, people want to use `403` or `401` (auth service is failing?) or any other scenario, they could use it as they please.
But I guess this would mean a change in EnvoyProxy first to open that possibility.
[Edit] After thinking back about it, I realize this is not helpful while Envoy does not implement it. Sorry for wasting your time. I leave it here if any Envoy contributor comes by.
https://istio.io/docs/reference/config/networking/v1alpha3/destination-rule/#OutlierDetection
Under consecutiveErrors:

> 502, 503, and 504s only, 500 codes will not trigger the outlier detection

But in the docs above:

> For HTTP services, hosts that continually return 5xx errors for API calls are ejected from the pool for a pre-defined period of time.
So it sounds like the docs are incorrect here. Looking at the code and https://github.com/istio/api/pull/617, it will only work for 502/503/504.