Open JamesUoM opened 5 years ago
No, the relationship is not 1:1. The spec assumes that certain ranges correspond to certain status values, but no 1:1 mapping exists. Per the specification:
The value of the status field is case-insensitive and is tightly related with the HTTP response code returned by the health endpoint. For “pass” status, HTTP response code in the 2xx-3xx range MUST be used. For “fail” status, HTTP response code in the 4xx-5xx range MUST be used. In case of the “warn” status, endpoints MUST return HTTP status in the 2xx-3xx range, and additional information SHOULD be provided, utilizing optional fields of the response.
Your example "warn = 307, a sub-service returns an error state" is possible because a sub-service erroring out doesn't automatically mean the main service is also erroring. Ideally, with proper circuit-breaking, a sub-service erroring out should at worst put the main service in a "warn" state while it remains reasonably functional.
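The range rule in the specification text quoted above can be sketched as a small consistency check. This is only an illustration: the names `ALLOWED_RANGES` and `is_consistent` are hypothetical, only the three status values named in the quoted passage are covered, and raising on an unknown status is an assumption of this sketch rather than something the spec prescribes.

```python
# Ranges copied from the quoted spec passage: pass/warn -> 2xx-3xx, fail -> 4xx-5xx.
# The names below are hypothetical and not part of the spec.
ALLOWED_RANGES = {
    "pass": range(200, 400),
    "warn": range(200, 400),
    "fail": range(400, 600),
}


def is_consistent(status: str, http_code: int) -> bool:
    """Return True if http_code lies in the range required for status."""
    # The status field is case-insensitive per the quoted passage.
    allowed = ALLOWED_RANGES.get(status.lower())
    if allowed is None:
        # Assumption of this sketch: unknown status values are rejected.
        raise ValueError(f"unknown status: {status!r}")
    return http_code in allowed
```

Under this check, "warn" with 307 is formally consistent with the range rule, which is why the mapping is a question of ranges and not a 1:1 table.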
Some of the shown mappings have serious issues. And the main reason lies in a misunderstanding about what level of information HTTP status codes represent. HTTP status codes mostly represent "wire". But when transmitting health status information over HTTP, one has to treat HTTP as OSI Layer 5 Session, not OSI Layer 7 Application. Trying to convey information about the content of the health resource instead of the wire of the health resource is conflating things in HTTP and may break clients.
Here are the details:
200 does not necessarily mean pass. 200 means that the resource was found and works. The wire (to /health and back again) works. It says nothing about the downstream.
302 is "Found" (previously "Moved Temporarily"), which should not really be used anymore by servers; they should explicitly use 303 or 307. However, a client library like java.net.URL, depending on its configuration, and web browsers will expect that a response with status 302, 303, or 307 has a Location: response header field with a URL to which they will automatically perform the request again. And health endpoints should be compatible with standard HTTP client libraries and web browsers.
What applies to 302 also applies to 307.
404 as a response to GET /health means that /health has no mapped resource on the server. Whether that can be mapped to "running but unavailable" is doubtful. The path to health endpoints is explicitly not standardized. There may be a convention to often expect it at /health, but aggregation servers without an explicit health implementation may just forward the health endpoints of their downstream servers and, because there are many of them, must use different endpoints.
503 means dead, but careful: it may also mean that the load balancer is dead while the service is working. It definitely means that something is fishy and one of the elements in the wire is broken on the level of HTTP, but we can't conclude anything else from it. In case a server which provides health information replies with 503, it should mean that the server currently cannot provide health information at all. It would be surprising to see this generated on the level of a health endpoint implementation; this would normally be generated by the server library itself or the router.
I hope that this explains why I think that mapping health status to response codes is not a good idea.
@christianhujer you are disagreeing with the wrong part of the specification. It's not just about some of the values of the mappings: you are questioning the entire approach of using the response code of a single endpoint as an indicator of health for the entire API (several endpoints forming a logical "API" that the health endpoint represents). This is explicitly documented in the RFC:
A health endpoint is only meaningful in the context of the component it indicates the health of. It has no other meaning or purpose. As such, its health is a conduit to the health of the component. Clients SHOULD assume that the HTTP response code returned by the health endpoint is applicable to the entire component (e.g. a larger API or a microservice). This is compatible with the behavior that current infrastructural tooling expects: load-balancers, service discoveries and others, utilizing health-checks.
Yes, it is true that status codes in HTTP per se represent information about the specific URI and don't act as stand-ins for other resources at other URIs. However (!), and this is critical: doing so in the case of health endpoints has been a common practice for decades. For the sake of acknowledging reality and backwards compatibility with the existing tooling, the responsible thing for this RFC is to follow the established pattern.
As a matter of fact, acknowledging this existing pattern is the only reason this RFC even discusses HTTP response codes. In its pure form this RFC is about message format and has nothing to do with HTTP or response codes (it's totally fine to use it with TCP, for instance). But when you have clearly existing industry behavior, we felt it would have been a mistake of omission to not acknowledge it in this RFC. Practicality and backwards compatibility > theoretical purity.
I think the spec is fine. It's fine to infer from GET /health ⇒ 200 that the entire component is fine, and it's also fine that GET /health ⇒ 500 means that there are issues and that the body has to be inspected to find out on which level the issues are. After all, /health is a meta-resource that provides information about an entire component; thus, some inference from status codes about the entire component is valid.
It is just the reuse of 302 or 307 (redirections) as warnings that will break HTTP clients and spell all sorts of trouble.
OK
Not sure what is special about [302,307]. In this spec "warning" is defined as "pass, but things are getting worse so pay attention", so http response code-wise "pass" and "warning" are the same thing, only in the message do we differentiate.
Given that, if 302,307 are ok for "pass", why are you concerned for "warning"? Warning is not defined as a "light error".
I am not sure I follow where 302,307 is a concern.
Thank you.
From what I understand of this conversation, it seems that the concern being raised is whether or not this spec implies that 302 and 307 have additional semantics which an implementation needs to take into account. If an HTTP client used in an implementation is configured to automatically follow redirects, the health-check implementation might not get an opportunity to respond to a 302 or 307 and interpret it according to this specification.
I guess the thing which may be missing is how the usual semantics of redirects from a /health endpoint would impact the response code semantics defined in this spec.
If GET /health responds with 302 or 307, client libraries will throw errors or exceptions. The expectation according to HTTP for a response that has status code 302 or 307 is that the response includes a Location: header with a new URL, and that the client is to repeat the request to that URL. A lot of client libraries do this by default, without the programmer having to take care of it, because redirects are so common.
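To make the automatic-redirect behaviour concrete, here is a minimal sketch using only the Python standard library. It starts a throwaway local server whose /health answers 307 with a Location: header, then fetches it with urllib's default redirect handling. The handler class, the function name, and the /health_details path are hypothetical and exist only for this demonstration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHealthHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /health answers 307, any other path answers 200."""

    def do_GET(self):
        if self.path == "/health":
            self.send_response(307)
            self.send_header("Location", "/health_details")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b'{"status": "warn"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/health+json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_health_with_default_redirects():
    """Return (status, final_url) as seen by a client that follows redirects."""
    server = HTTPServer(("127.0.0.1", 0), DemoHealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/health"
        # ProxyHandler({}) only disables environment proxies so the demo stays
        # local; the redirect handler is part of urllib's default opener.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.status, response.geturl()
    finally:
        server.shutdown()
        server.server_close()
```

The caller only ever sees the final 200 from the redirect target, so a monitor built on such a client has no chance to interpret the 307 as "warn".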
In general, any "re-interpretation" of existing response codes by a new spec that is incompatible with the actual specification in HTTP/1.1 or HTTP/2 risks breaking existing user agents and makes the lives of programmers who would expect that they can simply use normal HTTP clients to access /health endpoints for writing health aggregators, health monitors and so on unnecessarily more difficult.
What exactly makes "302 Found" or "307 Temporary Redirect" good candidates to indicate status: warn?
As a resolution, this spec could just explicitly require that all responses should otherwise comply with the HTTP specifications (this spec may already do this; I have not confirmed). This would imply the Location header would need to be included in case the response code is 302 or 307.
Thanks for all the feedback, but I feel the conversation has been distracted by the example status codes I used (granted, that was my fault). I'm still unclear how the codes map to the textual statuses. Maybe further examples will clarify what I mean:
1) "Healthy true or false"
pass : 20x
warn : 20x see json for details
fail : 50x
2) distinct codes per status
pass : 20x
warn : 30x see json for details
fail : 50x
3) codes per degree of warning, fail
pass : 20x
warn : 30x something might be wrong
warn : 30x+1 something a bit worse
fail : 40x up but can't respond
fail : 50x not working at all
IMHO clearly not 2 or 3. That's not how HTTP works, and the spec has to be compatible with HTTP.
307 Temporary Redirect seems just fine provided a redirect location is actually used that returns something meaningful. I don't see any conflict in letter or spirit to the HTTP spec.
We do redirects precisely because the server is asked for resources it can't faithfully provide per how it defines those resources. The server can decide which criteria justify the existence of the /health resource. When those criteria aren't met, the server can say "well you wanted the good news, but I'm redirecting you to the not so good news". Redirect says the resource requested isn't available, but a viable substitute is.
For example, GET /health might redirect to /health_warnings. A client who wishes to know what those health_warnings are could then follow the redirect and get an application/health+json document. Since detailing the degraded health might be expensive, a polite client that doesn't need that information might decide not to follow the redirect.
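A "polite" client of the kind described here has to avoid automatic redirect following. One way to sketch that in Python is `http.client`, which never follows redirects on its own. The function names and the wording of the summaries are hypothetical, and treating a redirect as "degraded" is this comment's proposal, not something the spec defines.

```python
import http.client
from typing import Optional, Tuple


def probe_health(host: str, port: int, path: str = "/health") -> Tuple[int, Optional[str]]:
    """GET the health endpoint once and return (status, Location header)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # drain the body before closing the connection
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def summarize(status: int, location: Optional[str]) -> str:
    """Interpret a probe result under the redirect-means-degraded proposal."""
    if status in (302, 303, 307) and location:
        # The client may follow the redirect later if it wants the details.
        return f"degraded, details at {location}"
    if 200 <= status < 300:
        return "good"
    return f"unexpected response {status}"
```

A monitor could call `summarize(*probe_health("localhost", 8080))` and only request the redirect target when it actually needs the warning details; the host and port here are placeholders.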
202 Accepted is also interesting if you want to differentiate from 200 OK for good health. Differentiating by response code makes it really convenient to use monitoring tools to compute "good" state availability and not just the "up" state availability.
We've had this discussion in our team. In a nutshell, we're seeing /health as more of a business layer endpoint.
We do not agree with fail being in the 4xx-5xx range because:
4xx errors indicate client side errors => out of scope.
5xx: we found that no error is really relevant to a fail, and it could be mistaken for a load balancer or server down issue.
For us, 200 seemed like a better candidate for all statuses:
200 indicates that our health check endpoint is up and has served our request correctly.
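If every status is served with 200, the client has to read the status out of the response body. A minimal sketch of that, assuming the body is a JSON object with a "status" field as in the health format under discussion; the function names are hypothetical, and counting "warn" as usable is an assumption of this sketch.

```python
import json


def status_from_body(body: bytes) -> str:
    """Extract the lower-cased status field from a health response body."""
    document = json.loads(body)
    if not isinstance(document, dict) or not isinstance(document.get("status"), str):
        raise ValueError("expected a JSON object with a string 'status' field")
    # The spec treats the status value as case-insensitive.
    return document["status"].lower()


def component_is_usable(http_code: int, body: bytes) -> bool:
    """Under the 200-for-everything scheme, only the body carries the health."""
    if http_code != 200:
        # Anything else means the health endpoint itself did not answer properly.
        return False
    return status_from_body(body) in ("pass", "warn")
```

The trade-off is that tools which can only look at the response code, such as many load balancers, can no longer tell "fail" from "pass" under this scheme.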
I'm currently writing a health check for an application, but I'm a little unclear on the precise relationship between statuses and response codes. The relationship is described as "tightly coupled" but with no further explanation or examples.
Do statuses have a 1:1 relationship to HTTP response codes? I ask as I have cases where it may be useful to have finer grain responses.
pass = 200
warn = 302, a sub-service returns a warning state
warn = 307, a sub-service returns an error state
fail = 404, running but unavailable
fail = 503, dead
In different scenarios, we may want a load balancer to bracket different response code ranges as healthy, e.g. [200:302] or [200:307]. Assume the load balancer can only monitor the response code. I know we could always write a customizable filter in front of the health check that can decide what to tell the load balancer based on the JSON.
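The bracketing idea can be written down as a one-line range test. This sketch only illustrates the question being asked, using the example codes from the list above; the function name and defaults are hypothetical and do not describe any particular load balancer's configuration.

```python
def in_healthy_bracket(http_code: int, low: int = 200, high: int = 302) -> bool:
    """Return True if the response code falls inside the inclusive bracket."""
    return low <= http_code <= high


# With the example mapping, the bracket decides how strict the check is:
strict = in_healthy_bracket(307, 200, 302)   # 307 (sub-service error) is unhealthy
lenient = in_healthy_bracket(307, 200, 307)  # 307 is still treated as healthy
```

Note that such a bracket also accepts every code in between, for example 204 or 301, whether or not the health endpoint ever returns them.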
Thanks for the draft, it has been most helpful and timely.