Open 0xced opened 5 years ago
Thank you, Cédric. That is indeed interesting. It sounds like the definition of "warning" for Nagios and in our spec are very different. The healthcheck spec considers "warning" to be "still functional but approaching dangerous zone" whereas Nagios clearly defines it as "was able to reach but basically unusable".
To be honest, I am not sure what the benefit is of having different levels of "unusable", since the reaction is the same in each case. Having different levels of "usable", however, feels useful, since you could start executing mitigating routines (e.g. adding more capacity) to keep the system from going over the threshold of usefulness. Either way, yes - I would be curious to see if we have more examples of prior art on this.
Thanks for sharing
What bothers me is that there's no difference between pass and warn at the HTTP status code level as both MUST return HTTP status in the 2xx-3xx range. Imagine you're implementing a dashboard to have an overview of your systems where pass translates to ✅, warn translates to ⚠ and fail translates to ❌. It would be impossible to implement with a HEAD request since pass and warn would be indistinguishable. I guess that it would not be an issue if HEAD requests are not considered.
Regarding your use case: I don't believe HTTP status codes provide enough granularity to distinguish between "pass" and "warn"; however, we can still support HEAD by defining an HTTP header that is synonymous with the top-level "status". E.g., a health-status HTTP header with the same exact semantics as the top-level status would allow distinguishing between pass and warn states.
request:
HEAD /health HTTP/1.1
Host: api.example.org
response:
HTTP/1.1 200 OK
Last-Modified: Mon, 7 Dec 2015 15:29:14 GMT
Health-Status: warn
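As a rough sketch of how a dashboard client might interpret such a HEAD response: the function below maps a status code and headers to pass/warn/fail. The Health-Status header is only the proposal made in this thread, not part of any published spec, and the function name, the icon table, and the choice to treat a missing or unrecognized header value as "pass" are assumptions made for illustration.

```python
from typing import Mapping

# Dashboard icons as described earlier in the thread.
ICONS = {"pass": "✅", "warn": "⚠", "fail": "❌"}


def classify(status_code: int, headers: Mapping[str, str]) -> str:
    """Return "pass", "warn", or "fail" for a HEAD /health response.

    Hypothetical helper: assumes the proposed Health-Status header.
    """
    # Anything outside 2xx-3xx is a failure regardless of headers.
    if not 200 <= status_code < 400:
        return "fail"
    # HTTP header names are case-insensitive, so normalize them.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("health-status", "pass").strip().lower()
    # Assumption: an unrecognized value falls back to "pass".
    return value if value in ICONS else "pass"
```

With the response above, `classify(200, {"Health-Status": "warn"})` yields "warn", which a dashboard could render via `ICONS`.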
The problem with such an approach is that responding with "warn" is usually unhelpful without additional information. Healthchecks are less for creating pretty dashboards and more for self-healing automation. Unless the observing system knows what caused the "warn", it cannot kick off mitigating routines. Indeed, the current RFC draft says that more information SHOULD be provided during warn:
and additional information SHOULD be provided, utilizing optional fields of the response.
Additional information will, in general, be too complex to encode in headers (or we will end up creating a parallel spec of the payload for the headers, complicating the client software's work). The only way out of this seems to be requiring a Status-URI message header when Health-Status is "warn", which should point to the GET version of the same health endpoint to retrieve more information.
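To make the extra round trip concrete, here is a minimal sketch of the client-side decision. Both Health-Status and Status-URI are only the headers floated in this thread, and the function name and the relative-URI resolution are assumptions.

```python
from typing import Mapping, Optional
from urllib.parse import urljoin


def details_url(endpoint_url: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return the URL to GET for details, or None if no follow-up is needed.

    Hypothetical helper: assumes the proposed Health-Status and
    Status-URI headers from this discussion.
    """
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    # Only a "warn" status calls for fetching more information.
    if lowered.get("health-status", "").lower() != "warn":
        return None
    status_uri = lowered.get("status-uri")
    if not status_uri:
        return None
    # Resolve a relative Status-URI against the endpoint that was probed.
    return urljoin(endpoint_url, status_uri)
```

Every "warn" then costs a HEAD plus a GET, which is the overhead being weighed here.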
The question is - is this worth it? Why not just always use GET? What are current examples of infrastructure middleware that relies on HEAD instead of GET?
Have you considered including the Warning header to better contextualize the "warn" status? It is described in RFC-7234 - https://tools.ietf.org/html/rfc7234#section-5.5 . I'm not sure how widely adopted the header is, but it seems it could fit.
If you consider the meaning of the 3xx and 4xx status codes in the HTTP specification, none of them really apply to a warning. I'm not entirely sure why Nagios behaves that way, but neither a "Redirection" nor a "Client Error" should indicate that the server has a problem. That pretty much leaves the 2xx and 5xx status codes for this discussion (none of the well-known but unofficial status codes really work either).
Since it's a "warn" state, I'd be in favor of leaving the specification as-is (200). I think it would also be great to add a header so that a HEAD request could also be used. One alternative is to create a new unofficial 5xx status code but I'd recommend against that.
Since one goal of this specification is to be compatible with Kubernetes liveness and readiness probes, I think the following paragraph from the Kubernetes document is important:
Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.
For a readiness probe, we don't want the system to abort a starting container because of a "warn" state and for a liveness probe we don't want the system to restart a functional (but probably degraded) container. In many of our systems, a warning state happens because one of its sub-services is down. This degrades the functionality of the service but doesn't necessarily degrade the operation of the service.
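The Kubernetes rule quoted above is easy to state in code. In this sketch the 200 ≤ code < 400 check comes from the quoted paragraph, while the mapping from health status to HTTP code (in particular 503 for "fail") and all names are assumptions for illustration only.

```python
def probe_succeeds(status_code: int) -> bool:
    """Kubernetes HTTP probe rule as quoted: 200 <= code < 400 is success."""
    return 200 <= status_code < 400


# Hypothetical mapping; 503 for "fail" is an assumption, not from the draft.
STATUS_TO_HTTP = {"pass": 200, "warn": 200, "fail": 503}


def container_kept(health_status: str) -> bool:
    """True if a probe would leave the container alone for this status."""
    return probe_succeeds(STATUS_TO_HTTP[health_status])
```

Under this mapping a "warn" container is neither aborted nor restarted, whereas mapping "warn" to any 4xx code would fail the probe.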
My 2 cents: Sending 4xx (which roughly translates to "client error, bad request of some sort") for warn sounds nuts to me. That's an abuse of HTTP and breaks all sorts of things. /health endpoints should use HTTP status codes as per the various HTTP specifications and not break HTTP.
Here are a few things that would break if 4xx codes were used to indicate warnings:
curl or wget
I should have added this to my note above but in general it's expected that 4xx means the client should fix their request and try again. (Yes, there are a few exceptions like 403).
I think we have a wide consensus that "warn" should be HTTP 2xx and not 4xx.
@smoyer64, regarding the header for warn I like @RRunner1337's suggestion that RFC-7234 - https://tools.ietf.org/html/rfc7234#section-5.5 should be used.
If the "Warning" header from RFC-7234 is to be used, then I would propose changing the 4xx response code to 2xx; in case of a "warn" status, the "Warning" header MUST then be present among the returned headers so that clients can understand the response as a warning.
Additionally, I would propose fixing the 2xx response code to a single specified one, for example 200, which fits the GET and HEAD verbs per the HTTP RFC. In my opinion, letting the implementor use any 2xx code could provoke needless religious debates (REST vs WebService), and server and client implementors would need to implement "business logic" to determine the status of the response, which could cause interoperability issues.
I would also fix the 4xx-5xx error status code to a single specified one ... 418 maybe :)
From the current draft:
Just adding this to the discussion: the Nagios check_http plugin doesn't work this way. For a warning to be triggered, the HTTP status code must be in the 4xx range. A Nagios warning is defined like this:
I think it would be interesting to see what other plugins do and also to have a look at other monitoring systems.