REST API response status codes?

cmgrote commented 4 years ago

Per discussion on #2353:

... I need to comment on the 200 response codes. In the REST protocol it is only possible to return a response object if the status code is 200. If we returned, say 500 then it would not be possible to return the error message.

Is this a limitation on our side, or a difference in a standard that defines how REST API status codes differ from the HTTP protocol itself? I can't see anything in the HTTP standard restricting message bodies (response objects?) to status code 200. The overview of status codes (https://tools.ietf.org/html/rfc7231#section-6) explains:

"HTTP clients are not required to understand the meaning of all registered status codes, though such understanding is obviously desirable. However, a client MUST understand the class of any status code, as indicated by the first digit... For example, if an unrecognized status code of 471 is received by a client, the client can assume that there was something wrong with its request and treat the response as if it had received a 400 (Bad Request) status code. The response message will usually contain a representation that explains the status."

(Per the last sentence, a non-200 code of 471 has a response message that provides further explanation.) Regarding a 500 response, specifically (https://tools.ietf.org/html/rfc7231#section-6.6):

"Except when responding to a HEAD request, the server SHOULD send a representation containing an explanation of the error situation, and whether it is a temporary or permanent condition. A user agent SHOULD display any included representation to the user."

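To make the class rule quoted from section 6 concrete, here is a minimal Python sketch of how a client might fall back to the x00 code of a class when it meets a code it does not recognise, such as 471. The function name `effective_status` is made up for illustration; only `http.HTTPStatus` comes from the standard library, and "recognised" here simply means "known to that enum".

```python
from http import HTTPStatus


def effective_status(code: int) -> int:
    """Return the code itself if HTTPStatus knows it, else the x00 code of its class."""
    try:
        return HTTPStatus(code).value
    except ValueError:
        # Unrecognised code: fall back to the first-digit class (e.g. 471 -> 400).
        return (code // 100) * 100
```

Either way the client still receives whatever body came with the response; the fallback only affects how the status is interpreted.
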
Using 200 is not breaking the protocol because it does not mean "success" as many people think - it means there is further information in the response object.

Again, is this something specific to REST APIs vs the HTTP protocol itself? From the standard (https://tools.ietf.org/html/rfc7231#section-6.3), it feels pretty strongly worded to me that 200 is defined as success:

"Successful 2xx: The 2xx (Successful) class of status code indicates that the client's request was successfully received, understood, and accepted. The 200 (OK) status code indicates that the request has succeeded."

It's confusing to me that we would specify different HTTP status codes (like 500, 404, etc) in the definition of errors, but not actually use them as the response status code, which per the standard references above seems like it should be possible. I suspect it will be confusing to users as well, who upon seeing a 200 response status may well simply ignore the body of the response.

Suggestion:

- if we can, we change our approach to make use of the HTTP status codes and record a message body in the response
- if not, we clearly document somewhere that the 200 codes we use in responses do not necessarily mean success.

cmgrote commented 4 years ago

On the call we questioned whether these response bodies could also apply to DELETEs. It looks like they can (https://tools.ietf.org/html/rfc7231#section-4.3.5):

"If a DELETE method is successfully applied, the origin server SHOULD send a 202 (Accepted) status code if the action will likely succeed but has not yet been enacted, a 204 (No Content) status code if the action has been enacted and no further information is to be supplied, or a 200 (OK) status code if the action has been enacted and the response message includes a representation describing the status."

(Implying that there are response messages available in addition to the status code, even on deletes.)
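
As a rough illustration of the three cases in that passage, a server could pick the DELETE status along these lines. This is only a sketch: the function and parameter names are invented here, and only the 202 / 204 / 200 mapping comes from the RFC text quoted above.

```python
def delete_status(enacted: bool, has_status_body: bool) -> int:
    """Choose a status code for a successfully applied DELETE (RFC 7231, section 4.3.5)."""
    if not enacted:
        # Accepted: the action will likely succeed but has not been carried out yet.
        return 202
    # Enacted: 200 if a representation describing the status is returned, else 204.
    return 200 if has_status_body else 204
```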

cmgrote commented 4 years ago

(I've linked to the latest HTTP/1.1 spec above, but even going back to the very first HTTP/1.0 spec from 1992 there were response bodies possible against non-200 status codes: https://www.w3.org/Protocols/HTTP/HTRESP.html)

cmgrote commented 4 years ago

Happy to look into rolling this out; on the call someone mentioned that in the past we were sending non-200 responses -- any quick pointers on where / how we were doing that? Just looking for where to start with the response status codes (ie. where they're set / returned by the REST API itself), and in particular in a non-Spring-specific manner to remain portable (?)

cmgrote commented 4 years ago

It appears that the way to handle these non-200 responses in Spring is via exceptions themselves... However, we of course don't want to make our exceptions (or exception-handling) Spring-specific, so I'm thinking of the following approach:

- create a new common-services module specifically for Spring (common-services/spring-services)
- inside the new module I'll define a new Spring-specific exception and exception handler (using Spring's @ControllerAdvice -- which is how I can control the HTTP status returned for any errors)
- I'll also add a simple utility class & method to check any generic response objects we have for an error, and if found raise this new exception (to be handled by the exception handler) with the HTTP status code as set in the response itself

This just means the following changes to existing modules where Spring resources are defined:

- include this new common service module as a pom dependency
- instead of directly returning the response for each resource, pass the response to the utility method that will check for any error and raise a Spring-specific exception if any is found

... and our existing exception handling and recording (outside of Spring) should remain entirely untouched.
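
To show the shape of that check-and-raise pattern without tying it to Spring, here is a small Python sketch. Everything in it is illustrative: `RestErrorException`, `raise_if_error` and `handle` are invented names, and the `relatedHTTPCode` field is an assumption about how a response object records its status. In the real module the `handle` role would be played by the @ControllerAdvice exception handler described above.

```python
from typing import Any, Callable, Dict, Tuple


class RestErrorException(Exception):
    """Hypothetical exception carrying the HTTP status recorded in a response object."""

    def __init__(self, status: int, body: Dict[str, Any]):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.body = body


def raise_if_error(response: Dict[str, Any]) -> Dict[str, Any]:
    """Utility method: pass the response through, or raise if it records an error."""
    status = int(response.get("relatedHTTPCode", 200))  # assumed field name
    if status >= 400:
        raise RestErrorException(status, response)
    return response


def handle(resource_call: Callable[[], Dict[str, Any]]) -> Tuple[int, Dict[str, Any]]:
    """Stand-in for the framework layer: map the exception to (HTTP status, body)."""
    try:
        return 200, raise_if_error(resource_call())
    except RestErrorException as error:
        # The error body is still returned, just under a non-200 status.
        return error.status, error.body
```

The existing response object is returned unchanged in both cases; only the status line differs.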

grahamwallis commented 4 years ago

To my mind there are 3 options:

1. stay as we are - i.e. continue to use 200 to convey both success or an error with a response in the body
2. change to only using 200 for success - and not worry about breaking compatibility - although I would generally not approve of this, I think it is a valid option given the current set of adopters
3. change to only using 200 for success - and worry about breaking compatibility - I think this would a) be tons of work, b) complicate the APIs and c) is not necessary given the current set of adopters

So I think it comes down to a choice between 1 and 2 - meaning that in my opinion it would be OK to change and break compatibility if we want to; and the decision is about how we want the API to look.

In defence of option 1:

- Egeria is not alone - e.g. Facebook use 200 for everything (i.e. including error responses)
- The status code can be considered to form part of the API or considered not to. Either is valid (see the answer from "Larry K" (located just below all the flowchart pictures) in the second ref below)
- The use of 200 can be considered to mean that the transport layer was successful, with the error response conveying the status of the higher layer(s), which has nothing to do with HTTP
- This is known to work, even with outdated gateways that may not tolerate response bodies on responses with error codes

In defence of option 2:

- This is probably more intuitive for today's programmers and API users
- It is now more likely (with modern frameworks) that the client can get access to the status code (this was not always the case)

The one thing that is critical with EITHER approach is that the content of the response body needs to accurately and adequately convey the outcome of the operation including the suggested user-action in a human-intelligible form.

My current preference would be to initially talk to people in industry sectors where there may be old gateways, to assess the risk of option 2, then if the risk is deemed low, adopt option 2.

Some useful references:

On use of status codes relative to success/failure of the operation: https://blog.restcase.com/rest-api-error-codes-101/ (Note that Facebook use 200 whatever the outcome; but the author's recommendation is counter to what Facebook do.)

Useful thread on use of HTTP status codes with regard to success/failure: https://stackoverflow.com/questions/942951/rest-api-error-return-good-practices (See in particular the answer from "Larry K", located just below all the flowchart pictures.)

On standardisation of error responses: https://dzone.com/articles/rest-api-error-handling-problem-details-response Quite interesting: RFC7807

On documentation of REST APIs: https://www.openapis.org Linux Foundation OAS
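
For a flavour of what RFC7807 standardises, here is a rough Python sketch of a "problem details" error body. The member names (`type`, `title`, `status`, `detail`, `instance`) and the `application/problem+json` media type are from the RFC; the function name and the example values are made up, and this is not a proposal for Egeria's actual error format.

```python
import json
from http import HTTPStatus
from typing import Dict, Tuple


def problem_response(status: int, detail: str, instance: str) -> Tuple[int, Dict[str, str], str]:
    """Build (status, headers, JSON body) for an RFC 7807 problem details response."""
    body = {
        "type": "about:blank",                # default type when no specific problem URI exists
        "title": HTTPStatus(status).phrase,   # with about:blank, title mirrors the status phrase
        "status": status,
        "detail": detail,                     # human-readable explanation of this occurrence
        "instance": instance,                 # URI reference identifying this occurrence
    }
    headers = {"Content-Type": "application/problem+json"}
    return status, headers, json.dumps(body)
```

Note that the status appears both on the HTTP response and inside the body, so the body remains usable even if an intermediary alters the status line.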

cmgrote commented 4 years ago

Hmmm... I thought we'd discussed this on the call and agreed that we should move ahead with option (2) unless there was a standard that dictated otherwise, but maybe I over-interpreted... As best I can tell the references linked are all opinion-based rather than a standard?

The only exception being the last one for OAS, which to me clearly advocates option 2 (http://spec.openapis.org/oas/v3.0.2#http-status-codes -- itself referring to the IETF / IANA standard):

"The HTTP Status Codes are used to indicate the status of the executed operation. The available status codes are defined by [RFC7231] and registered status codes are listed in the IANA Status Code Registry."

As mentioned earlier, response bodies have been allowed on non-200 responses since the early '90s, so we're talking ~30 years now... So I'm certainly keen that we address your point on industry sector input where there may be old gateways, but we'd presumably be talking about gateways that are > 30 years old (?)

cmgrote commented 4 years ago

Actually I'm not sure I understand this point about old gateways in general -- my understanding is that for them to not support response bodies they'd need to not support (predate?) the ~30-year old HTTP/1.0 spec, so I'm not sure I follow how they'd interpret / handle HTTP at all (status codes or response bodies) (?)

mandy-chessell commented 4 years ago

All options are valid by the spec, so we are just discussing style.

I was leaning towards option 2 but given the discussions above, I have changed my mind and think option 1 is better for us.

Option 1 retains backward compatibility for our APIs, ensures our errors will always get through and allows us to distinguish between requests that fail in the transport layer rather than in our server. This could be really useful in complex multi-cloud environments.

The more minor point is that it is the least amount of work and we are desperately overloaded.

mandy-chessell commented 4 years ago

I was just reading through the earlier comments and completely agree we need to document our REST API style - not just the use of the status codes - but also the way we format complex request/response bodies, the handling of exceptions, structure of the URL.

Our initial decision was that the REST APIs were private to Egeria and that technology is plugged in through the connectors - or by calling the client interfaces.

However, having the REST APIs as external APIs has advantages in supporting multi-programming environments - we see this advantage in the Python Notebooks - so it is reasonable to revisit this decision.

cmgrote commented 4 years ago

I can appreciate that option 1 has some advantages, but I also believe it has some potentially significant disadvantages.

In support of option 2:

REST API Standards: while the HTTP spec itself would support either approach, I have not been able to identify any REST API standard (which admittedly are all emerging rather than as well-established as HTTP itself) that advocates option 1. Both fellow LF project OAS (http://spec.openapis.org/oas/v3.0.2#http-status-codes) and JSON:API (https://jsonapi.org/format/#crud-updating-responses) advocate for option 2. I am hesitant not to adopt the approach advocated by such standards that specialise in this domain more than ourselves.
Consumability: we seem to be implying that if we adopt option (2) we won't be able to distinguish between transport layer failures and internal Egeria failures, but I don't believe that's accurate. Under option 2, since we would still place a response body against any non-200 response statuses as well, that response body can be used to determine whether the failure was at the transport-layer (response body not present) or Egeria-internal (response body present). For me it's rather a question of priority: is it more important to a consumer to know whether the request was successful or not, or what caused a failure if there was one? (Both options can answer both questions, but there is more work involved to answer one question vs the other depending on the option chosen.) In other words, I wonder if those adopting APIs do not have an initial hierarchy of concerns when receiving a response along the following lines: a) did it succeed or fail, and then b) only if it failed, why did it fail? By adopting option 1 we are forcing the API consumer to parse into every response to determine true success or failure, as that is only identified within the body of the response itself with option 1. (This also requires consumers to understand this response format to be able to check for errors.) I'm seeing the two options from a consumption perspective as:
1. Did I have a transport failure? No, I got a 200 with a response body. Now let me check the response body to see whether my request worked or had an error. (Or Yes, now I fix the transport error and go back to seeing whether there is another transport failure or a response whose body I need to introspect for actual success / failure status.) Getting to "success" always requires 2 hops: status code check, then response body parsing. Understanding failure could require either 1 hop (status code alone, which could be transport or framework (Spring itself could return such codes still in certain conditions)), or 2 hops (status code is OK, but response body has an error inside it).
2. Did my request work? Yes, I got a 2xx response. (Or No, I got a non-2xx response: now let me look into the response body to see why it failed.) Getting to "success" always requires only 1 hop: the status code. Understanding failure always requires 2 hops: non-2xx response, parse the response body (if any).
Ease-of-interpretation: Popular libraries seem to adopt the overall success vs failure check of (2), so in the multi-programming environments we want to support it may be the simpler / quicker option for consumers? For example Python's requests library (3rd most-downloaded library in PyPI with 13M downloads / week) provides a raise_for_status() method that automatically raises errors for unsuccessful responses based on the response code (https://2.python-requests.org/en/master/user/quickstart/#response-status-codes). Furthermore, it automatically interprets the success of a response as a boolean based on the response code (between 200 and 400 as success, and anything else as a failure (https://realpython.com/python-requests/#status-codes)). That is, consumers can programmatically check this without needing to first understand our response body format (find our documentation, read it, understand it, etc) to parse it for a potential error. Another example is curl (https://curl.haxx.se/docs/manpage.html#-f), which is perhaps even more significant given the potential challenges of parsing JSON from command-line scripts to detect whether there is an error inside the body of the response, vs. simply checking the status code for success or failure. While additional tools can be installed to help with this (eg. jq, or even python or other higher-level languages themselves), this increases the footprint necessary for simple automation in areas like containers.
Adoption: I've also quickly surveyed different metadata APIs for their own approaches. I think it would be helpful to follow a similar pattern as the rest of the community with which we want to integrate. I have not been able to identify any that follow option 1. All of the following take the approach of option 2:
- WKC (uses 4xx and 5xx responses, as do the Watson Data APIs in general)
- IGC (uses 201 as well as 200 for success, 4xx responses to indicate improper request bodies, and responds with 404 to requests for instances (entities) that do not exist)
- Apache Atlas (uses 4xx responses to indicate improper request bodies, and responds with 404 to requests for instances (entities, relationships) that do not exist)
- CKAN (uses 4xx and 5xx responses)
- Lyft's Amundsen (uses 5xx responses)
- WeWork's Marquez (uses 201 as well as 200 for success, and appears that it would use 4xx and 5xx responses as while not formally documented the code uses javax.ws.rs, extends its built-in exceptions (that do generate such codes) and throws them from the API resources under certain conditions)
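
To make the "hops" comparison concrete, here is a small Python sketch of the two consumer-side checks. It uses only the standard library; the function names are invented for illustration, and `relatedHTTPCode` is an assumed name for the field in which the response body records its status under option 1.

```python
import json


def succeeded_option_1(status: int, body: str) -> bool:
    """Option 1: a 200 only means the call got through; the body decides success."""
    if status != 200:
        # Hop 1 failed: transport- or framework-level problem.
        return False
    # Hop 2: parse the body and look for an embedded error code.
    payload = json.loads(body)
    return int(payload.get("relatedHTTPCode", 200)) < 400


def succeeded_option_2(status: int) -> bool:
    """Option 2: the status code alone answers 'did it work?'."""
    return 200 <= status < 300
```

With option 2 the body only needs to be parsed on failure, which is also the check that helpers like raise_for_status() or curl's -f flag perform on the consumer's behalf.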

I can appreciate that some codes (like 404) may seem particularly confusing in this server-internal vs transport-layer error debate between option 1 and option 2. However, even for a 404 response, given that our REST resource URLs for things like getEntity actually have the GUID in the URL itself, I see it as debatable whether a URL with an unknown GUID is any more valid than one where some other portion of the URL is mis-spelled (which would validly result in a 404) -- from the perspective of a URL that does not exist, both seem equally relevant to me.

However, IMHO it would be better to take such debatable codes as an exception we explicitly handle rather than as a reason to entirely discard option 2: eg. perhaps under option 2 we instead opt not to use 404's at all, so for endpoints like getEntity we simply return a 200 status with a null or empty response object (rather than any error body). (For me it's debatable whether it's even an error that someone has requested an entity instance that we have not stored in our repository, for example, or whether we should simply report that the request succeeded but we have no such entity in our repository.)

Equally, I feel that other codes are quite clear-cut: a 500 Internal Server Error to me is a clear indication that something unexpected has gone wrong in our processing, within the server (and not at the transport level). And per my book above (😉), it would be a very clear communication that does not require me to understand the response body format to understand that the server did not succeed in processing my request.

(Edited to also point out how option 2 can assist with command-line automation via common tools like curl which are often used in environments that won't have complex JSON parsing capabilities.)

planetf1 commented 4 years ago

My thoughts

- I think in practical terms our REST APIs are public - I don't think we've been clear enough in saying they are private to consumers... as per the references to the notebooks
- I am inclined to consider the proposal is good, as it does make the API a little clearer IMO - @cmgrote your last post & research is superb.
- However it does break backward compatibility, and we've made efforts to stress we would aim not to do this. We perhaps are in the stage where we could still just about make this change if managed carefully

One option (though it could be deemed procrastination?) is to post a proposal to make this change very clearly on our public list/chat & gather feedback over a release cycle. Perhaps linking this in with clarification over the status ( -> public? ) of the REST API

grahamwallis commented 4 years ago

Good discussion and I agree with pretty much everything that has been said, including the point that this is about more than just implementation of a rigid specification - it depends on how one applies the specifications in terms of mapping to architectural concepts (e.g. layers) and both specs/standards and precedent/opinion are important.

On both fronts I am still of the opinion that our APIs would be better if we were to implement option 2.

Reading Nigel's comment above I realise I should clarify my earlier definition of "option 2" in which I said "not worry about breaking compatibility". What I meant is that we should not attempt to preserve backward compatibility - e.g. by API versioning or introducing a dual API - those would be the activities of option 3 (which I am not in favour of). If we were to adopt option 2 I think we need to be very public about the proposed change and provide [at least] a release of warning. My thoughts were that we might propose the change with 1.4 and implement it in 1.5 (subject to approval by the wider community) - together with comprehensive documentation.

cmgrote commented 4 years ago

Yes, I think we’d have to have this release-long request for review on option 2: as per the 1.1 release notes, I would consider (at least some of) the APIs to be “released functionality”: https://github.com/odpi/egeria/blob/master/release-notes/release-notes-1-1.md

(Though without thorough documentation on error formats, status codes, etc there is potentially still room for interpretation on its level of “released”...)

But as far as I'm aware we can't really do anything with Egeria without first configuring a cohort, repository, etc? Short of someone magically knowing how to write up a config document by hand (doubtful?) all of that requires the use of the administration API -- and I don't believe there's any client yet for the administration (only the API). So I think there's a pretty strong argument that at least some of the APIs are public, released, and have been fully adopted by everyone that's using Egeria 😉

cmgrote commented 4 years ago

Based on today's discussion we agreed that, given where we are with other implementations building on the backbone of OMRS, we would retain the existing style (option 1) for OMRS.

For OMAS endpoints, we could/should consider the use of option 2.

cmgrote commented 4 years ago

Closing as I believe this is now decided for OMRS, and permitted for OMASes.

odpi / egeria

REST API response status codes? #2360