odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0
799 stars 260 forks source link

REST API response status codes? #2360

Closed cmgrote closed 4 years ago

cmgrote commented 4 years ago

Per discussion on #2353:

... I need to comment on the 200 response codes. In the REST protocol it is only possible to return a response object if the status code is 200. If we returned, say 500 then it would not be possible to return the error message.

Is this a limitation on our side, or a difference in a standard that defines how REST API status codes differ from the HTTP protocol itself? I can't see anything in the HTTP standard restricting message bodies (response objects?) to status code 200. The overview of status codes (https://tools.ietf.org/html/rfc7231#section-6) explains:

"HTTP clients are not required to understand the meaning of all registered status codes, though such understanding is obviously desirable. However, a client MUST understand the class of any status code, as indicated by the first digit... For example, if an unrecognized status code of 471 is received by a client, the client can assume that there was something wrong with its request and treat the response as if it had received a 400 (Bad Request) status code. The response message will usually contain a representation that explains the status."

(Per the last sentence, a non-200 code of 471 has a response message that provides further explanation.) Regarding a 500 response, specitically (https://tools.ietf.org/html/rfc7231#section-6.6):

"Except when responding to a HEAD request, the server SHOULD send a representation containing an explanation of the error situation, and whether it is a temporary or permanent condition. A user agent SHOULD display any included representation to the user."

Using 200 is not breaking the protocol because it does not mean "success" as many people think - it mean there is further information in the response object.

Again, is this something specific to REST APIs vs the HTTP protocol itself? From the standard (https://tools.ietf.org/html/rfc7231#section-6.3), it feels pretty strongly worded to me that 200 is defined as success:

"Successful 2xx: The 2xx (Successful) class of status code indicates that the client's request was successfully received, understood, and accepted. The 200 (OK) status code indicates that the request has succeeded."

It's confusing to me that we would specify different HTTP status codes (like 500, 404, etc) on the definition of errors, but not actually use them as the response status code -- which per the standard references above seems like it should be possible? -- and I suspect will be confusing to users as well (who upon seeing a 200 response status may well simply ignore any body of the response itself)?

Suggestion:

  1. if we can, we change our approach to make use of the HTTP status codes and record a message body in the response
  2. if not, that we clearly document somewhere quite clearly that the 200 codes we use in responses do not necessarily mean success.
cmgrote commented 4 years ago

On the call we questioned whether these response bodies could also apply to DELETEs, it looks like they can (https://tools.ietf.org/html/rfc7231#section-4.3.5):

"If a DELETE method is successfully applied, the origin server SHOULD send a 202 (Accepted) status code if the action will likely succeed but has not yet been enacted, a 204 (No Content) status code if the action has been enacted and no further information is to be supplied, or a 200 (OK) status code if the action has been enacted and the response message includes a representation describing the status."

(Implying that there are response messages available in addition to the status code, even on deletes.)

cmgrote commented 4 years ago

(I've linked to the latest HTTP/1.1 spec above, but even going back to the very first HTTP/1.0 spec from 1992 there were response bodies possible against non-200 status codes: https://www.w3.org/Protocols/HTTP/HTRESP.html)

cmgrote commented 4 years ago

Happy to look into rolling this out; on the call someone mentioned that in the past we were sending non-200 responses -- any quick pointers on where / how to we were doing that? Just looking for where to start with the response status codes (ie. where they're set / returned by the rest API itself), and in particular in a non-Spring-specific manner to remain portable (?)

cmgrote commented 4 years ago

It appears that the way to handle these non-200 responses in Spring is via exceptions themselves... However, we of course don't want to make our exceptions (or exception-handling) Spring-specific, so I'm thinking of the following approach:

This just means the following changes to existing modules where spring resources are defined:

... and our existing exception handling and recording (outside of Spring) should remain entirely untouched.

grahamwallis commented 4 years ago

To my mind there are 3 options:

  1. stay as we are - i.e. continue to use 200 to convey both success or an error with a response in the body
  2. change to only using 200 for success - and not worry about breaking compatibility - although I would generally not approve of this, I think it is a valid option given the current set of adopters
  3. change to only using 200 for success - and worry about breaking compatibility - I think this would a) be tons of work, b) complicate the APIs and c) is not necessary given the current set of adopters

So I think it comes down to a choice between 1 and 2 - meaning that in my opinion it would be OK to change and break compatibility if we want to; and the decision is about how we want the API to look.

In defence of option 1:

In defence of option 2:

The one thing that is critical with EITHER approach is that the content of the response body needs to accurately and adequately convey the outcome of the operation including the suggested user-action in a human-intelligible form.

My current preference would be to initially talk to people in industry sectors where there may be old gateways, to assess the risk of option 2, then if the risk is deemed low, adopt option 2.

Some useful references:

On use of status codes relative to success/failure of the operation: https://blog.restcase.com/rest-api-error-codes-101/ (Note that Facebook use 200 whatever the outcome; but the author's recommendation is counter to what Facebook do.)

Useful thread on use of HTTP status codes with regard to success/failure: https://stackoverflow.com/questions/942951/rest-api-error-return-good-practices (See in particular the answer from "Larry K" (located just below all the flowchart pictures)

On standardisation of error responses: https://dzone.com/articles/rest-api-error-handling-problem-details-response Quite interesting: RFC7807

On documentation of REST APIs: https://www.openapis.org Linux Foundation OAS

cmgrote commented 4 years ago

Hmmm... I thought we'd discussed this on the call and agreed that we should move ahead with option (2) unless there was a standard that dictated otherwise, but maybe I over-interpreted... As best I can tell the references linked are all opinion-based rather than a standard?

The only exception being the last one for OAS, which to me clearly advocates option 2 (http://spec.openapis.org/oas/v3.0.2#http-status-codes -- itself referring to the IETF / IANA standard):

"The HTTP Status Codes are used to indicate the status of the executed operation. The available status codes are defined by [RFC7231] and registered status codes are listed in the IANA Status Code Registry."

As mentioned earlier, response bodies have been allowed on non-200 responses since the early '90s, so we're talking ~30 years now... So I'm certainly keen that we address your point on industry sector input where there may be old gateways, but we'd presumably be talking about gateways that are > 30 years old (?)

cmgrote commented 4 years ago

Actually I'm not sure I understand this point about old gateways in general -- my understanding is that for them to not support response bodies they'd need to not support (predate?) the ~30-year old HTTP/1.0 spec, so I'm not sure I follow how they'd interpret / handle HTTP at all (status codes or response bodies) (?)

mandy-chessell commented 4 years ago

All options are valid by the spec, so we are just discussing style.

I was leaning towards option 2 but given the discussions above, I have changed my mind and think option 1 is better for us.

Option 1 retains backward compatibility for our APIs, ensures our errors will always get through and allows us to distinguish between requests that fail in the transport layer rather than in our server. This could be really useful in complex multi-cloud environments.

The more minor point is that it is the least amount of work and we are desparately overloaded.

mandy-chessell commented 4 years ago

I was just reading through the earlier comments and completely agree we need to document our REST API style - not just the use of the status codes - but also the way we format complex request/response bodies, the handling of exceptions, structure of the URL.

Our initial decision was that the REST APIs were private to Egeria and that technology is plugged in through the connectors - or by calling the client interfaces.

However, having the REST APIs as external APIs has advantages in supporting multi-programming environments - we see this advantage in the Python Notebooks - so it is reasonable to revisit this decision.

cmgrote commented 4 years ago

I can appreciate that option 1 has some advantages, but I also believe some potentially significant disadvantages.

In support of option 2:

I can appreciate that some codes (like 404) may seem particularly confusing in this server-internal vs transport-layer error debate between option 1 and option 2. However, I also see that even for a 404 response, given that our REST resource URLs for things like getEntity actually have the GUID in the URL itself, that it's debatable whether this is a valid URL or not in the same way that mis-spelling some other portion of the URL would be invalid (and validly result in a 404) -- from the perspective of a URL that does not exist, both seem equally relevant to me.

However, IMHO it would be better to take such debatable codes as an exception we explicitly handle rather than as a reason to entirely discard option 2: eg. perhaps under option 2 we instead opt not to use 404's at all, so for endpoints like getEntity we simply return a 200 status with a null or empty response object (rather than any error body). (For me it's debatable whether it's even an error that someone has requested an entity instance that we have not stored in our repository, for example, or whether we should simply report that the request succeeded but we have no such entity in our repository.)

Equally, I feel that other codes are quite clear-cut: a 500 Internal Server Error to me is a clear indication that something unexpected has gone wrong in our processing, within the server (and not at the transport level). And per my book above (😉), would be a very clear communication that does not require me to understand the response body format to understand that the server did not succeed in processing my request.

(Edited to also point out how option 2 can assist with command-line automation via common tools like curl which are often used in environments that won't have complex JSON parsing capabilities.)

planetf1 commented 4 years ago

My thoughts

One option (though it could be deemed procrastination?) is to post a proposal to make this change very clearly on our public list/chat & gather feedback over a release cycle. Perhaps linking in this with clarification over the status ( -> public? ) of the rest api

grahamwallis commented 4 years ago

Good discussion and I agree with pretty much everything that has been said, including the point that this is about more than just implementation of a rigid specification - it depends on how one applies the specifications in terms of mapping to architectural concepts (e.g. layers) and both specs/standards and precedent/opinion are important.

On both fronts I am still of the opinion that our APIs would be better if we were to implement option 2.

Reading Nigel's comment above I realise I should clarify my earlier definition of "option 2" in which I said "not worry about breaking compatibility". What I meant is that we should not attempt to preserve backward compatibility - e.g. by API versioning or introducing a dual API - those would be the activities of option 3 (which I am not in favour of). If we were to adopt option 2 I think we need to be very public about the proposed change and provide [at least] a release of warning. My thoughts were that we might propose the change with 1.4 and implement it in 1.5 (subject to approval by the wider community) - together with comprehensive documentation.

cmgrote commented 4 years ago

Yes, I think we’d have to have this release-long request for review on option 2 as per the 1.1 release notes I would consider (at least some of) the APIs to be “released functionality”: https://github.com/odpi/egeria/blob/master/release-notes/release-notes-1-1.md

(Though without thorough documentation on error formats, status codes, etc there is potentially still room for interpretation on its level of “released”...)

But as far as I'm aware we can't really do anything with Egeria without first configuring a cohort, repository, etc? Short of someone magically knowing how to write up a config document by hand (doubtful?) all of that requires the use of the administration API -- and I don't believe there's any client yet for the administration (only the API). So I think there's a pretty strong argument that at least some of the APIs are public, released, and have been fully adopted by everyone that's using Egeria 😉

cmgrote commented 4 years ago

Based on today's discussion we agreed that given where we are with other implementations building on the backbone of OMRS, that we would retain the existing style (option 1) for OMRS.

For OMAS endpoints, we could/should consider the use of option 2.

cmgrote commented 4 years ago

Closing as I believe this is now decided for OMRS, and permitted for OMASes.