nats-io / nats-architecture-and-design

Architecture and Design Docs
Apache License 2.0
177 stars 20 forks source link

Pull Request Error/Warning Handling #116

Closed scottf closed 1 year ago

scottf commented 1 year ago

Overview

How should pull request statuses be handled? All statuses come in the form of subscription messages versus an api reply.

Errors / Fatal

Warning

Client Handled

These signify end of data and can handled by the client.

Clients and Tools

Other Tasks

Client authors please update with your progress. If you open issues in your own repositories as a result of this request, please link them to this one by pasting the issue URL in a comment or main issue description.

scottf commented 1 year ago

@derekcollison what would you think about changing the the pull request from a publish to a request/reply. This would allow returning an api error when the pull request is an error against the consumer limits. It seems strange that I have to wait until I get the first message to find out my pull request was illegal. You would have to move the subscription subject into the payload as the reply-to would be used for the request/reply, but other than that it would be the same and could be backward compatible; if the the payload has a subscription subject, the message-reply to is the request reply-to. Otherwise it's the same as it always was.

wallyqs commented 1 year ago

@scottf the problem with that approach is that then imports/exports get extra tricky, we avoided that idea in this PR: https://github.com/nats-io/nats-server/pull/1929

scottf commented 1 year ago

There is more to this. Please see https://github.com/nats-io/nats-server/issues/3165

derekcollison commented 1 year ago

We examined sending messages from a pull request on a different subject, but as @wallyqs points out that has challenges cross account.

I am more than happy to change the 4xx code as you pointed out @scottf to something the clients can detect and decide if they need to surface as an async error.

I believe, but could be wrong, that the only time that would happen is if the user picked a value that was smaller then a given message destined for the consumer. If we picked it (hopefully we would not make it smaller than a single message, but possible), we could do a retry on our own at the client library level.

scottf commented 1 year ago

@derekcollison @aricart The 408 'Message Size Exceeds MaxBytes' is already released in v.2.8.4 So at this point if we change it to a different code we are just going to have to have client code that is version aware, so I'm withdrawing my request to have the code changed and instead support Alberto's suggestions in https://github.com/nats-io/nats-server/issues/3165.

derekcollison commented 1 year ago

The 4xx simply says to the requestor the request was invalid for some reason. The client library can, and should, deal with most of these imo.

If we feel the client library has no recourse and can not deterministically deal with the bad request we can surface that to the user via async error callbacks.

I agree we should change the error code for max bytes exceeded. If we agree that it should be a 409 we can make that change and can be part of next release.

derekcollison commented 1 year ago

No client support has been officially announced and it was experimental. If it should be a 409 let's make the change.

scottf commented 1 year ago

Re-opened. This issue has to do with handling of errors, not the behavior that sends them.

aricart commented 1 year ago

@derekcollison in that vein, we can possibly consider MaxWaiting Exceeded as a 408 - as this is a transient condition? Effectively if we are saying that 409s are terminal conditions, that means that iterators would end. Callbacks can perhaps continue, as the error can be specified.

derekcollison commented 1 year ago

MaxWaiting is time to wait, isn't that a 408 already?

aricart commented 1 year ago

@derekcollison MaxWaiting is a 409

derekcollison commented 1 year ago

Apologies, this is when you say there can only be 100 requests waiting total, and a new request shows up. So that is not a 408, its a 409 imo.

aricart commented 1 year ago

@derekcollison correct - the issue here is that if you send the request again, it will succeed (or at some point it will). The others don't. This is why we need clarity on what a terminal error is. Right now all 409s are terminal except for this guy. In go and callback apis, it doesn't matter what the error is, unfortunately these telegraph out - even if no one should see them. But the minute you have an iterator, the question we want to answer is "can I do it again."

derekcollison commented 1 year ago

Understood, but maybe this one is correct in that we surface to the user when its encountered?

aricart commented 1 year ago

I am saying the opposite, since this is something that if the client keeps on pulling, it will eventually succeed, this should be a 408 instead of the 409 - so we would NOT surface it or end their iterator.

derekcollison commented 1 year ago

It's not a timeout. We are following HTTP status codes.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408

derekcollison commented 1 year ago

Maybe another here fits better?

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#client_error_responses

aricart commented 1 year ago

429 is too many requests.

aricart commented 1 year ago

We can take this offline - the guidance we are looking for is errors that are transient, and the client can re-attempt - Here's how I would have mapped them: 1xx - Heartbeats, Flow Control, etc, - libraries or user code can handle this how they want. 204 - No messages - request succeeded but there is no content to give, try again. 400 - client requests that exceed limits on the consumer (still think that the server should clamp a larger value from a client to the consumer config value) - this request will never succeed unless the request options change 404 - Not found (doesn't exist), this will never succeed unless something else changes (resource is created) 408 - for server signaled timeout (request ended, so client can ask again) 410 - This is a direct lookup of message that was deleted. 412 - for things like mismatch between consumer configuration and access pattern 429 - MaxWaitingExceeded or too many acks pending (asks the client to slow down)

In all cases these codes can be handled blind - meaning there should be no code on a client that tries to parse the description of the status to determine what to do.

derekcollison commented 1 year ago

I like those..

derekcollison commented 1 year ago

@kozlovic could you give me your opinion on the codes above if we wanted to make those changes for 2.9?

kozlovic commented 1 year ago

@derekcollison Nothing against those codes, but obviously we have to take into account that "older" clients (that is, all currently released) may misbehave if server changes following above request. The one that comes to mind is that 404 that I know was handled specifically by some libraries (I know of Go and C clients).

derekcollison commented 1 year ago

That is fair, could I ask you, @aricart and @scottf discuss and decide best path forward for 2.9?

aricart commented 1 year ago

Would the JS API errors be an option? Currently, the clients check for a JS API Error for all responses except for in-stream responses. So the question is if we maintain the current status code for now, but also include a JS API Error, the client could make a better decision. The current codes would trigger the current behaviors, but clients could decide to ignore that and inspect JS API Errors for simpler handling (with a better UX).

The current approach is problematic because it makes many assumptions (message must have no payload), and clients have to compare strings to do the right thing. So changing the codes will break all clients, but currently changing one code may break the client too.

aricart commented 1 year ago

Given the level of support for JetStream, I wonder if clients can send a header with the request, and the server can then flip the codes. This means the old codes will be as they are in perpetuity, but server will render better codes given a hint on the request.

scottf commented 1 year ago

Given the level of support for JetStream, I wonder if clients can send a header with the request... What about adding a field to the CONNECT setting some feature flag?

aricart commented 1 year ago

Given the level of support for JetStream, I wonder if clients can send a header with the request... What about adding a field to the CONNECT setting some feature flag?

depending on the JetStream topology - the server you send the option, won't see it - Better approach would be a versioned request subject to JS APIs

kozlovic commented 1 year ago

What about the server instead of sending: "NATS/1.0 404 No Messages\r\n\r\n", would send the same but add a field such as "NATS/1.0 404 No Messages\r\nNats-Err: 204\r\n\r\n"?

Not sure if server can easily differentiate the conditions giving in @aricart list, @derekcollison will know.

That is, all errors that you mentioned @aricart would be in addition to existing status, with simple ErrCode/Value key/value. With that I could see:

All that being said, @aricart, you would have to qualify what "can re-attempt" means for a client library perspective (and what cannot be re-attempted and what does that mean for libraries that call Fetch()).