nats-io / nats-architecture-and-design

Architecture and Design Docs
Apache License 2.0
177 stars 20 forks source link

Document in ADR handling of -ERR errors #106

Open wallyqs opened 2 years ago

wallyqs commented 2 years ago

Overview

Currently it is not documented which type of errors ought to close the connection, which errors can be skipped reported as an event to the async error handler and which type of errors should trigger a reconnect. In terms of code this maps to the following function in the base Go client: https://github.com/nats-io/nats.go/blob/5753a5f29250b05ca3b668015ecf404fd00ccf02/nats.go#L3209-L3237

aricart commented 2 years ago

All -ERR messages should trigger a close - if we want non terminal errors, we should expand the protocol to have a -WARN, this would simplify the clients, also we need to update the protocol to either use JSON or headers on this kind of message, parsing strings is not too helpful

matthiashanel commented 2 years ago

100% agree with Alberto on the message body. Perhaps going a similar way than jetstream and have status codes, or the very least details in addition to the fixed error string.

aricart commented 2 years ago

One important point is that if we treat errors as being a terminal condition, warnings imply recovery. If handled at a protocol level, then clients become more simple and straight-forward, and things like permission errors become warnings (in some service bootstrap conditions these are expected to fail until properly provisioned)

wallyqs commented 2 years ago

Feels like what we could do if making -ERR terminal is to send the rest WARN like messages as async status messages (type Msg) with some more details about the error etc...

caspervonb commented 1 year ago

100% agree with Alberto on the message body. Perhaps going a similar way than jetstream and have status codes, or the very least details in addition to the fixed error string.

Proposed this as a "somewhat backwards compatible" addition a few weeks back:

- ERR [error code] <error message>

Rationale being that legacy clients, if they parse by splitting the first word/whitespace would still parse it just fine and it would be in the error message (can of course still run into problems depending on how they match against the contents of the error message).