nats-io / nats-architecture-and-design

Architecture and Design Docs
Apache License 2.0

ADR-4 Proposal: alternate character encodings in headers #165

Closed caleblloyd closed 1 year ago

caleblloyd commented 1 year ago

NATS users may want to send header keys and values in encodings other than ASCII. This is especially popular in languages other than English.

Highlight that NATS clients can optionally offer header key/value encodings that are backward-compatible with ASCII.

Allow clients to select UTF-8 as their default encoding, and annotate that UTF-8 is the preferred character encoding.

Understandably, client implementations may rely on HTTP header parsing libraries that enforce ASCII. For this reason, and also for backward compatibility reasons, ASCII should remain a valid default encoding option.

caleblloyd commented 1 year ago

As it turns out, nats.go already partially supports UTF-8, at least in the Header Values:

    sub, _ := nc.SubscribeSync(subject)

    m := nats.NewMsg(subject)
    m.Header.Add("Test-Utf8", "😃😁😂")
    nc.PublishMsg(m)

    msg, _ := sub.NextMsg(time.Second)
    fmt.Println(msg.Header.Get("Test-Utf8"))

    // prints "😃😁😂"
scottf commented 1 year ago

So are we talking about the client encoding across the wire? For instance, if the user presents z Ḁ Ḃ Ḉ x as the actual value, do we actually transfer it as z\u0020\u1e00\u0020\u1e02\u0020\u1e08\u0020x, with the user telling us somewhere in our headers handling to do this encoding?

caleblloyd commented 1 year ago

@scottf correct, this is to determine how to encode header key/value strings -> byte[] on the wire when sending a message, and decode header key/value byte[] -> string off the wire when receiving a message.
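In Go terms, that encode/decode step could be sketched as follows (the helper names here are hypothetical, not nats.go API; in Go a string-to-[]byte conversion already produces UTF-8 bytes):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeHeaderString converts a header key or value to its on-the-wire
// bytes. In Go, a plain string conversion already yields UTF-8 bytes.
func encodeHeaderString(s string) []byte {
	return []byte(s)
}

// decodeHeaderString converts wire bytes back to a string, rejecting
// byte sequences that are not valid UTF-8.
func decodeHeaderString(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", fmt.Errorf("header bytes are not valid UTF-8")
	}
	return string(b), nil
}

func main() {
	wire := encodeHeaderString("z Ḁ Ḃ Ḉ x")
	fmt.Println(len(wire)) // 15 bytes on the wire

	s, err := decodeHeaderString(wire)
	fmt.Println(s, err)
}
```

Note that `len(wire)` is 15, matching the byte count Scott quotes below: the three accented characters each take 3 UTF-8 bytes.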

scottf commented 1 year ago

So that's encoded to ASCII, similar to JSON/URL encoding. Java's built-in string-to-bytes conversion allows for an encoding such as UTF-8, but for instance these are the actual 15 bytes (122 32 -31 -72 -128 32 -31 -72 -126 32 -31 -72 -120 32 120, as signed Java bytes) it produces for that string, so it's really just multi-byte; just making sure that is not what we want.

caleblloyd commented 1 year ago

UTF-8 is a variable-length encoding, using between 1 and 4 bytes per character. So the []byte count may not be equal to the character count; the header payload length would need to be the encoded byte length.

ColinSullivan1 commented 1 year ago

Am not opposed to providing additional encodings but imo the default should be ASCII, which all clients would need to support. This would allow NATS to better facilitate direct interop with HTTP, gRPC, etc and is the lowest common denominator across languages, tooling, browsers, etc. Could expand to support RFC 2047 (MIME) as in ISO-8859-1 (https://www.w3schools.com/charsets/ref_html_8859.asp). Is there a specific use case driving the proposal?

More on this here: https://www.jmix.io/blog/utf-8-in-http-headers/

caleblloyd commented 1 year ago

The reason for wanting UTF-8 to be the preferred default is that it would allow for much broader character support out-of-the-box in headers while maintaining byte-for-byte compatibility with ASCII when only ASCII characters are in use.

string in most programming languages supports Unicode. It seems like a very limiting practice to allow the user to create a header collection of Dictionary<string, string[]> with full Unicode support, but then limit that down to just 128 ASCII characters.

If folks want to interop with HTTP that is fine, just don't use any non-ASCII characters, and UTF-8 will encode them to the exact same thing as ASCII.
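That byte-for-byte claim is easy to verify: an ASCII-only string contains no byte >= 0x80, so its UTF-8 encoding is the identical byte sequence. A small illustrative check in Go:

```go
package main

import "fmt"

// isASCII reports whether every byte of s is in the 7-bit ASCII range.
// For such strings, the UTF-8 and ASCII encodings are the same bytes.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII("Content-Type: application/json")) // true
	fmt.Println(isASCII("X-User-Name: Ḁlice"))             // false
}
```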

caleblloyd commented 1 year ago

It may also be good to figure out what the reality is with clients already. .NET and Java enforce ASCII only. Go appears to allow UTF-8 in Header Values but not Keys. Would be good to know what some of the other clients are doing by default right now.

scottf commented 1 year ago

+1 to providing optional encoding, but I think the default should be restricted as currently defined, not even UTF-8. The caveat is that if we provide a way to optionally encode in one client, this really becomes a parity item that all clients must provide.

aricart commented 1 year ago

The reason for the keys to be ASCII is that that is the HTTP and MIME header specification. One question on this is whether it is customer-driven or we just want to be able to support it. The main issue, as Scott points out, is that the minute a client writes one of these encoded things, client parity across all clients becomes an issue.

Jarema commented 1 year ago

I would think that this is a pretty rare use case, especially as NATS headers are perceived as being pretty close to HTTP headers.

Not actually being HTTP header spec compatible has already caused problems in quite a few languages that wanted to reuse well-optimized and well-tested HTTP header libraries but couldn't (as our NATS headers for JS/KV/ObjectStore are required to be sent case-sensitively to the server). This change could make things harder.

I see one more reason: as we have some ideas around nice NATS microservices tooling, connectors, and more, straying further from the HTTP header spec could bring even more problems.

I think the simplest solution is to just suggest that users keep keys compliant with the HTTP header spec, which should not be a big issue, and if they need non-ASCII values, just base64 them.

caleblloyd commented 1 year ago

I don't think it is that rare of a use case; say an app wants to pass around X-User-Name and someone's name is in a different language not covered by ASCII.

But upon further reading of the spec, it seems that optional encoding is allowed in field values. Excerpt from RFC 9110 for Field Values:

Field values are usually constrained to the range of US-ASCII characters [USASCII]. Fields needing a greater range of characters can use an encoding, such as the one defined in [RFC8187].

So maybe this isn't even needed then; it is already laid out in the HTTP spec. ASCII should probably be the default, but clients "can use an encoding".
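For reference, the RFC 8187 encoding mentioned in that excerpt percent-encodes the UTF-8 bytes of a value and prefixes a charset tag. A rough Go sketch of the idea (a simplified rendition, not a complete implementation of the RFC's grammar):

```go
package main

import "fmt"

// attrChar reports whether b may appear unencoded in an RFC 8187
// ext-value (attr-char: ALPHA / DIGIT / a small set of marks).
func attrChar(b byte) bool {
	if (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9') {
		return true
	}
	switch b {
	case '!', '#', '$', '&', '+', '-', '.', '^', '_', '`', '|', '~':
		return true
	}
	return false
}

// encodeExtValue renders s as an RFC 8187 ext-value with the utf-8
// charset and no language tag, percent-encoding all other bytes.
func encodeExtValue(s string) string {
	out := []byte("utf-8''")
	for i := 0; i < len(s); i++ {
		b := s[i]
		if attrChar(b) {
			out = append(out, b)
		} else {
			out = append(out, fmt.Sprintf("%%%02X", b)...)
		}
	}
	return string(out)
}

func main() {
	fmt.Println(encodeExtValue("€ rates"))
	// utf-8''%E2%82%AC%20rates
}
```

The resulting value is pure ASCII, so any client that enforces ASCII headers can carry it unchanged; only clients that opt in would need to decode it.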