nats-io / nats.java

Java client for NATS
Apache License 2.0

Message consumer garbles subjects with non-ascii characters #1143

Closed cjohansen closed 2 months ago

cjohansen commented 3 months ago

Observed behavior

I set up a consumer for a subject containing the non-ASCII character ø, e.g. test.løp. Running nats subscribe 'test.løp' picks up messages published with nats publish test.løp "Hello", but the jnats consumer receives the message with the subject "test.lᅢᄌp".

jnats is able to publish messages with non-ascii characters correctly.

Expected behavior

I expect the subject received by jnats to be read as the UTF-8 string "test.løp", not "test.lᅢᄌp".

Server and client version

jnats 2.17.7, nats-server 2.10.12

Host environment

Mac OSX

Steps to reproduce

package nats.example;

import io.nats.client.Connection;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.Subscription;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

public class Demo {
  // connect, nextMessage and close throw checked exceptions
  public static void main(String[] args) throws Exception {
    Connection nc = Nats.connect("nats://localhost:4222");

    // Subscribe before publishing so the message cannot be missed
    Subscription sub = nc.subscribe("test.>");

    String message = "Hello world!";
    nc.publish("test.løp", message.getBytes(StandardCharsets.UTF_8));

    // nextMessage returns null if nothing arrives within the timeout
    Message msg = sub.nextMessage(Duration.ofSeconds(1));
    if (msg != null) {
      System.out.printf("Received \"%s\" on \"%s\"%n",
                new String(msg.getData(), StandardCharsets.UTF_8),
                msg.getSubject());
    }

    nc.close();
  }
}

scottf commented 3 months ago

Understood. We recognize that the server does support UTF-8 subjects, but for cross-client compatibility we decided at one point that the clients would not, and we suggest using ASCII only, as noted in our docs: https://docs.nats.io/nats-concepts/subjects

That being said, we discussed this topic today and decided that it is okay to let clients opt in to this behavior, so I will add this to my to-do list.

cjohansen commented 3 months ago

Ah, I see. I thought I read somewhere that subjects were allowed to use "any printable character", but I can't find the reference. Happy to hear it will be fixed 👍

scottf commented 3 months ago

> Ah, I see. I thought I read somewhere that subjects were allowed to use "any printable character", but I can't find the reference. Happy to hear it will be fixed 👍

Probably in ADR-6

roeschter commented 2 months ago

I clarified the "ASCII only" for subjects and other names in NATS. https://docs.nats.io/nats-concepts/subjects

We don't support UTF-8 for the same reasons it's a bad idea to use Unicode for names on the web. Names need to be shared in all kinds of documents, written to log files and sometimes typed by people. Once you allow anything outside ASCII you open the floodgates and you cannot work with people's systems in other countries anymore.

If you want to know where this can lead, I suggest some computer archaeology into the 90s, when Microsoft "localized" Visual Basic by actually translating the language keywords.

cjohansen commented 2 months ago

If this is how you want to do it, that's fine. I will add: The developer experience would be much smoother if both the server and the clients did the same validation on this. We ran into this problem because of inconsistent behavior between the server, CLI and Java SDK. Had non-ASCII characters been outright disallowed, this wouldn't have been an issue.

If you don't want to tighten validations to avoid breaking backwards compatibility, then I completely agree. But in that case, making the Java SDK behave the same way as the server and CLI (i.e. more permissive, by reading subject names as UTF-8) would have given fewer surprises. It could even issue a warning when non-ASCII characters are detected at creation time.
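A minimal sketch of the kind of check suggested above, which warns when a subject contains non-ASCII characters. The class and method names here are hypothetical and are not part of jnats; the sketch uses only the Java standard library.

```java
final class SubjectChecks {
    // Hypothetical helper (not a jnats API): true when every character
    // of the subject is in the 7-bit ASCII range.
    static boolean isAsciiOnly(String subject) {
        return subject.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        // "test.l\u00f8p" is "test.løp" written with a Unicode escape.
        String[] subjects = {"test.lop", "test.l\u00f8p"};
        for (String subject : subjects) {
            if (!isAsciiOnly(subject)) {
                // A client could log a warning like this when the subject is created.
                System.err.println("warning: subject contains non-ASCII characters");
            }
        }
    }
}
```

A check like this only warns; it would not change what is sent on the wire.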

roeschter commented 2 months ago

Correction: There was a bit of a mixup between recommendations for subject usage and support for UTF-8.

We decided to bring back UTF-8 support for most clients. In general UTF-8 should work, but it may be optional in some client implementations.

We recommend NOT using non-ASCII characters in subjects, as this can cause all kinds of issues in configuration files, command line tools and simply people being able to read and configure the subjects. I'm speaking as somebody who has worked with the Unicode standard in IT systems since 1995(!!!). If you use non-ASCII in "names", it's at your own risk.

cjohansen commented 2 months ago

Thanks for clarifying, this sounds good to me 👍😊

roeschter commented 2 months ago

I have found the issue (the subject is not UTF-8 decoded for incoming messages) and will suggest a fix.

PS: It's a feature. We are planning to optionally allow UTF-8 again in a future release. The code exists but is disabled for the current release.
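To illustrate the kind of problem described above, here is a small standard-library sketch. It is not the jnats code, and the exact garbled characters depend on how the bytes are mis-decoded. It only shows that a subject encoded as UTF-8 comes back intact when decoded as UTF-8, and comes back garbled when each byte is treated as its own character (here via ISO-8859-1, chosen purely for illustration).

```java
import java.nio.charset.StandardCharsets;

final class SubjectDecodingSketch {
    // Outgoing path: the subject string becomes UTF-8 bytes.
    static byte[] encode(String subject) {
        return subject.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding the bytes as UTF-8 restores the original subject.
    static String decodeUtf8(byte[] wire) {
        return new String(wire, StandardCharsets.UTF_8);
    }

    // Treating each byte as one character splits multi-byte sequences apart.
    static String decodePerByte(byte[] wire) {
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String subject = "test.l\u00f8p"; // "test.løp"
        byte[] wire = encode(subject);

        // The character ø takes two bytes in UTF-8, so 8 characters become 9 bytes.
        System.out.println(subject.length() + " chars, " + wire.length + " bytes");
        System.out.println("UTF-8 round trip intact: " + decodeUtf8(wire).equals(subject));
        System.out.println("per-byte decode intact: " + decodePerByte(wire).equals(subject));
    }
}
```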

scottf commented 2 months ago

@cjohansen I have merged the PR https://github.com/nats-io/nats.java/pull/1169

This allows you to turn on UTF-8 support via the connect options with the supportUTF8Subjects builder method.

You can publish with a UTF-8 subject whether or not this flag is set, since all that happens on the outgoing path is that we convert strings to byte arrays using UTF-8 character encoding anyway.

But on incoming messages, it's a different code path to process messages that might expect a UTF-8 subject, so the option is required.
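For readers who want to try the option, a minimal sketch of enabling it might look like the following. It assumes a jnats release that includes the merged PR, a NATS server at nats://localhost:4222, and the class name Utf8SubjectDemo, which is made up for this example.

```java
import io.nats.client.Connection;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.Options;
import io.nats.client.Subscription;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

public class Utf8SubjectDemo {
    public static void main(String[] args) throws Exception {
        // Opt in to UTF-8 decoding of subjects on incoming messages.
        Options options = new Options.Builder()
            .server("nats://localhost:4222") // assumed local server
            .supportUTF8Subjects()
            .build();

        try (Connection nc = Nats.connect(options)) {
            // Subscribe first so the published message is not missed.
            Subscription sub = nc.subscribe("test.>");
            nc.publish("test.løp", "Hello world!".getBytes(StandardCharsets.UTF_8));

            // nextMessage returns null if nothing arrives within the timeout.
            Message msg = sub.nextMessage(Duration.ofSeconds(1));
            if (msg != null) {
                System.out.println("Subject: " + msg.getSubject());
            }
        }
    }
}
```

With the option enabled, the received subject should match the published one; without it, publishing still works but the incoming subject may be garbled as described in this issue.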

cjohansen commented 2 months ago

Awesome, thanks a lot 😊👍