NIP-01 suggests encoding "content" in a non-JSON compatible way

nostr-protocol / nips

Nostr Implementation Possibilities

NIP-01 suggests encoding "content" in a non-JSON compatible way #1403

Open Vap0r1ze opened 3 months ago

Vap0r1ze commented 3 months ago

The base protocol (NIP-01) draft currently says this:

The following characters in the content field must be escaped as shown, and all other characters must be included verbatim:

A line break (0x0A), use \n

A double quote (0x22), use \"

A backslash (0x5C), use \\

A carriage return (0x0D), use \r

A tab character (0x09), use \t

A backspace, (0x08), use \b

A form feed, (0x0C), use \f

It says "all other characters must be included verbatim", but the JSON standard (see Section 9 "String") requires that "the control characters U+0000 to U+001F" be escaped. The ones that have no short escape in the list above can only be written as \uXXXX Unicode escapes.

An example of a "content" value that is valid in NIP-01 but invalid in JSON:

JSON.parse(`"\u0000"`) // throws SyntaxError: the template literal puts a raw U+0000 inside the JSON string

At this point it's probably not feasible to change the draft to use valid JSON, but the draft should probably mention that you must deviate from the JSON standard to produce NIP-01 compliant event IDs.
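To make the mismatch concrete, here is a minimal sketch. It assumes a hypothetical helper, nip01QuoteLiteral, that escapes only the seven characters listed in the draft and passes everything else through verbatim. The helper is not part of any nostr library; it only illustrates a literal reading of the text.

```javascript
// Hypothetical helper: a literal reading of the NIP-01 escaping rules.
// Only the seven listed characters are escaped; all others are copied verbatim.
const NIP01_ESCAPES = {
  "\n": "\\n",
  '"': '\\"',
  "\\": "\\\\",
  "\r": "\\r",
  "\t": "\\t",
  "\b": "\\b",
  "\f": "\\f",
};

function nip01QuoteLiteral(s) {
  let out = '"';
  for (const ch of s) {
    out += NIP01_ESCAPES[ch] ?? ch;
  }
  return out + '"';
}

// U+0001 is a control character that is not in the NIP-01 list.
const content = "a\u0001b";

const literal = nip01QuoteLiteral(content); // contains a raw 0x01 byte
const standard = JSON.stringify(content);   // contains the six characters \u0001

console.log(literal === standard); // false: the two serializations differ

// The literal form is not valid JSON, so a strict parser rejects it.
let literalParses = true;
try {
  JSON.parse(literal);
} catch {
  literalParses = false;
}
console.log(literalParses); // false
```

For strings that contain only the listed characters, both functions produce the same output. The difference only appears for the remaining control characters, and since the event ID is a hash over the serialized event, the two outputs would give different IDs.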

mikedilger commented 3 months ago

As I recall the intent was that those characters are invalid nostr characters, so we don't need encodings for them.

fiatjaf commented 3 months ago

Unicode escape codes are an aberration from a distant past that should be forgotten.

As long as you're not doing anything super weird this problem won't happen and most default JSON encoders will do the right thing.

Vap0r1ze commented 3 months ago

After looking into this, there's more than just 0x00-0x1F that this "problem" exists for. That section of NIP-01 is essentially trying to restate the ECMAScript spec's QuoteJSONString (how JSON.stringify handles strings) to try to ensure determinism. There are two more ranges that QuoteJSONString uses \uXXXX escapes for (unpaired leading and trailing surrogates), but those don't matter much, since they only exist to cope with the fact that JavaScript strings don't need to be well-formed in any encoding.
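A quick way to see that behavior, assuming an engine that implements the ES2019 "well-formed JSON.stringify" change (Node.js 12 or later should qualify):

```javascript
// Control characters without a short escape are written as \uXXXX.
const nul = JSON.stringify("\u0000");
console.log(nul); // "\u0000"

// Unpaired surrogates are also written as \uXXXX escapes.
const loneLead = JSON.stringify("\uD83D");  // leading surrogate on its own
const loneTrail = JSON.stringify("\uDE00"); // trailing surrogate on its own
console.log(loneLead);  // "\ud83d"
console.log(loneTrail); // "\ude00"

// A well-formed surrogate pair is emitted verbatim, not escaped.
const pair = JSON.stringify("\uD83D\uDE00");
console.log(pair.length); // 4: two quotes plus two UTF-16 code units
```

The lone surrogates cannot be encoded as UTF-8 at all, which is why they are only a concern for implementations whose strings are sequences of UTF-16 code units.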

I think to prevent headache for someone who decides to implement their own JSON (de)serializer, NIP-01 could:

Specify these restrictions for all arbitrary strings (such as those inside event.tags), not just event.content
Either:
1. Require that the decoded strings are valid UTF-8 and disallow control codes
2. Refer to ECMAScript's QuoteJSONString for deterministic string serialization.

As much as I would like the ability to send raw control codes, given that terminals are very much not "a distant past", I do think option 1 is preferable. It ensures that string values are valid UTF-8, which makes compliant parsing easy for both JSON.parse users (no encoding required) and serde_json users (strings must be valid UTF-8, since it uses std::string::String).
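As a rough sketch of what option 1 could look like for a JavaScript client: the function name isAcceptableString is hypothetical, and so is the choice to keep allowing the five control characters that NIP-01 already gives short escapes for (\b, \t, \n, \f, \r). The proposal above does not spell out which control codes would be disallowed, so treat that set as an assumption.

```javascript
// Assumption: control characters that already have a short escape in NIP-01
// stay allowed; every other code unit below 0x20 is rejected.
const ALLOWED_CONTROLS = new Set([0x08, 0x09, 0x0a, 0x0c, 0x0d]);

// Hypothetical check: true if the string has no disallowed control codes
// and no unpaired surrogates (so it can be encoded as valid UTF-8).
function isAcceptableString(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit < 0x20 && !ALLOWED_CONTROLS.has(unit)) {
      return false;
    }
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Leading surrogate: the next code unit must be a trailing surrogate.
      const next = s.charCodeAt(i + 1); // NaN at the end of the string
      if (!(next >= 0xdc00 && next <= 0xdfff)) {
        return false;
      }
      i++; // skip the trailing half of the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Trailing surrogate with no leading surrogate before it.
      return false;
    }
  }
  return true;
}

console.log(isAcceptableString("hello\nworld")); // true
console.log(isAcceptableString("a\u0000b"));     // false
console.log(isAcceptableString("\uD83D"));       // false
```

The same check could be run over every string in event.tags as well as event.content, which matches the first point of the proposal.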

fiatjaf commented 3 months ago

Very good points. I agree.