JSON serialization for deriving `event.id` is not adequately specified

shafemtol commented 1 year ago

Correct derivation of event.id is critical for correctly referencing, signing and verifying events. NIP-01 currently specifies the derivation of event.id as follows:

To obtain the event.id, we sha256 the serialized event. The serialization is done over the UTF-8 JSON-serialized string (with no white space or line breaks) of the following structure:
[
  0,
  <pubkey, as a (lowercase) hex string>,
  <created_at, as a number>,
  <kind, as a number>,
  <tags, as an array of arrays of non-null strings>,
  <content, as a string>
]

JSON itself does not specify a canonical way to serialize strings and numbers. That means the above specification is ambiguous and can lead to mutually incompatible implementations.

Excluding control characters, \ and ", JSON strings can directly represent any Unicode codepoint as itself. JSON provides some special escape sequences like \n, \\, \" and \/(!). In addition it provides \u, which can be used to represent any UTF-16(!) codepoint. A newline can be represented as either \n or, in theory, \u000a or \u000A. The letter ß can be represented as ß, \u00df, \u00DF or even something as silly as \u00dF. Some JSON implementations will escape any non-ASCII character by default, others might not. Some escape the forward slash by default, others do not.

The mention of "UTF-8 JSON-serialized" can be understood to mean that non-ASCII characters MUST NOT be escaped, but this is not clear.

I see two different ways to deal with the incompleteness of the current specification:

Encode strings as RFC 3629 UTF-8, do not escape characters in strings unless absolutely needed (\, ", control characters), and use the special, short escape sequences where available (e.g. \n instead of \u000a). For \u escape sequences I would assume lower case simply from two implementations I've tested, but this really has to be specified. Represent integers without any fraction or exponent.
Represent the values exactly as they are represented in the event object itself. This would make implementation easier when generating events and should be backwards compatible with all existing events, but it would complicate implementation everywhere else.

It seems to me that clients generally set created_at as an integer, although this should also be more clearly specified. If non-integer values are allowed here, that introduces even more issues such as the allowed precision (JSON itself has no limits here, but many implementations will decode to IEEE 754 binary64, silently rounding if needed), fraction and exponent representation.

Another thing that puzzles me is the mention of "non-null strings" here. If "null strings" should not exist in tags, I would expect that to be specified for the event object itself, not in the serialization for id derivation.

I'm leaning towards changing NIP-01 to specifying a strict serialization with minimal escaping of strings and not allowing non-integer created_at, unless doing so would make it incompatible with a significant portion of already existing events. In case of incompatibility with existing events, where their id has been derived using a different serialization, the alternative of using the same string serialization as for the event itself will be necessary, which impacts the json handling of both relays and receiving clients, making it rather brittle. Perhaps handling of such events can be made optional, so that relay and client implementations can choose to only accept strictly compliant events.

Some suggestions to address other potential issues regarding Unicode strings:

Clarify that Unicode normalization MUST NOT be performed on received events or during serialization for id derivation.
Require that UTF-8 strings (including during id derivation) comply with the latest UTF-8 spec, RFC 3629, which specifically disallows the CESU-8 variant.

I haven't seen this issue mentioned in this repo. It seems @vitorpamplona ran into the issue on nostr here: note1yef4z9ren4lt4xqj95yugzk2edghjvg0k3mxxfj5gm3fzxj4p6yspdtpwl

fiatjaf commented 1 year ago

created_at is always an integer, all tags are arrays of strings, events should be invalid otherwise. I don't see the point of saying that UTF-8 should not be normalized. Normalization changes the content, it's absurd to think there should be normalization.

The other things you're saying I don't understand, nor will 99.999% of the developers, so I don't see the point of clobbering NIP-01 with it. Specially because so far no one had run into any problems with escaping (except for the thing you linked), so there seems to be a reasonable standard among all JSON libraries.

If you think that is very important we can create a very detailed specification somewhere else and link to it from NIP-01. Can you do that?

shafemtol commented 1 year ago

so far no one had run into any problems with escaping

I worry: Is anyone actually looking? Most English-language notes won't contain non-ASCII characters, so you generally don't run into the problem. Are clients verifying the id and signature of events like they are supposed to? I've seen claims that many clients don't. Once they start taking verification seriously, will they be able to verify existing events, or will some be rejected because they were contain some non-ASCII character and were created by some less used client implementation?

If you think that is very important we can create a very detailed specification somewhere else and link to it from NIP-01. Can you do that?

I'm looking at RFC 8785 JCS, which aims to address the exact issue of canonical JSON representation for cryptographic purposes. It happens to specify string escapes exactly as I suggested, and it seems to be otherwise compatible with the current NIP-01 specification ("no white space or line breaks"). Maybe we can just refer to that.

shafemtol commented 1 year ago

I'm going to write a PR on NIP-01 to:

constrain events to the I-JSON (RFC 7493) subset (as required by JCS)
require that the serialization for id derivation conforms to JCS (RFC 8785)

All web clients most likely already conform to JCS in this case, as JCS rules for string and number serialization follow ECMAScript's JSON.stringify() (JCS also has rules for sorting properties in objects, but that's not relevant here).

I-JSON covers concerns about UTF-8 validity and binary64 numbers (though not too relevant if all numbers are integers anyway). Existing events should already be compliant unless they contain strings that are not valid UTF-8. JCS also covers the concern I mentioned about Unicode normalization (basically: don't do it).

vitorpamplona commented 1 year ago

I think simply writing the exact characters that should be escaped in strings is simple enough.

Just to give you all a real use case, I had to manually replace the escaping of \u2028 and \u2029 in Gson. The java lib forces the escaping of these chars, but nobody else in Nostr seems to do it. Since there was no spec clarification for this, I had to guess which chars should and should not be escaped when writing strings.

shafemtol commented 1 year ago

Just to give you all a real use case, I had to manually replace the escaping of \u2028 and \u2029 in Gson.

This code is buggy, by the way. If a user types the text \u2028 (not the character U+2028) into a note, it will be represented as \\u2028 in the json (i.e. the backslash is escaped). AIUI, your code changes that to \ (an unescaped backslash) plus the character U+2028, which is both invalid json and not what the user intended. And are you sure it won't escape other characters it shouldn't?

See Appendix G in RFC 8785. It links to implementations of JCS, including one in Java.

vitorpamplona commented 1 year ago

Correct. But there is no way to disable the escaping of those chars in GSON, so... Gotta live with it for now.

shafemtol commented 1 year ago

I think you missed two of my points:

The GSON hack is broken: it generates wrong id when the literal text \u2028 or \u2029 is encountered in a note.
Don't use GSON here. There are already compliant implementations.

Alternatively, implement a subset of JCS from scratch. You only need list, integer and string serialization. List and integer are pretty straightforward. String serialization is covered in section 3.2.2.2, in short: Only escape \, " and characters below U+0020. Use \\, \", \b, \t, \n, \f and \r for their respective characters, otherwise \u00hh, where hh is the lowercase hex representation of the codepoint.

If you absolutely MUST use GSON, do the replacement properly. I don't think regex is the best tool here. Assuming strings are UTF-16 internally: For each \ encountered, look at the next character. If it's not u, continue from after that character. Otherwise, read the next four characters as a hexadecimal integer. If it's >= 0x20, convert the whole sequence to the corresponding Unicode character. Assuming GSON has the correct rules about \\, \", \b, \t, \n, \f and \r, that should work.

vitorpamplona commented 1 year ago

I didn't miss it. It is broken. The good thing there is that there is no event in Nostr where the string \u2028 is part of the message yet compared to over 1000 events using the U+2028.

Gson does everything else you said already.. so somewhat compatible with JCS (just missing this \u2028 issue).

mikedilger commented 1 year ago

Whatever characters are in the event fields should be the ones signed. You aren't supposed to interpret these escape sequences and then digitally sign, you are supposed to sign the literal bytes of the string as it appears inside of the event. If the string has a literal --> <-- carraige return, then that is what gets signed. If the event field uses "\u000a", then that is what you sign... that is 6 characters, which is different from "\n" which is 2 characters.

As for how code should interpret user input and convert that into a string, whether it should escape various characters ... that's out of scope for the NIP in question.

A backslash escaping a backslash or a double quote is not part of the string being signed, it's just how a JSON string gets represented in certain contexts. If there are two backslashes in the content, then that is what you sign. If there is just one and it's invalid JSON, then that is what you sign.

EDIT: I seem to be wrong but I do not comprehend why just yet.

vitorpamplona commented 1 year ago

Everyone is parsing and regenerating the JSON. When it happens, ALL libs unescape and escape again. There is no way to get the "original" byte array.

shafemtol commented 1 year ago

Having to retain the binary representation of the JSON objects really makes things difficult. You get the worst of everything.

mikedilger commented 1 year ago

Everyone is parsing and regenerating the JSON. When it happens, ALL libs unescape and escape again. There is no way to get the "original" byte array.

Escaping is how you represent the data as a literal in a JavaScript code context to represent the literal data... the escape characters are not actually in the data and do not show up on the wire. Some data contains sequences that look like escape sequences but those sequences are to be preserved because they are part of the literal data on the wire.

I don't think JavaScript does anything special here except it will escape certain things in order to present those things in a code context, such as on your console logs. The reason I think this is that you don't just eval() JavaScript JSON strings, but rather you use the special functionalty of JSON.parse() and JSON.stringify().

I just scanned through the rust code here: https://github.com/serde-rs/json and it treats JSON strings as UTF-8 with no other processing for both serialization and deserialization. There is no unescaping and escaping. And gossip which uses serde_json has not been showing incorrect data (and if you want to test this, send me an event that you predict I will handle wrongly and we shall find out).

EDIT: I seem to be wrong but I do not comprehend why just yet.

Having to retain the binary representation of the JSON objects really makes things difficult. You get the worst of everything.

I argue the opposite. Having to deal with all the immense escaping corner cases and standards that you pointed to sounds like the absolute pit of hell to me.

mikedilger commented 1 year ago

I will read https://www.rfc-editor.org/rfc/rfc8785 and test serde_json and find out why I am wrong.

Ok JSON isn't data, it's code. It's code so that characters not representable in a text editor can be expressed. I presumed you could represent all UTF-8 characters in a text editor in modern times, but that's not quite true.

I bow out.

fwip commented 1 year ago

The spec is definitely ambiguous as-written - I read the first part of NIP01 and immediately came to the github issues for clarification. If maintaining backwards compatibility is optional, you may want to choose a non-JSON based serialization. Otherwise, following JSC seems like a strong choice.

This is one of the most fundamental pieces of the protocol, and ambiguities in how-is-this-content-hashed is not something you want in your spec. You should probably have a test-suite of messages designed to exercise all the corner cases possible in the format.

I also haven't seen it raised here, but as-written, the ordering of the tags is also ambiguous. Consider specifying that they are sorted under a certain locale. (Unless of course, the tags' ordering is meant to have semantic meaning.)

vitorpamplona commented 1 year ago

Tags order does have semantic meaning. In TextNotes for instance, the old model is first element is the root of a thread and the last e tag is the leaf. Similar for Private Messages, first e is the receiver.

fiatjaf commented 1 year ago

Tag order is not ambiguous, they are ordered, the order is clear.

I think a test suite would definitely be a great addition, more than a thousand lines of detailed JSON escaping notes (but there are welcome too).

Semisol commented 1 year ago

I think we should use the JS escaping rules.

shafemtol commented 1 year ago

I think we should use the JS escaping rules.

That is exactly what JCS (RFC 8785) specifies.

Semisol commented 1 year ago

@fiatjaf would JCS work, or at least defining what chars should be escaped and what shouldn't.

Semisol commented 1 year ago

Actually, it would be simple to just escape \r, \n, " and all control codes. It should not be hard to construct a custom JSON serializer just for this since the structure is known.

fiatjaf commented 1 year ago

I was thinking it would be much better to disallow \uxxxx escape sequences entirely. These things just bork all the code they touch, and they are completely unnecessary since we have UTF-8 now. Only allow \n, \t and \".

Relays can just start banning events that send other stuff. This will also lead to better parsing speeds (see #515).

boreq commented 11 months ago

Oh I started writing a long issue before I found this one.

I think we should move to try to come to some kind of a consensus on this since I am seeing plenty of events in the wild which can't be validated already.

The main concerns that I have are:

that requiring people to implement a custom encoder will make creating new implementations more difficult
we will break some (or even many) existing events by introducing new rules for creating the array used for event id calculation

That being said we are already in the situation where the event id calculation algorithm is "broken" and we need a way out of it sooner rather than later in my opinion. I would consider exploring switching to simpler binary-based options using formats with good preexisting library availability but I realise that is quite a drastic change.

fiatjaf commented 11 months ago

Most JSON encoders already abide to the rules defined here, they just use UTF-8. I think only very old encoders don't.

mleku commented 9 months ago

not that i expect anyone to actually put them in there but characters below 32 (space) (that are not already specified) in strings must be escaped as \uXXXX - this is not specified in the current version:

To prevent implementation differences from creating a different event ID for the same event, the following rules MUST be followed while serializing:

No whitespace, line breaks or other unnecessary formatting should be included in the output JSON.

No characters except the following should be escaped, and instead should be included verbatim:

A line break, 0x0A, as \n

A double quote, 0x22, as \"

A backslash, 0x5C, as \\

A carriage return, 0x0D, as \r

A tab character, 0x09, as \t

A backspace, 0x08, as \b

A form feed, 0x0C, as \f

UTF-8 should be used for encoding.

note that this is a very specific escaping rule set, and will probably not work on numerous JSON encoders though the output will likely be marshaled correctly by all of them

i've already seen a few places where the html entities being escaped has popped up and not just in users of the go standard library encoding/json but also on web apps, notably the greater than and less than symbols

probably this can be tolerated but this is the gamut of different quirks i've encountered so far in my study of this, clients should be aware of this possibility and render the html entities correctly since it is annoying to read > <

in my code, for sake of safety and sensible responses to rubbish data, if any character lower than 0x20 (32) appears in a string it is escaped to \uXXXX format

the reason why this is important is because if these cases are not covered, they could prove to be a problem later on and cause unspecified and insecure behaviour

mleku commented 9 months ago

Oh I started writing a long issue before I found this one.

I think we should move to try to come to some kind of a consensus on this since I am seeing plenty of events in the wild which can't be validated already.

The main concerns that I have are:

that requiring people to implement a custom encoder will make creating new implementations more difficult

we will break some (or even many) existing events by introducing new rules for creating the array used for event id calculation

That being said we are already in the situation where the event id calculation algorithm is "broken" and we need a way out of it sooner rather than later in my opinion. I would consider exploring switching to simpler binary-based options using formats with good preexisting library availability but I realise that is quite a drastic change.

I think that the existing rules are not difficult to write un/escape processing to conform to it, but i also think that the canonical [0,...] format entails a lot of excess processing that would be better avoided.

I have been thinking a lot about this and I think the solution will be to make a special tag which can carry an encoding identifier, the ID and the signature, which must be elided during the validation to reconstitute the form that derives the ID hash

However, adding this to the NIPs as stated in the main readme requires two clients and a relay to implement this feature. I'm focused on implementing the current existing baseline for the bulk of the protocol, at least as much as the relay is concerned, in a few months time I would like to get into discussions on creating a fast validating compact binary format... with varints and a relatively rigid structure, and more strict rules about message sentinel ambiguity than we currently have with COUNT and AUTH envelopes.

mikedilger commented 9 months ago

Unprintable characters below ASCII 32 which don't have explicit escape sequences listed are invalid. Such events should be thrown out. Such characters are inappropriate in Text Notes.

As for HTML entity escaping, somebody has a bug because JSON is not HTML and HTML entities in TextNotes will be recognized as their raw characters, not a the HTML entity they encode. If you are writing a web client and wish to display a text note, your client should then do HTML entity encoding prior to writing it into an HTML document. But at the protocol level there is no such encoding.

fiatjaf commented 9 months ago

The Python default json.dumps() will print unicode characters as \uXXXX escape codes, but all the Nostr Python libraries have already removed that issue by calling it with ensure_ascii=False.

Another great win for simplicity and interoperability.

mleku commented 9 months ago

Unprintable characters below ASCII 32 which don't have explicit escape sequences listed are invalid. Such events should be thrown out. Such characters are inappropriate in Text Notes.

As for HTML entity escaping, somebody has a bug because JSON is not HTML and HTML entities in TextNotes will be recognized as their raw characters, not a the HTML entity they encode. If you are writing a web client and wish to display a text note, your client should then do HTML entity encoding prior to writing it into an HTML document. But at the protocol level there is no such encoding.

ah, sounds good in theory but many json encoders also html escape in addition to RFC 8259

what is the UX supposed to be for handling "invalid control characters in strings" exactly?

how about we make it clear, they are elided and ignored?

the specification says nothing about what to do with strings with control characters in it that are not named

add the line to the spec:

"All characters not listed above that are below ascii 32 (space) should be elided and no error displayed to the user."

all ASCII codes above ~ (126 iirc) or so are considered valid in JSON so nothing needs to be said about that

as for events arriving with them, yes, i agree, just ignore them, that's a broken client sending them out (or malicious)

mleku commented 9 months ago

The Python default json.dumps() will print unicode characters as \uXXXX escape codes, but all the Nostr Python libraries have already removed that issue by calling it with ensure_ascii=False.

Another great win for simplicity and interoperability.

again, specify it in the NIP

python is a 20 year old language that is notoriously slow btw i don't think it's an economic choice for anyone to use for infrastructure these days

nostr-protocol / nips

JSON serialization for deriving `event.id` is not adequately specified #354