matrix-org / matrix-spec

The Matrix protocol specification
Apache License 2.0

Behaviours of Canonical JSON not thoroughly documented #1245

Open neilalexander opened 2 years ago

neilalexander commented 2 years ago

Right now the spec provides a Python snippet to implement Canonical JSON:

import json

def canonical_json(value):
    return json.dumps(
        value,
        # Encode code-points outside of ASCII as UTF-8 rather than \u escapes
        ensure_ascii=False,
        # Remove unnecessary white space.
        separators=(',',':'),
        # Sort the keys of dictionaries.
        sort_keys=True,
        # Encode the resulting Unicode as UTF-8 bytes.
    ).encode("UTF-8")

This doesn't adequately document the actual behaviours, but instead has led us into a situation where the Python implementation is the only "correct" one.
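To make the snippet's behaviour concrete, here is a small sketch that runs it on a sample value. The function body is the one quoted above, repeated so the example runs on its own; the sample input is invented for illustration.

```python
import json

def canonical_json(value):
    # Same options as the spec's snippet, repeated so this runs standalone.
    return json.dumps(
        value,
        ensure_ascii=False,
        separators=(',', ':'),
        sort_keys=True,
    ).encode("UTF-8")

# Hypothetical sample input: unsorted keys, a nested object, non-ASCII text.
sample = {"b": 2, "a": {"z": "é", "y": [1, 2]}}
encoded = canonical_json(sample)
print(encoded)  # b'{"a":{"y":[1,2],"z":"\xc3\xa9"},"b":2}'
```

Keys are sorted at every nesting level, all optional whitespace is gone, and "é" is written as its two UTF-8 bytes rather than as a \u escape. Everything beyond that is whatever Python's json module happens to do, which is the problem this issue describes.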

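Needs clarity to explain at least:

  • which numeric formats are appropriate to appear on the wire (i.e. should scientific notation like 1e9 ever appear?)
  • upper and lower bounds of all numeric values for both pre- and post-v6 rooms
  • whether or not implementations should be expected to use IEEE 754 for floats, given they can appear in some old rooms
  • how to handle unicode (escaping, UTF-16 surrogates, etc.)
  • precedence order of duplicate keys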

(created from #1232)

richvdh commented 2 years ago

link to the relevant bit of spec: https://spec.matrix.org/v1.3/appendices/#canonical-json

  • which numeric formats are appropriate to appear on the wire (i.e. should scientific notation like 1e9 ever appear?)

No. I think this is implied by "Numbers in the JSON must be integers...", and certainly Synapse's behaviour here is consistent, but the spec could be more explicit. PR to clarify this would be appreciated.
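As a rough illustration of why the integer-only rule matters, the sketch below shows what Python's json module (the encoder used by the spec's snippet) writes for integers versus floats. This is only the observable behaviour of the Python standard library, not a statement of what the spec requires; the variable names are made up for the example.

```python
import json

# Integers are written as plain decimal digits.
int_text = json.dumps(1000000000)            # '1000000000'

# Floats use Python's float repr: a fractional part is kept, and large
# magnitudes switch to exponent notation.
float_text = json.dumps(1e9)                 # '1000000000.0'
big_float_text = json.dumps(1e22)            # '1e+22'

# Input written in exponent form is parsed as a float, so "1e9" does not
# come back out as the integer 1000000000.
round_trip = json.dumps(json.loads("1e9"))   # '1000000000.0'

print(int_text, float_text, big_float_text, round_trip)
```

So scientific notation can only reach the wire through this encoder if a float is passed in, which the integer-only rule already forbids.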

  • upper and lower bounds of all numeric values for both pre- and post-v6 rooms

Pre-v6 is #1244. Post-v6 is, I think, clearly specced by "Numbers in the JSON must be integers in the range [-(2**53)+1, (2**53)-1]."
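A minimal sketch of how an implementation might enforce the quoted rule before serialising, assuming values have already been parsed into Python objects. The helper name `check_numbers` and the constant names are hypothetical and do not come from the spec or from Synapse.

```python
# Bounds taken from the quoted sentence: [-(2**53)+1, (2**53)-1].
MIN_CANONICAL_INT = -(2**53) + 1
MAX_CANONICAL_INT = 2**53 - 1

def check_numbers(value):
    """Raise ValueError if value contains a float or an out-of-range integer."""
    # bool is a subclass of int in Python, so rule it out before the int check.
    if value is None or isinstance(value, (bool, str)):
        return
    if isinstance(value, int):
        if not MIN_CANONICAL_INT <= value <= MAX_CANONICAL_INT:
            raise ValueError(f"integer out of range: {value}")
        return
    if isinstance(value, float):
        raise ValueError(f"non-integer number: {value}")
    if isinstance(value, dict):
        for item in value.values():
            check_numbers(item)
        return
    if isinstance(value, (list, tuple)):
        for item in value:
            check_numbers(item)
        return
    raise TypeError(f"unsupported type: {type(value).__name__}")

check_numbers({"depth": 2**53 - 1, "nested": [0, -(2**53) + 1]})  # passes silently
```

This sketch rejects every float, including ones like 1.0 that happen to hold an integral value. Whether that is the right call is an assumption made here, not something the quoted text settles.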

  • whether or not implementations should be expected to use IEEE 754 for floats, given they can appear in some old rooms

Again, old rooms are #1244.

  • how to handle unicode (escaping, UTF-16 surrogates, etc.)

Yeah, this definitely needs clarifying. The only relevant text at the moment is in the Python snippet: "Encode code-points outside of ASCII as UTF-8 rather than \u escapes". The presence of the Python snippet means that the behaviour is well-defined; it's just not defined in a way that is helpful for anyone not writing Python. So, a PR to fix this would be very helpful.
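For anyone not writing Python, the sketch below spells out what the snippet's options actually do to a few representative strings. It documents the behaviour of Python's standard json module only; the helper `encode_string` and the sample strings are invented for this example.

```python
import json

def encode_string(text):
    # Same string handling as the spec's snippet, applied to one string.
    return json.dumps(text, ensure_ascii=False).encode("UTF-8")

# Non-ASCII code points are written as raw UTF-8, not \u escapes.
non_ascii = encode_string("é")            # b'"\xc3\xa9"'

# Code points above U+FFFF become one 4-byte UTF-8 sequence,
# not an escaped UTF-16 surrogate pair.
astral = encode_string("\U0001F600")      # b'"\xf0\x9f\x98\x80"'

# Control characters are still escaped: short forms where JSON has one,
# \u00XX otherwise.
newline = encode_string("\n")             # b'"\\n"'
control = encode_string("\x01")           # b'"\\u0001"'

# A lone surrogate cannot be encoded as UTF-8, so the final .encode() fails.
try:
    encode_string("\ud800")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False

print(non_ascii, astral, newline, control, lone_surrogate_ok)
```

Each of these cases is a decision that a prose definition would have to state explicitly, since other languages' JSON encoders do not necessarily make the same choices.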

  • precedence order of duplicate keys

I think this is #1246?
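To show why precedence needs pinning down, here is a short sketch of what Python's json module does with a repeated key, plus one way a parser could refuse such input instead. The helper `reject_duplicates` is hypothetical, and nothing here is a claim about what the spec mandates.

```python
import json

doc = '{"a": 1, "a": 2}'

# Default behaviour in Python: the last occurrence of a key silently wins.
last_wins = json.loads(doc)   # {'a': 2}

def reject_duplicates(pairs):
    """object_pairs_hook that raises on a repeated key (hypothetical helper)."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result

# Stricter alternative: treat a repeated key as an error.
try:
    json.loads(doc, object_pairs_hook=reject_duplicates)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(last_wins, duplicate_rejected)
```

Parsers in other languages may keep the first value or reject the object outright, so two implementations can disagree on the canonical form of the same input unless the behaviour is specified.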