Open tonyg opened 6 years ago
These are fantastic questions, and they call into question whether including sets as a data structure is even a good idea (cc @benlaurie).
To break this down into concrete issues:
A) If I understand correctly, this is about Unicode canonicalization. In similar work (e.g. objecthash) this is a "knob": implementations may selectively enable Unicode canonicalization, and in TJSON I'd suggest pursuing something similar. What the default should be is debatable, but I'd be in favor of canonicalizing by default. This is currently unaddressed in the spec; it probably should be addressed, and it probably deserves its own issue.
B) In my opinion this should be rejected, as these two representations map to the same content despite their type signatures being different.
C) Should be rejected
D) Should be rejected
I think it might actually be interesting to lean on objecthash for solving this problem: if 2+ members of the set compute the same objecthash, the message should be invalid. However, I'm not sure it makes sense to make objecthash a mandatory entangling dependency of TJSON.
Leaning on objecthash is definitely interesting, since a strong hash function computes equivalence classes (with high probability). Alternatively, it could be within reach to define an equivalence relation for TJSON itself. This could be the foundation of lots of other stuff; JSON lacks such a relation and it's at the root of a lot of the headaches people have with it. (Maybe you could even define a total ordering over TJSON terms! That'd be even handier.)
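To make the "hash as equivalence class" idea concrete, here is a minimal sketch of digest-based duplicate detection. It is not objecthash: it assumes SHA-256 over a sorted-key JSON serialization as a stand-in digest, and the names `digest` and `has_duplicate_members` are hypothetical. Whether type tags should participate in the digest (the crux of cases like B and E) is left open here.

```python
import hashlib
import json


def digest(value):
    # Stand-in canonical form (an assumption, not objecthash):
    # sorted keys, no insignificant whitespace, UTF-8 bytes.
    encoded = json.dumps(
        value, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def has_duplicate_members(members):
    # Two members are treated as equivalent when their digests collide.
    seen = set()
    for member in members:
        d = digest(member)
        if d in seen:
            return True
        seen.add(d)
    return False
```

Under this sketch a set would be rejected when `has_duplicate_members` returns `True`; key order inside objects does not matter, but any difference in the serialized values does.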
I also am inclined to think B, C and D should be invalid sets.
Regarding A, though, and unicode normalization - could it be that the right thing is to leave it to readers/writers to normalize or not? And that the equivalence for strings should be code point by code point (or as RFC 7159 sec 8.3 says, "code unit by code unit", ew)?
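A small illustration of what code-point-by-code-point equivalence implies, using Python's standard `unicodedata` module. The helper name `codepoints` is just for this sketch. The same visible string in NFC and NFD compares unequal until one side is renormalized:

```python
import unicodedata

# "päron" written with a precomposed U+00E4, then converted to both forms.
nfc = unicodedata.normalize("NFC", "p\u00e4ron")
nfd = unicodedata.normalize("NFD", "p\u00e4ron")


def codepoints(s):
    # Render each code point as U+XXXX for inspection.
    return [f"U+{ord(c):04X}" for c in s]


print(codepoints(nfc))  # precomposed: contains U+00E4
print(codepoints(nfd))  # decomposed: contains U+0061 followed by U+0308
print(nfc == nfd)       # code-point comparison treats them as different
print(unicodedata.normalize("NFC", nfd) == nfc)
```

So under pure code-point comparison, a set containing both spellings would hold two distinct members; whether that is desirable is exactly the question for case A.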
As an outside crazy idea: could tagging an expected normalization form make sense?? Consumers could then reject and/or renormalize if a contained, tagged string did not match its declared expected normalization. {"fruit:s:nfc": "päron", "name:s:nfkc": "tony"}
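A rough sketch of how a consumer might check such a tag. The `name:s:form` key syntax is only the idea floated above, not anything in the TJSON spec, and `matches_declared_form` is a hypothetical helper. It relies on `unicodedata.is_normalized`, which is available in Python 3.8 and later.

```python
import unicodedata

# Normalization forms the hypothetical tag is allowed to name.
FORMS = {"nfc", "nfd", "nfkc", "nfkd"}


def matches_declared_form(key, value):
    # Split from the right so that names containing ":" still parse.
    name, type_tag, form = key.rsplit(":", 2)
    if type_tag != "s" or form not in FORMS:
        raise ValueError(f"not a normalization-tagged string key: {key!r}")
    return unicodedata.is_normalized(form.upper(), value)
```

A consumer could reject the message when this returns `False`, or renormalize with `unicodedata.normalize` instead.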
Finally, I want to propose a couple more cases for consideration:
E) Is this a valid set?
{"maybe-valid-set:S<O>": [
{"meaning-of-life:i": "42"},
{"meaning-of-life:u": "42"} ]}
F) Is this a valid set?
{"maybe-valid-set:S<O>": [
{"meaning-of-life:i": "42"},
{"meaning-of-life:f": 42.0} ]}
One thing that might massively simplify sets for now is to only allow sets of scalars. That would invalidate B-F.
Otherwise this seems like a deep rabbit hole...
That could definitely help. The user would specify S<i>, S<f>, S<s> and so on, making terms like {"x:S<i>": [ "42", 42, 42.0 ]} ill-formed. For S<s>, picking codepoint-by-codepoint comparison still seems like the right choice to me.
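Here is a minimal sketch of what scalar-only set validation could look like, covering just S<i> and S<s>. It assumes integers are carried as JSON strings (as in the "meaning-of-life:i": "42" examples earlier in the thread) and that members have already been parsed from JSON; `validate_scalar_set` is a hypothetical name, not part of any TJSON implementation.

```python
import re

# Decimal integer with no leading zeros (an assumed syntax for S<i> members).
INT_RE = re.compile(r"-?(0|[1-9][0-9]*)")


def validate_scalar_set(element_type, members):
    if element_type == "i":
        # Every member must be a string holding a decimal integer.
        if not all(isinstance(m, str) and INT_RE.fullmatch(m) for m in members):
            return False
        keys = [int(m) for m in members]
    elif element_type == "s":
        if not all(isinstance(m, str) for m in members):
            return False
        # Python compares str values code point by code point.
        keys = list(members)
    else:
        raise ValueError(f"unsupported element type: {element_type!r}")
    # A well-formed set has no repeated members.
    return len(set(keys)) == len(keys)
```

With this, `validate_scalar_set("i", ["42", 42, 42.0])` is rejected because of the mixed member types, and two spellings of the same string in different normalization forms count as distinct members of an S<s>.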
TJSON looks really nice! Thank you for your work on the specification thus far. I have some questions relating to TJSON's equivalence relation.
A) Is this a valid set?
Its UTF-8 encoding is as follows:
B) Is this a valid set?
C) Is this a valid set?
D) Is this a valid set?