tjson / tjson-spec

IETF-style specification and test cases for TJSON
https://www.tjson.org/spec/

Sets and equivalences #54

Open tonyg opened 6 years ago

tonyg commented 6 years ago

TJSON looks really nice! Thank you for your work on the specification thus far. I have some questions relating to TJSON's equivalence relation.

A) Is this a valid set?

{"maybe-valid-set:S<s>": ["päron", "päron"]}

Its UTF-8 encoding is as follows:

00000000: 7b22 6d61 7962 652d 7661 6c69 642d 7365  {"maybe-valid-se
00000010: 743a 533c 733e 223a 205b 2270 c3a4 726f  t:S<s>": ["p..ro
00000020: 6e22 2c20 2270 61cc 8872 6f6e 225d 7d    n", "pa..ron"]}
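For context, the two array elements are the NFC (precomposed) and NFD (decomposed) forms of the same word, so the answer hinges on whether string equality is defined codepoint-by-codepoint or after normalization. A quick Python sketch (illustration only, not from the spec) shows the two readings disagree:

```python
import unicodedata

nfc = "p\u00e4ron"    # "päron" with precomposed U+00E4 (bytes c3 a4 above)
nfd = "pa\u0308ron"   # "päron" with combining diaeresis U+0308 (bytes cc 88 above)

# Codepoint-by-codepoint comparison: the strings are distinct,
# so the set would be valid under that reading
assert nfc != nfd

# After NFC normalization they are identical, so the set would
# contain a duplicate under that reading
assert unicodedata.normalize("NFC", nfd) == nfc
```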

B) Is this a valid set?

{"maybe-valid-set:S<O>": [ {"a:A<>":[]}, {"a:A<s>":[]} ]}

C) Is this a valid set?

{"maybe-valid-set:S<O>": [ {"a:s":"m", "z:s":"n"}, {"z:s":"n", "a:s":"m"} ]}

D) Is this a valid set?

{"maybe-valid-set:S<O>": [
    {"hi:d16": "48656c6c6f2c20776f726c6421"},
    {"hi:d64": "SGVsbG8sIHdvcmxkIQ"} ]}
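(Both elements of D decode to the same byte string, `Hello, world!`, just via different binary encodings. A Python check, with padding re-added by hand since TJSON's `d64` is unpadded base64url and Python's decoder expects padding:)

```python
import base64

hex_form = "48656c6c6f2c20776f726c6421"
b64_form = "SGVsbG8sIHdvcmxkIQ"  # unpadded base64url

hex_bytes = bytes.fromhex(hex_form)
# Re-add the "=" padding that Python's decoder requires
b64_bytes = base64.urlsafe_b64decode(b64_form + "=" * (-len(b64_form) % 4))

assert hex_bytes == b64_bytes == b"Hello, world!"
```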
tarcieri commented 6 years ago

These are fantastic questions, and they call into question whether including sets as a data structure is actually even a good idea (cc @benlaurie)

To break this down into concrete issues:

A) If I understand correctly, this is about Unicode canonicalization. In similar work (e.g. objecthash) this is a "knob": implementations may selectively enable Unicode canonicalization, and I'd suggest TJSON pursue something similar. What the default should be is debatable, but I'd be in favor of canonicalizing by default. In the meantime this is unaddressed in the spec, but it probably should be, and it probably deserves its own issue.

B) In my opinion this should be rejected, as these two representations map to the same content, despite the type signatures being different

C) Should be rejected

D) Should be rejected

I think it might actually be interesting to lean on objecthash for solving this problem: if 2+ members of the set compute the same objecthash, the message should be invalid. However, I'm not sure it makes sense to make objecthash a mandatory entangling dependency of TJSON.
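The idea could be sketched as follows. Note this uses SHA-256 over sorted-key JSON as a stand-in canonical encoding, since neither objecthash nor a real TJSON canonicalization is assumed available; it is only enough to illustrate "reject the set if two members hash the same":

```python
import hashlib
import json

def member_digest(value):
    """Hash a stand-in canonical encoding of a set member.

    Sorted-key JSON is NOT a real TJSON canonicalization (it ignores
    normalization and binary-encoding equivalences); it stands in for
    objecthash purely for illustration.
    """
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def validate_set(members):
    """Reject the set if any two members share a digest."""
    digests = [member_digest(m) for m in members]
    return len(digests) == len(set(digests))

# Case C: same object with keys in a different order -> rejected
assert not validate_set([{"a:s": "m", "z:s": "n"}, {"z:s": "n", "a:s": "m"}])
assert validate_set([{"a:s": "m"}, {"a:s": "x"}])
```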

tonyg commented 6 years ago

Leaning on objecthash is definitely interesting, since a strong hash function computes equivalence classes (with high probability). Alternatively, it could be within reach to define an equivalence relation for TJSON itself. This could be the foundation of lots of other stuff; JSON lacks such a relation and it's at the root of a lot of the headaches people have with it. (Maybe you could even define a total ordering over TJSON terms! That'd be even handier.)

I'm also inclined to think B, C, and D should be invalid sets.

Regarding A, though, and unicode normalization - could it be that the right thing is to leave it to readers/writers to normalize or not? And that the equivalence for strings should be code point by code point (or as RFC 7159 sec 8.3 says, "code unit by code unit", ew)?

As an outside crazy idea: could it make sense to tag strings with an expected normalization form? Consumers could then reject and/or renormalize if a contained, tagged string did not match its declared normalization. {"fruit:s:nfc": "päron", "name:s:nfkc": "tony"}
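A consumer-side check along those lines could look like this; the `:nfc`/`:nfkc` tag syntax is only the proposal above, not part of the TJSON spec, and `unicodedata.is_normalized` requires Python 3.8+:

```python
import unicodedata

def check_tagged_string(form, s):
    """Return True if s matches its declared normalization form
    (e.g. the "nfc" in a hypothetical "fruit:s:nfc" tag); a consumer
    could reject or renormalize on False."""
    return unicodedata.is_normalized(form.upper(), s)

assert check_tagged_string("nfc", "p\u00e4ron")       # precomposed: is NFC
assert not check_tagged_string("nfc", "pa\u0308ron")  # decomposed: not NFC
```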

Finally, I want to propose a couple more cases for consideration:

E) Is this a valid set?

{"maybe-valid-set:S<O>": [
  {"meaning-of-life:i": "42"},
  {"meaning-of-life:u": "42"} ]}

F) Is this a valid set?

{"maybe-valid-set:S<O>": [
  {"meaning-of-life:i": "42"},
  {"meaning-of-life:f": 42.0} ]}
tarcieri commented 6 years ago

One thing that might massively simplify sets for now is to only allow sets of scalars. That would invalidate B-F.

Otherwise this seems like a deep rabbit hole...

tonyg commented 6 years ago

That could definitely help. The user would specify S<i>, S<f>, S<s> and so on, making terms like {"x:S<i>": [ "42", 42, 42.0 ]} ill-formed. For S<s>, picking codepoint-by-codepoint comparison still seems like the right choice to me.
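Under that scalar-only rule with codepoint-by-codepoint comparison, S<s> validation reduces to a type check plus plain string equality. A minimal sketch (illustrative, not from the spec):

```python
def valid_string_set(elements):
    """Scalar-only S<s> validation sketch: strings compare
    codepoint-by-codepoint, so membership is plain equality."""
    if not all(isinstance(e, str) for e in elements):
        return False  # e.g. {"x:S<s>": ["42", 42]} is ill-formed
    return len(elements) == len(set(elements))

# NFC and NFD "päron" differ codepoint-by-codepoint, so both may appear
assert valid_string_set(["p\u00e4ron", "pa\u0308ron"])
assert not valid_string_set(["p\u00e4ron", "p\u00e4ron"])  # duplicate
```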