gibson042 / canonicaljson-spec

Specification of canonical-form JSON for equivalence comparison.
http://gibson042.github.io/canonicaljson-spec

Explain the difference with JCS / draft-rundgren-json-canonicalization-scheme-06 #5

Open matrey opened 5 years ago

matrey commented 5 years ago

I stumbled upon this: https://cyberphone.github.io/ietf-json-canon/ and https://tools.ietf.org/html/draft-rundgren-json-canonicalization-scheme-06

Your work is listed there, with the remark "In contrast to JCS which is a serialization scheme, the listed efforts build on text level JSON to JSON transformations".

Appendix H. Other JSON Canonicalization Efforts

There are (and have been) other efforts creating "Canonical JSON". Below is a list of URLs to some of them:

o https://tools.ietf.org/html/draft-staykov-hu-json-canonical-form-00 [7]

o https://gibson042.github.io/canonicaljson-spec/ [8]

o http://wiki.laptop.org/go/Canonical_JSON [9]

In contrast to JCS which is a serialization scheme, the listed efforts build on text level JSON to JSON transformations.

Could you explain in layman's terms what the actual differences are, if any? And which spec / library would you recommend for which usage?

simon-greatrix commented 5 years ago

A canonical JSON form should be able to represent anything JSON can represent in a unique way. Strings and numbers are complicated beasts when you dig down into their fine details and the various proposals differ in how they handle them.
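To make the problem concrete, here is a minimal sketch using Python's standard `json` module (my choice for illustration, not something any of the proposals prescribe). It shows four different JSON texts that decode to the same data, varying only in member order, whitespace, string escapes, and number spelling:

```python
import json

# Four different JSON texts that all decode to the same data.
texts = [
    '{"a": 1.0, "b": "\\u00e9"}',        # escaped non-ASCII character
    '{"b":"é","a":1.0}',                 # different member order, literal character
    '{ "a" : 10e-1, "b" : "\\u00E9" }',  # different number spelling and hex case
    '{"a":1.0,\n "b":"\\u00e9"}',        # different whitespace
]

values = [json.loads(t) for t in texts]

# Every text decodes to an equal Python value...
all_equal = all(v == values[0] for v in values)

# ...yet no two texts are byte-for-byte identical.
distinct_texts = len(set(texts))
```

A canonical form has to pick exactly one text for each such group, which is why the string and number rules carry most of the weight.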

http://wiki.laptop.org/go/Canonical_JSON avoids the complexity of floating point numbers by banning them completely. Since JSON allows floating point numbers, this fails to actually be a canonical JSON format.

https://tools.ietf.org/html/draft-rundgren-json-canonicalization-scheme-06 explicitly forbids "lone surrogates" in strings, which the JSON standard clearly allows. Hence this also fails to be a valid canonical JSON format. To explain: UTF-16 represents characters outside the Basic Multilingual Plane (such as emoji) as a pair of code units, a high surrogate followed by a low surrogate. A surrogate outside of such a pair is invalid Unicode, but it is valid JSON.
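As a small illustration of the lone-surrogate case, the following sketch uses Python's standard `json` module, which happens to accept such strings; other parsers may reject them. The escape `\ud83d` is a high surrogate with no low surrogate after it:

```python
import json

# A JSON string containing a high surrogate (U+D83D) with no low surrogate.
text = '"\\ud83d"'

value = json.loads(text)  # Python's json module accepts this input
is_lone_surrogate = len(value) == 1 and 0xD800 <= ord(value) <= 0xDBFF

# The decoded string is not valid Unicode text, so it cannot be encoded as UTF-8.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# A properly paired escape, by contrast, decodes to a single code point.
paired = json.loads('"\\ud83d\\ude00"')
```

Here the JSON text parses without complaint, but the resulting string fails at the encoding step, which is the gap between "valid JSON" and "valid Unicode" described above.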

In my opinion, there is no good reason to generate invalid Unicode. If you want to send binary data, then Base-64 produces a more compact representation than UTF-8. If you want to use surrogates as markers, then you should be using the private use characters. However, "no good reason" does not allow us to ignore the standard that says invalid Unicode is allowed in JSON.

Additionally, this proposal requires a complicated method for representing floating point numbers. Complexity leads to inaccurate implementations and fragile applications built upon them.

In contrast to the above, the canonical representation described in this project allows all valid JSON to have a valid and unique representation.

cyberphone commented 5 years ago

@simon-greatrix @matrey None of the proposals are "perfect" for the simple reason that JSON was not designed to support canonicalization. That no such proposal has become a standard (real or de facto) seems to say the same thing.

Textual canonicalization, like this scheme, is simple and covers the entire JSON specification, but has downsides when it comes to integration in JSON tools. My take on this topic (draft-rundgren-etc) "cripples" JSON to the I-JSON level but is easier to integrate, since it can ultimately be implemented entirely within a JSON serializer. Number serialization is indeed a tricky problem, but other people have done an awesome job in this area, so I'm not particularly worried about that anymore. Five different platforms currently perform identically on a set of 100 million test values.
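To illustrate what "part of a JSON serializer only" can look like, here is a rough sketch in Python. It is not a conformant JCS implementation: the function name `canonical_sketch` is hypothetical, Python sorts keys by code point whereas the JCS draft sorts by UTF-16 code units, and Python's number formatting differs from the ECMAScript rules the draft relies on. It only shows the general shape of the idea, namely parse as usual, then emit one deterministic text:

```python
import json

def canonical_sketch(value) -> bytes:
    """Serialize deterministically. Illustrative only, NOT JCS-conformant."""
    text = json.dumps(
        value,
        sort_keys=True,         # deterministic member order (code point order here)
        separators=(",", ":"),  # no whitespace between tokens
        ensure_ascii=False,     # emit non-ASCII characters literally
        allow_nan=False,        # NaN and Infinity are not JSON
    )
    return text.encode("utf-8")

# Two inputs that differ in order, whitespace, and escaping...
a = canonical_sketch({"b": [1, 2], "a": "é"})
b = canonical_sketch(json.loads('{ "a": "\\u00e9", "b": [1, 2] }'))

# ...produce the same bytes once serialized this way.
same_bytes = a == b
```

Because all the work happens at serialization time, the application keeps using ordinary parsed values; the cost is that inputs outside the supported subset (for example lone surrogates) have to be rejected.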

The only real problem I have stumbled upon is described here: https://tools.ietf.org/html/draft-rundgren-json-canonicalization-scheme-13#appendix-E. That is, some canonicalization issues spill over to the application side as well, which you of course do not want. OTOH, the remedy isn't rocket science, and the current alternative (dressing messages in Base64Url) is at least as intrusive on applications.

It is possible that Base64 is the final solution, but there are folks in the financial sector who are less keen on that. In fact, none of the Open Banking APIs use anything but clear text JSON. They do not use canonicalization, however; instead they bind to the HTTP body and put detached signature data in HTTP headers. Although this works, it greatly complicates serialization, embedding and countersigning.
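The body-binding approach can be sketched as follows. This is a simplified stand-in, not how any particular Open Banking API works: it uses an HMAC with a shared key rather than a real detached signature format, and the key, the header name `X-Signature`, and the helper `sign_body` are all hypothetical. The point is only that the signature covers the exact bytes of the body:

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # hypothetical shared key, for illustration only

def sign_body(body: bytes) -> str:
    """Return a hex HMAC-SHA256 over the exact body bytes."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

# Sender: the signature travels in a header, detached from the JSON itself.
body = b'{"amount": "10.00", "currency": "EUR"}'
headers = {"X-Signature": sign_body(body)}  # hypothetical header name

# Receiver: verification succeeds against the bytes exactly as received.
ok = hmac.compare_digest(headers["X-Signature"], sign_body(body))

# Parsing and re-serializing yields the same data but different bytes,
# so the original signature no longer verifies.
reserialized = json.dumps(json.loads(body), separators=(",", ":")).encode("utf-8")
still_ok = hmac.compare_digest(headers["X-Signature"], sign_body(reserialized))
```

Since the signature is tied to one particular byte sequence, the signed message cannot be parsed, embedded in another JSON document, and re-emitted without losing verifiability, which is the complication with embedding and countersigning mentioned above.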