Canonical JSON is unclear

heartsucker commented 7 years ago

We're having an internal debate and we're not sure we're using CJSON in the correct way. Let's say we want to calculate a signature over this:

{
   "quux": 123,
   "foo": "bar\nbaz"
}

So what we do is:

priv = get_my_key()
cjson = encode_canonical(load_json())
sig = sign(priv, cjson)

Should the output of the function be A or B?

A:
{"foo":"bar
baz","quux":123}

B:
{"foo":"bar\nbaz","quux":123}

My claim is A is correct, but others claim B is correct. Though in some sense it doesn't really matter which is used so long as all clients use the same (assuming both are deterministic).

heartsucker commented 7 years ago

Posting this publicly to serve as documentation should anyone else hit this issue.

vladimir-v-diaz commented 7 years ago

Whitespace is not permitted between tokens in Canonical JSON. I don't think CJSON modifies the actual data in the dictionary values; they are permitted to contain anything.

Assuming that "bar\nbaz" is the literal string for the "foo" key, I think B is correct.

I'm curious, why do you claim A is correct?

heartsucker commented 7 years ago

I imagined this being a writer that streams bytes out, and if you pass it a String in language X (not a JSON string value), it would write the literal newline and not \n.

Also, from the definition here:

Strings are uninterpreted bytes, with only two escaped byte values.

and also

string:
   ""
   " chars "
chars:
   char
   char chars
char:
   any byte except hex 22 (") or hex 5C (\)
   \\
   \"

This means the CJSON is not a subset of JSON (despite the docs claiming it is) since a literal tab may be present in a string as with a literal newline (not \t and \n).

So the question is, when you run a string through the encode_canonical (or whatever an impl calls it) function, are you stream the bytes as the appear in a JSON string or the bytes as they appear in the memory representation?

vladimir-v-diaz commented 7 years ago

This means the CJSON is not a subset of JSON (despite the docs claiming it is) since a literal tab may be present in a string as with a literal newline (not \t and \n).

I don't quite follow your objection to the use of the word "subset." CJSON is designed to take valid JSON as input, and modify it so that its output (canonical JSON) produces repeatable hashes. To quote the laptop.org page:

A "canonical JSON" format is provided in order to provide meaningful and repeatable hashes of JSON-encoded data.

If CJSON is given invalid JSON input, then you should expect invalid JSON as output.

Which docs? The linked page at laptop.org simply says, "The grammar for canonical JSON closely matches the grammar presented at json.org: ..." We do use the word "subset" in the TUF specification, so is this the objection? We can change that to use the wording used by laptop.org. Perhaps "restricted dialect" would be a better term to use.

So the question is, when you run a string through the encode_canonical (or whatever an impl calls it) function, are you stream the bytes as the appear in a JSON string or the bytes as they appear in the memory representation?

If there are 8 bytes in a string, the streamer should read 8 uninterpreted bytes out of it and escape any hex 22 (") or hex 5C () that it encounters. You are free to dump JSON in whatever format you wish, to "pretty print" it for example. However, when the hashes are generated, you should run the JSON through CJSON to produce repeatable hashes.

Here's an example using the reference implementation's encode_canonical() and Python's json module, where the " is escaped, whitespace is removed, tokens are lexicographically sorted, etc.

>>> import json
>>> import securesystemslib.formats

>>> json_data = {'foo':     '1"23', 'bar': '456'}

>>> securesystemslib.formats.encode_canonical(json_data)
u'{"bar":"456","foo":"1\\"23"}'

>>> json.dumps(json_data)
'{"foo": "1\\"23", "bar": "456"}'
>>>

cajun-rat commented 7 years ago

It would be very useful for embedded implementations if the bytes that are fed into the checksum routine can* exactly match the bytes that are sent over the wire.

There would be two advantages:

Less code touches the untrusted data. The secure json parser now only has to be able to pull out the start and end positions of the 'signed' section and extract the signature. The full JSON parser only gets invoked after the signature check, and can be removed from the TCB.
The checksum can be calculated without requiring RAM for a second copy of the data (or building a complex and security-critical streaming JSON -> cjson converter).

(*) CAN not MUST: this requires cooperation from the server to generate the JSON in canonical form

heartsucker commented 7 years ago

The CJSON definition grammar (not the human friendly text that says "CJSON is parseable with a JSON parser") is not compatible with JSON.

Further, the CJSON definition doesn't allow the sequence \n since the first char \ isn't allow by rule 1, and rules 2 and 3 only explicitly allow \\ and \".

(edit: the quote definition of CJSON isn't really even a grammar and should have just used ABNF, and as a result my above claim is wrong)

If we do CJSON the way its done in securesystemslib, then the spec should redefine canonical JSON and stop referencing the OLPC wiki.

Proposed Definition

CJSON is a subset of JSON with these restrictions:

No whitespace is allowed outside strings
No floats, only ints
Keys must be sorted according to the bytes that make up the strings (not English / alphabetical)

cajun-rat commented 7 years ago

heartsucker commented 7 years ago

So it turns out that in https://github.com/theupdateframework/tuf/issues/457#issue-234165143, the choice A actually is what securesystemslib does, and therefore the output of CJSON is definitely not JSON. See the above related ticket @cajun-rat linked to.

heartsucker commented 7 years ago

Additional information:

Python 3.5.3 (default, Jan 19 2017, 14:11:04) 
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from securesystemslib.formats import encode_canonical as cjson
>>> import json
>>> json.loads('{"foo": "bar\\nbaz"}')
{'foo': 'bar\nbaz'}
>>> json.loads(cjson(json.loads('{"foo":"bar\\nbaz"}')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 12 (char 11)

vladimir-v-diaz commented 7 years ago

The input to encode_canonical() must be raw JSON, which is just a dictionary in Python.

$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> from securesystemslib.formats import encode_canonical as cjson
>>> json.loads(cjson({"foo":"bar\\nbaz"}))
{u'foo': u'bar\\nbaz'}
>>>

encode_canonical() appears to generate valid JSON that can be loaded by json.loads() In addition, the format of this example matches choice B.

The behaviour of encode_canonical() should exactly match what is stated on the OLPC wiki. If it does not, please do point out any discrepancies. Ideally, we'd like to follow the OLPC specification and not invent out own format.

The CJSON definition grammar (not the human friendly text that says "CJSON is parseable with a JSON parser") is not compatible with JSON.

How so? Feel free to link to the JSON specification for the specific violation(s), so I can better understand your objection.

Further, the CJSON definition doesn't allow the sequence \n since the first char \ isn't allow by rule 1, and rules 2 and 3 only explicitly allow \ and \".

I don't follow your reasoning. Is this Rule 1? "Strings are uninterpreted bytes, with only two escaped byte values."

Hmm, I understand this to mean that there are only two bytes that are escaped by CJSON: hex 22 (") and hex 5C (). Not that the backslash is disallowed. In other words, whenever CJSON encounters hex 5C, it escapes it (\\).

heartsucker commented 7 years ago

cjson({"foo":"bar\\nbaz"}

Ok, so that's super non-intuitive. When you do a json.dumps on a dict, it does the escaping for you, so that puts the burden on doing JSON escaping on the caller of the function so that one must JSON-escape strings, then pass them into encode_canonical.

This also doesn't disprove the point that the output of canonical JSON isn't parseable as JSON.

any byte except hex 22 (") or hex 5C (\)

That means that 0x0a (newline) is a valid byte in CJSON. This is contradictory to the JSON RFC which says:

string = quotation-mark *char quotation-mark
char = unescaped /
   escape (
       %x22 /          ; "    quotation mark  U+0022
       %x5C /          ; \    reverse solidus U+005C
       %x2F /          ; /    solidus         U+002F
       %x62 /          ; b    backspace       U+0008
       %x66 /          ; f    form feed       U+000C
       %x6E /          ; n    line feed       U+000A
       %x72 /          ; r    carriage return U+000D
       %x74 /          ; t    tab             U+0009
       %x75 4HEXDIG )  ; uXXXX                U+XXXX
escape = %x5C              ; \
quotation-mark = %x22      ; "
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

If inside the TUF lib wherever you're calling encode_canonical you are first encoding each string value as JSON, then you are double escaping.

>>> from securesystemslib.formats import encode_canonical as cjson
>>> x = 'foo\nbar'
>>> print(x)
foo
bar
>>> print(cjson(x))
"foo
bar"
>>> assert cjson(x) == '"{}"'.format(x)
>>>

If CJSON is a subset of JSON, then the following should hold:

assert json.loads(encode_canonical(some_json)) == json.loads(json.dumps(some_json))

Which is not the case.

Even still, the encode_canonical function should accept any python object that json.dumps accepts so long as it doesn't contain any floats. Preprocessing the input shouldn't be necessary.

heartsucker commented 7 years ago

This also might explain why I couldn't get the same key IDs as you guys in this comment: https://github.com/theupdateframework/tuf/issues/362#issuecomment-295322447

vladimir-v-diaz commented 7 years ago

If CJSON is a subset of JSON, then the following should hold: assert json.loads(encode_canonical(some_json)) == json.dumps(some_json) Which is not the case.

You appear to have mixed up the loads(str) and dumps(obj) functions. json.loads(str) deserializes str (a str instance containing a JSON document) to a Python object, and json.dumps(obj) serializes obj to a JSON formatted str. https://docs.python.org/3.3/library/json.html#json.loads https://docs.python.org/3.3/library/json.html#json.dumps

In your example, you are comparing a dictionary to a string. I think the following code snippet is the example that you wanted:

>>> obj = {"foo":"bar\\nbaz"}
>>> json.loads(cjson(obj)) == obj
True

Here is the full code:

$ python
Python 3.5.3 (default, Apr 22 2017, 00:00:00) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import json
>>> from securesystemslib.formats import encode_canonical as cjson

>>> obj = {"foo":"bar\\nbaz"}
>>> json.dumps(obj)
'{"foo": "bar\\\\nbaz"}'
>>> cjson(obj)
'{"foo":"bar\\\\nbaz"}'
>>> json.loads(cjson(obj))
{'foo': 'bar\\nbaz'}
>>> json.loads(cjson(obj)) == obj
True

heartsucker commented 7 years ago

@vladimir-v-diaz Yeah, I typo'd that and dropped one of the json.loads because I was pseudocoding and not doing it in the terminal. Anyway.

You're still double escaping things. If I want to the following text to display:

Hello
world

I needs a Python dict like this:

{ 'message': 'Hello\nworld'}

Which can be represented as JSON like this:

{ "message": "Hello\nworld" }

So if I want to canonicalize that JSON, I would do this:

import json
from securesystemslib.formats import encode_canonical
# double escaped because this is a string representing a JSON string
json_string = "{ \"message\": \"Hello\\nworld\" }"
# single escaped because this is just a string 
json_dict_ref = {'message': 'Hello\nworld'}
json_loaded = json.loads(json_string)
assert json_loaded == json_dict_ref

print(json_dict_ref['message'])
#Hello
#world

print(json_loaded['message'])
#Hello
#world

json.loads(encode_canonical(json_loaded))
# raises json.decoder.JSONDecodeError: Invalid control character at: line 1 column 18 (char 17)

SantiagoTorres commented 7 years ago

FWIW: this issue is apparently a result from our home-brewed implementation of canonical json:

import json
from canonicaljson import encode_canonical_json as encode_canonical
# double escaped because this is a string representing a JSON string
json_string = "{ \"message\": \"Hello\\nworld\" }"
# single escaped because this is just a string 
json_dict_ref = {'message': 'Hello\nworld'}
json_loaded = json.loads(json_string)
assert json_loaded == json_dict_ref

print(json_dict_ref['message'])
#Hello
#world

print(json_loaded['message'])
#Hello
#world

json.loads(encode_canonical(json_loaded))
# {'message': 'Hello\nworld'}

heartsucker commented 7 years ago

The PyPI package canonical_json is wrong. The code in securesystemslib is (I haven't fully audited it) correct.

Here is a Rust implementation: https://vtllf.org/rustdoc/canonical_json/src/canonical_json/src/ser.rs.html#1-828

const QU: u8 = b'"';  // \x22
const BS: u8 = b'\\'; // \x5C

// Lookup table of escape sequences. A value of b'x' at index i means that byte
// i is escaped as "\x" in JSON. A value of 0 means that byte i is not escaped.
#[cfg_attr(rustfmt, rustfmt_skip)]
static ESCAPE: [u8; 256] = [
    //  1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // 0
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // 1
    0,  0, QU,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // 2
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // 3
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // 4
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, BS,  0,  0,  0, // 5
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // 6
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // 7
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // 8
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // 9
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // A
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // B
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // C
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // D
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // E
    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, // F
];

There are exactly two escapes, and you are allowed to have completely arbitrary input strings.

There are two answers.

Answer 1

If we ignore the incorrect sentence in the OLPC definition and follow the grammar that is defined on the wiki, then the output is not JSON, and we continue using the OLPC definition regardless.

Answer 2

We define CJSON as:

Canonical JSON is the subset of JSON where:

Floats are not allowed
There is no whitespace outside of strings
Keys in objects are ordered by their byte representations

There is no way to say that the output of the OLPC grammar is JSON. It is not. You are allowed to have arbitrary strings as input to the encode function and (among other things) newlines are not escaped in the output. JSON escapes unicode as \uXXX, but CJSON says "any byte except \ and " which actually means that the output of CJSON doesn't even have to be unicode at all (which itself makes for a really terrible spec since the assumption is that it has strings as input).

If we pick Answer 2 (which is the most sane) then all references to the OLPC wiki need to be removed and the securesystemslib have it's function changed to match. In fact, you can do this with json.dumps(my_json, sort_keys=True, separators=(':', ',')) plus a recursive function that errors on floats.

I don't care if Answer 1 or Answer 2 becomes the spec. I just care that whatever is being used now is not fully understood by everyone and relies on (the OLPC) ambiguous grammar that is in direct contradiction to the first sentence of it's own spec.

vladimir-v-diaz commented 7 years ago

@heartsucker

That means that 0x0a (newline) is a valid byte in CJSON. This is contradictory to the JSON RFC which says:

Thanks for the explanation! OLPC's canonical JSON does allow arbitrary bytes in strings for compatibility reasons: "A previous version of this specification required strings to be valid unicode, and relied on JSON's \u escape. This was abandoned as it doesn't allow representing arbitrary binary data in a string, and it doesn't preserve the identity of non-canonical unicode strings."

Although the grammar for OLPC's canonical JSON can produce invalid JSON data if non-unicode bytes are present in strings, it nevertheless still produces valid JSON if only unicode characters are used, and the final JSON is encoded properly (e.g., UTF-8 in our case).

Please note that TUF metadata is in JSON object format, and we canonicalize the valid JSON to calculate digests and signatures.

From the TUF spec:

4.1. Metaformat

All documents use a subset of the JSON object format, with floating-point numbers omitted. When calculating the digest of an object, we use the "canonical JSON" subdialect as described at http://wiki.laptop.org/go/Canonical_JSON

vladimir-v-diaz commented 7 years ago

Yes, both json.dumps() and the canonicaljson package appear to escape \n in strings, whereas CJSON preserves characters, with the exception of the single quote and backslash characters.

>>> json_loaded = {'message': 'Hello\nworld'}
>>> json.dumps(json_loaded)
#'{"message": "Hello\\nworld"}'

>>> from canonicaljson import encode_canonical_json as ecj
>>> ecj(json_loaded)
#'{"message":"Hello\\nworld"}'

>>> from securesystemslib.formats import encode_canonical as cjson
>>> cjson(json_loaded)
#u'{"message":"Hello\nworld"}'

Since TUF metadata is not written in canonical JSON nor needs to load CJSON with json.dumps() or json.loads(), this is not an issue in the reference implementation. We only use CJSON to calculate digests and signatures.

The canonicaljson module unfortunately doesn't follow all of the restrictions outlined in the OLPC Wiki.

heartsucker commented 7 years ago

Hey @vladimir-v-diaz. Slow response here. Since it seems that the OLPC grammar is what we want, can we change the TUF spec to not reference it and use a real ABNF to describe the CJSON grammar? This seems like the best way to avoid other people getting as bogged down with this as we did. :)

Also, good point on the unicode normalization. I hadn't thought about that and need to bring this up with the team here.

vladimir-v-diaz commented 7 years ago

@heartsucker We can indeed change the TUF specification to include a description. However, is a real ABNF necessary? I don't see why we'd need to mention the grammar. Can't we simply say that its standard JSON with the following restrictions:

Floating point numbers are not allowed in JSON. Neither are leading zeros or "minus 0" for integers.
All map keys must be quoted, and must appear in sorted order.
All whitespace is eliminated.
Trailing commas in members and elements are not allowed.
Only one 'escape' sequence is defined for strings, and its use is mandatory for quote and backslash.
Keys must be in lexicographically sorted order.
etc.

heartsucker commented 7 years ago

Trailing commas in members and elements are not allowed.

This already is neither allowed by the JSON nor the OLPC CJSON grammars.

its standard JSON... Only one 'escape' sequence is defined for strings, and its use is mandatory for quote and backslash.

We should use a formal grammar because from these sentences I can't tell if you've settled on the "JSON-subset" approach of the "OLPC Canonical JSON" approach. A formal grammar makes it unambiguous how things are encoded and escaped.

vladimir-v-diaz commented 7 years ago

The JSON restrictions listed come from the OLPC page. I want to list the OLPC restrictions in the specification, and to clear up any confusion about certain parts of it (e.g., strings are unicode, etc.)

The OLPC page might explicitly mention trailing commas because certain parsers, ECMAScript 5 objects/functions, Python objects, etc. do allow it. It might be better to make it clear that trailing commas are disallowed.

The following two JSON linters seem to think that "{'hello': 'world',}" is valid JSON: https://codebeautify.org/jsonvalidator https://jsonlint.com/

If you think a formal grammar is better, please share it so we can judge. I personally think saying that we use standard JSON with certain restrictions is simple and clear. If an implementor wants a grammar, they can just visit json.org.

vladimir-v-diaz commented 7 years ago

Note: We might not even discuss JSON in the specification if we generalize the format of Metadata documents. This might just be documentation in the reference implementation if we decide to continue using this format.

heartsucker commented 7 years ago

From RFC-7159 section 4:

object = begin-object [ member *( value-separator member ) ]
         end-object
member = string name-separator value

Trailing commas are not allowed.

Python objects, etc. do allow it

In both cases, there is not a 1-to-1 mapping between Python dicts/arrays/etc. and JSON. It just happens to be that JS/Python simple types look a whole awful lot like JSON, and that's more an artifact of the convenience of representing data as maps, lists, strings, numbers, booleans, and nulls.

x = {
  "foo": lambda x: x + 1,
}

Could be "mapped" to JS

x = {
  foo: function(x) { return x + 1 },
}

But both of these are not JSON, and there is no meaningful way to convert either to JSON. And in fact, the trailing commas are stripped by the interpreters and don't even exist in memory.

Anyway.

When the doc changes away from JSON only, it should just include a note that the signing format needs to be deterministic with respect to ordering of elements. The implementation details don't actually matter.

vladimir-v-diaz commented 7 years ago

When the doc changes away from JSON only, it should just include a note that the signing format needs to be deterministic with respect to ordering of elements.

Yup, this will a be helpful note to have for implementors. In practice, some formats don't always match up exactly across different applications. Sometimes a missing trailing newline can break things.

alexshpilkin commented 6 years ago

Two nits:

if you go the “JSON subset” route, you might need to define keys not only be sorted, but to be unique as well: the JSON spec is deliberately vague about duplicate keys in objects—in particular, it does not prohibit them;
trailing newlines: note that POSIX defines a “text file” to end with a newline, and e.g. every standard POSIX utility (grep, sed, ...) is allowed to break on files that don’t end with one—if you write CJSON to files, you might want to specify it to include a single final newline.

jku commented 2 years ago

I'm going to close this: I think the discussion is valuable but there does not seem to be an action we can take within python-tuf. There are securesystemslib and the specification issues linked above, both seem like better places to continue the discussion.

Please open a new issue if there is a possible python-tuf improvement in this area.

theupdateframework / python-tuf