Closed: heartsucker closed this issue 2 years ago
Posting this publicly to serve as documentation should anyone else hit this issue.
Whitespace is not permitted between tokens in Canonical JSON. I don't think CJSON modifies the actual data in the dictionary values; they are permitted to contain anything.
Assuming that "bar\nbaz" is the literal string for the "foo" key, I think B is correct.
I'm curious, why do you claim A is correct?
I imagined this being a writer that streams bytes out, and if you pass it a String in language X (not a JSON string value), it would write the literal newline and not \n.
Also, from the definition here:
Strings are uninterpreted bytes, with only two escaped byte values.
and also
string:
    ""
    " chars "
chars:
    char
    char chars
char:
    any byte except hex 22 (") or hex 5C (\)
    \\
    \"
This means the CJSON is not a subset of JSON (despite the docs claiming it is) since a literal tab may be present in a string as with a literal newline (not \t and \n).
So the question is, when you run a string through the encode_canonical function (or whatever an impl calls it), are you streaming the bytes as they appear in a JSON string or the bytes as they appear in the memory representation?
This means the CJSON is not a subset of JSON (despite the docs claiming it is) since a literal tab may be present in a string as with a literal newline (not \t and \n).
I don't quite follow your objection to the use of the word "subset." CJSON is designed to take valid JSON as input, and modify it so that its output (canonical JSON) produces repeatable hashes. To quote the laptop.org page:
A "canonical JSON" format is provided in order to provide meaningful and repeatable hashes of JSON-encoded data.
If CJSON is given invalid JSON input, then you should expect invalid JSON as output.
Which docs? The linked page at laptop.org simply says, "The grammar for canonical JSON closely matches the grammar presented at json.org: ..." We do use the word "subset" in the TUF specification, so is this the objection? We can change that to use the wording used by laptop.org. Perhaps "restricted dialect" would be a better term to use.
So the question is, when you run a string through the encode_canonical function (or whatever an impl calls it), are you streaming the bytes as they appear in a JSON string or the bytes as they appear in the memory representation?
If there are 8 bytes in a string, the streamer should read 8 uninterpreted bytes out of it and escape any hex 22 (") or hex 5C (\) that it encounters. You are free to dump JSON in whatever format you wish, to "pretty print" it for example. However, when the hashes are generated, you should run the JSON through CJSON to produce repeatable hashes.
Here's an example using the reference implementation's encode_canonical() and Python's json module, where the " is escaped, whitespace is removed, keys are lexicographically sorted, etc.
>>> import json
>>> import securesystemslib.formats
>>> json_data = {'foo': '1"23', 'bar': '456'}
>>> securesystemslib.formats.encode_canonical(json_data)
u'{"bar":"456","foo":"1\\"23"}'
>>> json.dumps(json_data)
'{"foo": "1\\"23", "bar": "456"}'
>>>
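To illustrate why canonicalization gives repeatable hashes, here is a sketch using a hypothetical canonical_dumps() stand-in (not the securesystemslib API); json.dumps with sorted keys and tight separators agrees with encode_canonical() only for inputs containing no control characters, which is assumed here.

```python
import hashlib
import json

def canonical_dumps(obj):
    # Hypothetical stand-in for encode_canonical(): sorted keys, no
    # whitespace between tokens. Note: json.dumps also escapes control
    # characters, which real CJSON does not, so the two only agree for
    # inputs without them.
    return json.dumps(obj, sort_keys=True, separators=(',', ':'))

a = {'foo': '1"23', 'bar': '456'}
b = {'bar': '456', 'foo': '1"23'}  # same content, different insertion order

assert canonical_dumps(a) == canonical_dumps(b) == '{"bar":"456","foo":"1\\"23"}'

# Identical canonical bytes give identical digests:
digest_a = hashlib.sha256(canonical_dumps(a).encode('utf-8')).hexdigest()
digest_b = hashlib.sha256(canonical_dumps(b).encode('utf-8')).hexdigest()
assert digest_a == digest_b
```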
It would be very useful for embedded implementations if the bytes that are fed into the checksum routine can* exactly match the bytes that are sent over the wire.
There would be two advantages:
(*) CAN not MUST: this requires cooperation from the server to generate the JSON in canonical form
The CJSON definition grammar (not the human friendly text that says "CJSON is parseable with a JSON parser") is not compatible with JSON.
Further, the CJSON definition doesn't allow the sequence \n since the first char \ isn't allowed by rule 1, and rules 2 and 3 only explicitly allow \\ and \".
(edit: the quoted definition of CJSON isn't really even a grammar and should have just used ABNF, and as a result my above claim is wrong)
If we do CJSON the way it's done in securesystemslib, then the spec should redefine canonical JSON and stop referencing the OLPC wiki.
CJSON is a subset of JSON with these restrictions:
Related ticket: https://github.com/advancedtelematic/tuf-test-vectors/issues/42
So it turns out that in https://github.com/theupdateframework/tuf/issues/457#issue-234165143, the choice A actually is what securesystemslib does, and therefore the output of CJSON is definitely not JSON. See the above related ticket @cajun-rat linked to.
Additional information:
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from securesystemslib.formats import encode_canonical as cjson
>>> import json
>>> json.loads('{"foo": "bar\\nbaz"}')
{'foo': 'bar\nbaz'}
>>> json.loads(cjson(json.loads('{"foo":"bar\\nbaz"}')))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.5/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 12 (char 11)
The input to encode_canonical() must be raw JSON, which is just a dictionary in Python.
$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> from securesystemslib.formats import encode_canonical as cjson
>>> json.loads(cjson({"foo":"bar\\nbaz"}))
{u'foo': u'bar\\nbaz'}
>>>
encode_canonical() appears to generate valid JSON that can be loaded by json.loads().
In addition, the format of this example matches choice B.
The behaviour of encode_canonical() should exactly match what is stated on the OLPC wiki. If it does not, please do point out any discrepancies. Ideally, we'd like to follow the OLPC specification and not invent our own format.
The CJSON definition grammar (not the human friendly text that says "CJSON is parseable with a JSON parser") is not compatible with JSON.
How so? Feel free to link to the JSON specification for the specific violation(s), so I can better understand your objection.
Further, the CJSON definition doesn't allow the sequence \n since the first char \ isn't allowed by rule 1, and rules 2 and 3 only explicitly allow \\ and \".
I don't follow your reasoning. Is this Rule 1? "Strings are uninterpreted bytes, with only two escaped byte values."
Hmm, I understand this to mean that there are only two bytes that are escaped by CJSON: hex 22 (") and hex 5C (\). Not that the backslash is disallowed. In other words, whenever CJSON encounters hex 5C, it escapes it (\\).
cjson({"foo":"bar\\nbaz"})
Ok, so that's super non-intuitive. When you do a json.dumps on a dict, it does the escaping for you, so that puts the burden of doing JSON escaping on the caller of the function: one must JSON-escape strings, then pass them into encode_canonical().
This also doesn't disprove the point that the output of canonical JSON isn't parseable as JSON.
any byte except hex 22 (") or hex 5C (\)
That means that 0x0a (newline) is a valid byte in CJSON. This is contradictory to the JSON RFC, which says:
string = quotation-mark *char quotation-mark

char = unescaped /
    escape (
        %x22 /          ; "    quotation mark  U+0022
        %x5C /          ; \    reverse solidus U+005C
        %x2F /          ; /    solidus         U+002F
        %x62 /          ; b    backspace       U+0008
        %x66 /          ; f    form feed       U+000C
        %x6E /          ; n    line feed       U+000A
        %x72 /          ; r    carriage return U+000D
        %x74 /          ; t    tab             U+0009
        %x75 4HEXDIG )  ; uXXXX                U+XXXX

escape = %x5C           ; \

quotation-mark = %x22   ; "

unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
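Python's json module, which follows this grammar, illustrates the conflict directly: the escaped form parses, but the raw 0x0A byte that CJSON would emit does not.

```python
import json

# The escaped form is valid JSON:
assert json.loads('{"foo": "bar\\nbaz"}') == {'foo': 'bar\nbaz'}

# A literal newline inside a string (what CJSON emits) is rejected,
# because unescaped = %x20-21 / %x23-5B / %x5D-10FFFF excludes 0x0A.
raw = '{"foo": "bar\nbaz"}'
try:
    json.loads(raw)
    parsed = True
except json.JSONDecodeError:
    parsed = False
assert not parsed
```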
If, inside the TUF lib, wherever you're calling encode_canonical you are first encoding each string value as JSON, then you are double escaping.
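The double escaping is easy to reproduce with json.dumps alone:

```python
import json

s = 'Hello\nworld'          # in-memory string with a real newline
once = json.dumps(s)        # JSON-escaped once
twice = json.dumps(once)    # escaping the already-escaped JSON string

assert once == '"Hello\\nworld"'
assert twice == '"\\"Hello\\\\nworld\\""'

# Two rounds of escaping need two rounds of parsing to undo:
assert json.loads(twice) == once
assert json.loads(json.loads(twice)) == s
```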
>>> from securesystemslib.formats import encode_canonical as cjson
>>> x = 'foo\nbar'
>>> print(x)
foo
bar
>>> print(cjson(x))
"foo
bar"
>>> assert cjson(x) == '"{}"'.format(x)
>>>
If CJSON is a subset of JSON, then the following should hold:
assert json.loads(encode_canonical(some_json)) == json.loads(json.dumps(some_json))
Which is not the case.
Even still, the encode_canonical function should accept any Python object that json.dumps accepts, so long as it doesn't contain any floats. Preprocessing the input shouldn't be necessary.
This also might explain why I couldn't get the same key IDs as you guys in this comment: https://github.com/theupdateframework/tuf/issues/362#issuecomment-295322447
If CJSON is a subset of JSON, then the following should hold: assert json.loads(encode_canonical(some_json)) == json.dumps(some_json) Which is not the case.
You appear to have mixed up the loads(str) and dumps(obj) functions. json.loads(str) deserializes str (a str instance containing a JSON document) to a Python object, and json.dumps(obj) serializes obj to a JSON formatted str.
https://docs.python.org/3.3/library/json.html#json.loads
https://docs.python.org/3.3/library/json.html#json.dumps
In your example, you are comparing a dictionary to a string. I think the following code snippet is the example that you wanted:
>>> obj = {"foo":"bar\\nbaz"}
>>> json.loads(cjson(obj)) == obj
True
Here is the full code:
$ python
Python 3.5.3 (default, Apr 22 2017, 00:00:00)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> from securesystemslib.formats import encode_canonical as cjson
>>> obj = {"foo":"bar\\nbaz"}
>>> json.dumps(obj)
'{"foo": "bar\\\\nbaz"}'
>>> cjson(obj)
'{"foo":"bar\\\\nbaz"}'
>>> json.loads(cjson(obj))
{'foo': 'bar\\nbaz'}
>>> json.loads(cjson(obj)) == obj
True
@vladimir-v-diaz Yeah, I typo'd that and dropped one of the json.loads because I was pseudocoding and not doing it in the terminal. Anyway.
You're still double escaping things. If I want the following text to display:
Hello
world
I need a Python dict like this:
{ 'message': 'Hello\nworld'}
Which can be represented as JSON like this:
{ "message": "Hello\nworld" }
So if I want to canonicalize that JSON, I would do this:
import json
from securesystemslib.formats import encode_canonical
# double escaped because this is a string representing a JSON string
json_string = "{ \"message\": \"Hello\\nworld\" }"
# single escaped because this is just a string
json_dict_ref = {'message': 'Hello\nworld'}
json_loaded = json.loads(json_string)
assert json_loaded == json_dict_ref
print(json_dict_ref['message'])
#Hello
#world
print(json_loaded['message'])
#Hello
#world
json.loads(encode_canonical(json_loaded))
# raises json.decoder.JSONDecodeError: Invalid control character at: line 1 column 18 (char 17)
FWIW: this issue is apparently a result from our home-brewed implementation of canonical json:
import json
from canonicaljson import encode_canonical_json as encode_canonical
# double escaped because this is a string representing a JSON string
json_string = "{ \"message\": \"Hello\\nworld\" }"
# single escaped because this is just a string
json_dict_ref = {'message': 'Hello\nworld'}
json_loaded = json.loads(json_string)
assert json_loaded == json_dict_ref
print(json_dict_ref['message'])
#Hello
#world
print(json_loaded['message'])
#Hello
#world
json.loads(encode_canonical(json_loaded))
# {'message': 'Hello\nworld'}
The PyPI package canonicaljson is wrong. The code in securesystemslib is (I haven't fully audited it) correct.
Here is a Rust implementation: https://vtllf.org/rustdoc/canonical_json/src/canonical_json/src/ser.rs.html#1-828
const QU: u8 = b'"'; // \x22
const BS: u8 = b'\\'; // \x5C
// Lookup table of escape sequences. A value of b'x' at index i means that byte
// i is escaped as "\x" in JSON. A value of 0 means that byte i is not escaped.
#[cfg_attr(rustfmt, rustfmt_skip)]
static ESCAPE: [u8; 256] = [
// 1 2 3 4 5 6 7 8 9 A B C D E F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 1
0, 0, QU, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 2
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 3
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 4
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, BS, 0, 0, 0, // 5
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 6
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 7
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 8
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 9
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // A
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // B
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // C
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // D
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // E
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // F
];
There are exactly two escapes, and you are allowed to have completely arbitrary input strings.
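A hypothetical Python equivalent of the Rust lookup-table escaper above, operating on raw bytes per the OLPC rule:

```python
def cjson_escape(raw: bytes) -> bytes:
    # Escape a CJSON string body: exactly two escaped byte values,
    # hex 22 (") and hex 5C (\); every other byte passes through
    # untouched, including control characters like newline.
    out = bytearray()
    for b in raw:
        if b in (0x22, 0x5C):
            out.append(0x5C)  # prefix with a backslash
        out.append(b)
    return bytes(out)

assert cjson_escape(b'1"23') == b'1\\"23'
assert cjson_escape(b'bar\nbaz') == b'bar\nbaz'  # newline preserved verbatim
```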
There are two answers.
Answer 1: If we ignore the incorrect sentence in the OLPC definition and follow the grammar that is defined on the wiki, then the output is not JSON, and we continue using the OLPC definition regardless.
Answer 2: We define CJSON as:
Canonical JSON is the subset of JSON where:
There is no way to say that the output of the OLPC grammar is JSON. It is not. You are allowed to have arbitrary strings as input to the encode function, and (among other things) newlines are not escaped in the output. JSON escapes unicode as \uXXXX, but CJSON says "any byte except \ and "", which actually means that the output of CJSON doesn't even have to be unicode at all (which itself makes for a really terrible spec, since the assumption is that it has strings as input).
If we pick Answer 2 (which is the most sane), then all references to the OLPC wiki need to be removed and the securesystemslib function changed to match. In fact, you can do this with json.dumps(my_json, sort_keys=True, separators=(',', ':')) plus a recursive function that errors on floats.
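A minimal sketch of that Answer 2 encoder (encode_canonical_alt is a made-up name, and the escaping is json.dumps's JSON-subset behaviour rather than OLPC's):

```python
import json

def encode_canonical_alt(obj):
    # Hypothetical Answer-2 encoder: plain JSON with sorted keys,
    # no whitespace between tokens, and floats rejected recursively.
    def check(o):
        if isinstance(o, float):
            raise TypeError('floats are not allowed in canonical JSON')
        if isinstance(o, dict):
            for v in o.values():
                check(v)
        elif isinstance(o, (list, tuple)):
            for item in o:
                check(item)
    check(obj)
    # separators: item separator first, then key separator.
    return json.dumps(obj, sort_keys=True, separators=(',', ':'))

assert encode_canonical_alt({'b': 2, 'a': {'c': [1, 'x\ny']}}) == '{"a":{"c":[1,"x\\ny"]},"b":2}'
```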
I don't care if Answer 1 or Answer 2 becomes the spec. I just care that whatever is being used now is not fully understood by everyone and relies on an ambiguous (OLPC) grammar that is in direct contradiction to the first sentence of its own spec.
@heartsucker
That means that 0x0a (newline) is a valid byte in CJSON. This is contradictory to the JSON RFC which says:
Thanks for the explanation! OLPC's canonical JSON does allow arbitrary bytes in strings for compatibility reasons: "A previous version of this specification required strings to be valid unicode, and relied on JSON's \u escape. This was abandoned as it doesn't allow representing arbitrary binary data in a string, and it doesn't preserve the identity of non-canonical unicode strings."
Although the grammar for OLPC's canonical JSON can produce invalid JSON data if non-unicode bytes are present in strings, it nevertheless still produces valid JSON if only unicode characters are used, and the final JSON is encoded properly (e.g., UTF-8 in our case).
Please note that TUF metadata is in JSON object format, and we canonicalize the valid JSON to calculate digests and signatures.
From the TUF spec:
4.1. Metaformat
All documents use a subset of the JSON object format, with floating-point numbers omitted. When calculating the digest of an object, we use the "canonical JSON" subdialect as described at http://wiki.laptop.org/go/Canonical_JSON
Yes, both json.dumps() and the canonicaljson package appear to escape \n in strings, whereas CJSON preserves characters, with the exception of the double quote and backslash characters.
>>> json_loaded = {'message': 'Hello\nworld'}
>>> json.dumps(json_loaded)
#'{"message": "Hello\\nworld"}'
>>> from canonicaljson import encode_canonical_json as ecj
>>> ecj(json_loaded)
#'{"message":"Hello\\nworld"}'
>>> from securesystemslib.formats import encode_canonical as cjson
>>> cjson(json_loaded)
#u'{"message":"Hello\nworld"}'
Since TUF metadata is not written in canonical JSON, nor do we need to load CJSON with json.dumps() or json.loads(), this is not an issue in the reference implementation. We only use CJSON to calculate digests and signatures.
The canonicaljson module unfortunately doesn't follow all of the restrictions outlined in the OLPC Wiki.
Hey @vladimir-v-diaz. Slow response here. Since it seems that the OLPC grammar is what we want, can we change the TUF spec to not reference it and use a real ABNF to describe the CJSON grammar? This seems like the best way to avoid other people getting as bogged down with this as we did. :)
Also, good point on the unicode normalization. I hadn't thought about that and need to bring this up with the team here.
@heartsucker We can indeed change the TUF specification to include a description. However, is a real ABNF necessary? I don't see why we'd need to mention the grammar. Can't we simply say that it's standard JSON with the following restrictions:
Trailing commas in members and elements are not allowed.
This is already disallowed by both the JSON and the OLPC CJSON grammars.
its standard JSON... Only one 'escape' sequence is defined for strings, and its use is mandatory for quote and backslash.
We should use a formal grammar because from these sentences I can't tell if you've settled on the "JSON-subset" approach or the "OLPC Canonical JSON" approach. A formal grammar makes it unambiguous how things are encoded and escaped.
The JSON restrictions listed come from the OLPC page. I want to list the OLPC restrictions in the specification, and to clear up any confusion about certain parts of it (e.g., strings are unicode, etc.)
The OLPC page might explicitly mention trailing commas because certain parsers, ECMAScript 5 objects/functions, Python objects, etc. do allow it. It might be better to make it clear that trailing commas are disallowed.
The following two JSON linters seem to think that "{'hello': 'world',}"
is valid JSON:
https://codebeautify.org/jsonvalidator
https://jsonlint.com/
If you think a formal grammar is better, please share it so we can judge. I personally think saying that we use standard JSON with certain restrictions is simple and clear. If an implementor wants a grammar, they can just visit json.org.
Note: We might not even discuss JSON in the specification if we generalize the format of Metadata documents. This might just be documentation in the reference implementation if we decide to continue using this format.
From RFC-7159 section 4:
object = begin-object [ member *( value-separator member ) ]
end-object
member = string name-separator value
Trailing commas are not allowed.
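Python's strict parser agrees with the RFC grammar here:

```python
import json

# A trailing comma after the last member violates
# object = begin-object [ member *( value-separator member ) ] end-object
try:
    json.loads('{"hello": "world",}')
    accepted = True
except json.JSONDecodeError:
    accepted = False
assert not accepted

# Without the trailing comma it parses fine:
assert json.loads('{"hello": "world"}') == {'hello': 'world'}
```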
Python objects, etc. do allow it
In both cases, there is not a 1-to-1 mapping between Python dicts/arrays/etc. and JSON. It just happens that JS/Python simple types look an awful lot like JSON, and that's more an artifact of the convenience of representing data as maps, lists, strings, numbers, booleans, and nulls.
x = {
    "foo": lambda x: x + 1,
}
Could be "mapped" to JS
x = {
    foo: function(x) { return x + 1 },
}
But neither of these is JSON, and there is no meaningful way to convert either to JSON. And in fact, the trailing commas are stripped by the interpreters and don't even exist in memory.
Anyway.
When the doc changes away from JSON only, it should just include a note that the signing format needs to be deterministic with respect to ordering of elements. The implementation details don't actually matter.
When the doc changes away from JSON only, it should just include a note that the signing format needs to be deterministic with respect to ordering of elements.
Yup, this will be a helpful note to have for implementors. In practice, some formats don't always match up exactly across different applications. Sometimes a missing trailing newline can break things.
Two nits:
I'm going to close this: I think the discussion is valuable but there does not seem to be an action we can take within python-tuf. There are securesystemslib and the specification issues linked above, both seem like better places to continue the discussion.
Please open a new issue if there is a possible python-tuf improvement in this area.
We're having an internal debate and we're not sure we're using CJSON in the correct way. Let's say we want to calculate a signature over this:
So what we do is:
Should the output of the function be A or B?
My claim is A is correct, but others claim B is correct. Though in some sense it doesn't really matter which is used so long as all clients use the same one (assuming both are deterministic).