gibson042 / canonicaljson-spec

Specification of canonical-form JSON for equivalence comparison.
http://gibson042.github.io/canonicaljson-spec

Hello! #13

Open zamicol opened 8 months ago

zamicol commented 8 months ago

Hello!

Thank you for publishing this project.

If you're interested in this sort of thing, you may also be interested in https://github.com/Cyphrme/Coze. Coze also performs JSON canonicalization.

I also had a question:

Why not require only the minimal escaping, U+0000 to U+001F / U+0022 / U+005C?

Thank you,

Zamicol

simon-greatrix commented 8 months ago

The escaping is minimal given that we require a valid Unicode document that can safely be interchanged between systems.

The U+D800 to U+DFFF range is used for surrogate pairs in UTF-16 to represent code points beyond U+FFFF, which need more than the 16 bits of a single UTF-16 code unit. They are code points but not characters, and hence do not appear in a valid Unicode character stream. They should only appear in UTF-16 encodings; UTF-8 and UTF-32 should not use them. (Though in the real world, they do appear in "wobbly" UTF-8 encodings.)

If a system receives a document that contains a single code point in the range U+D800 to U+DFFF, then it is certainly allowed to replace it with the U+FFFD � replacement character, as a lone surrogate does not represent a character. Python has a mechanism called "surrogateescape" where invalid UTF-8 bytes are preserved as isolated surrogates. Such a system would treat a UTF-8 input containing an isolated surrogate as three invalid bytes and expand your single code point out to three.
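For illustration, a minimal Python sketch of that behaviour (the byte sequence is just the "wobbly" UTF-8 encoding of the lone surrogate U+D800):

```python
# Invalid UTF-8 bytes survive decoding as lone surrogates under surrogateescape,
# so a single lone surrogate written in wobbly UTF-8 comes back as three code points.
wobbly = b"\xed\xa0\x80"

try:
    wobbly.decode("utf-8")                      # strict decoding rejects it
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)    # invalid continuation byte

preserved = wobbly.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in preserved])         # ['0xdced', '0xdca0', '0xdc80']

# Round-tripping with the same handler restores the original three bytes.
assert preserved.encode("utf-8", errors="surrogateescape") == wobbly
```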

Unicode also has non-characters and private-use characters. These are OK for interchange and hence we do not require them to be escaped.

I'm sure you will have noticed that the Coze canonicalization procedure specifies how fields in objects should be ordered, but not how they should be represented. Many systems lean on the IEEE algorithm for representing 64-bit floating point numbers, but this can lead to problems. For example, "1000000000000000000" is a valid 64-bit integer, and as an integer it would be output exactly as that. However, if parsed as an IEEE 64-bit double, it would be output as something like "1.0E18". This breaks digital signatures.
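As a quick Python illustration of that point (the exact rendering of the double is language dependent, "1e+18" here versus "1.0E18" in Java-style formatters):

```python
# The same JSON number serializes differently depending on whether the reader
# kept it as an integer or forced it through an IEEE 64-bit double.
import json

text = "1000000000000000000"       # a valid JSON number that fits in a 64-bit integer

as_int = json.loads(text)          # Python keeps arbitrary-precision integers
as_double = float(text)            # what a double-only parser would hold

print(json.dumps(as_int))          # 1000000000000000000
print(json.dumps(as_double))       # 1e+18 -- same value, different signed bytes
```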

zamicol commented 8 months ago

I guess the question is why escape and not error for the range U+D800 to U+DFFF? Why not just leave them, or specify that they should be replacement characters?

Escaping the 34 characters required by the RFC is legitimate. There is also another approach: OLPC escapes only the two characters that absolutely must be escaped (quotation mark and backslash) and nothing else. The appeal of that approach is that, although it is out of alignment with the JSON RFC, it is in alignment with the original JSON.org spec, and it also acknowledges an oversight in the RFC: it forgot about the remaining ASCII control character (DEL), and beyond that the rest of the Unicode control and non-printable characters. There are 33 ASCII controls and 65 Unicode controls, not the 32 the RFC covers.
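Those counts can be checked quickly with Python's unicodedata (Cc is the Unicode "control" general category):

```python
# Count the control characters: 33 in ASCII, 65 across all of Unicode.
import unicodedata

ascii_controls = [cp for cp in range(0x80) if unicodedata.category(chr(cp)) == "Cc"]
all_controls = [cp for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Cc"]

print(len(ascii_controls))   # 33  (U+0000-U+001F plus U+007F)
print(len(all_controls))     # 65  (adds the C1 range U+0080-U+009F)
```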

On "how fields in objects should be ordered, but not how they should be represented":

In Coze the digest is over bytes. For JSON, that means UTF-8 encoding first and then the JSON rules. A message can be verified by Coze and be represented differently by various systems. In fact, Coze signs and verifies digests, and that digest can be of anything; Coze just has some additional rules for messages that are JSON.

I am thinking about adding a stricter JSON canonicalization to CozeX (Coze eXtended), which yes would address representation as well. However, that's a rabbit hole I'm not confident anyone has resolved yet. I'm currently researching that problem. Coze chose its current canonicalization exactly because I wanted to avoid writing representation rules, which are much more extensive than what it currently requires. Any stricter canonicalization would probably be added to CozeX, unless there was a really good reason to add it to core, the main Coze spec.

simon-greatrix commented 8 months ago

In my opinion, a canonical JSON format needs to meet at least these goals:

1) Be valid JSON (so not OLPC).
2) Allow all possible JSON data to be represented without loss, so:
   a) no forcing of IEEE floating point, as RFC 8785 does,
   b) no changing string representation to replace characters, enforce normalization, or similar,
   c) no barring of data, like lone surrogates, as RFC 8785 requires.
3) Be platform agnostic (so no assuming, for example, that ZMODEM is not being used).
4) In some sense, be faithful to the concept of JSON.

The JSON standard does have its quirks, such as: why is U+007F (DEL) not required to be escaped when all the other 7-bit control codes are?

Escaping U+0000 to U+001F covers the obvious high-risk cases: U+0000 (NUL), which can be interpreted as a text terminator; U+000A (LF) and U+000D (CR), which can be changed by line-ending conversions such as between UNIX and Windows; and U+0009 (TAB), which can be broken by auto-reformatting. Then there are less obvious cases, such as the use of the data separator characters U+001C to U+001F to separate JSON records in a single file, and the fact that some data transmission protocols use the control characters as actual control characters.
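As a minimal sketch of the line-ending risk (Python; the sloppy serializer that emits a raw LF is hypothetical, but the rewrite is exactly what a UNIX-to-Windows conversion does):

```python
# If a serializer ever emitted a raw newline inside a string, a line-ending
# conversion would change the signed bytes; the escaped form "\n" is stable.
import json

escaped = json.dumps({"note": "line one\nline two"})   # newline escaped as \n
raw = escaped.replace("\\n", "\n")                     # hypothetical sloppy output

def unix_to_windows(text: str) -> str:
    """Simulate a transport that normalizes LF to CRLF."""
    return text.replace("\n", "\r\n")

print(unix_to_windows(escaped) == escaped)   # True  -- escaped form survives
print(unix_to_windows(raw) == raw)           # False -- raw LF was rewritten
```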

There are many other characters that could cause problems. Here are some examples:

1) The entire C1 control character group (U+0080 to U+009F).
2) The new-line characters U+0085, U+2028, and U+2029, which could be broken by new-line standardisation.
3) The DEL character, U+007F.
4) Non-characters, which some implementations consider invalid in character streams, according to the Unicode Consortium.
5) Private-use characters, which could interfere with another system's private use of the same characters.
6) Unassigned characters, which could be replaced with the replacement character U+FFFD.
7) Annotation markers, which could be stripped.
8) Explicit variation selectors which select undefined or default character variants.

All of these are valid JSON and valid Unicode, so we allow them.

Unicode allows all code points except lone surrogates to appear in interchange. Lone surrogates can still be valid data, though, as in Python's surrogate escapes. That leads us to escaping lone surrogates instead of barring them.
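A small Python sketch of what "escape instead of bar" looks like in practice:

```python
# The \uD800 escape keeps a lone surrogate representable in a JSON text,
# while a strict UTF-8 encoding of the same string rejects it outright.
import json

lone = json.loads('"\\ud800"')       # accepted: a one-character string
print(len(lone), hex(ord(lone)))     # 1 0xd800

print(json.dumps(lone))              # "\ud800" -- escaped again on output

try:
    lone.encode("utf-8")             # strict UTF-8 refuses lone surrogates
except UnicodeEncodeError as e:
    print("utf-8 encode failed:", e.reason)
```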

If we ever get a version 2 of this standard, I'd like to see the possibility of configuration, where parties can agree on additional escaping of characters to acknowledge the limits of their systems and working practices. In my own systems I would add U+007F to U+009F, U+2028 and U+2029, but no more than that, as those have all caused me concerns in real life. I'd also like a configurable limit on the number of trailing zeros on an integer, so that values like 1.23E+2 and 1.23E+2000000000 can both be handled sensibly.

In a cryptographically secure system everyone has to agree on the "bytes that were signed", so the path from data to bytes must be precise. Because the internet is what it is, you also have to handle data that breaks the rules, because someone will try to hack you. Is 1E+1000 an integer, a floating point number, or an invalid number because it doesn't fit in an IEEE 64-bit double? Is 1.0000000000000001 the same as 1? If all systems agree, you are good. If you control all the systems, it is easy to make them agree. If you think people might be building their own systems to implement your standard, you are at the mercy of their assumptions about whatever you don't specify.
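For what it's worth, here is how one widely used parser (Python's json module) answers those two questions; another implementation may answer differently, which is exactly the problem:

```python
import json

print(json.loads("1E+1000"))                    # inf -- silently overflows the double
print(json.loads("1.0000000000000001") == 1)    # True -- rounds to exactly 1.0
```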

The key benefit of this Canonical JSON specification is that every JSON document has a canonical form. All the other specifications I am aware of limit the data that can be represented.

zamicol commented 8 months ago

Thank you for your thoughtful reply!

My plan for this week is to play with control characters in jq. I'd like to better document particular issues with using non-escaped control characters.

Do you have any particular examples with non-escaped control characters, especially those beyond the first 32? I'm hoping to start playing around with DEL tonight. I suspect there are issues with that to start.
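For instance, a quick Python check (rather than jq) already shows that DEL handling is serializer-dependent:

```python
# Whether DEL (U+007F) is escaped depends entirely on serializer settings,
# so two valid encodings of the same value have different bytes.
import json

value = "before\x7fafter"                        # DEL embedded in a string

print(json.dumps(value))                         # "before\u007fafter"
print(json.dumps(value, ensure_ascii=False))     # DEL passed through raw

same = json.dumps(value) == json.dumps(value, ensure_ascii=False)
print(same)                                      # False -- same value, different bytes
```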

zamicol commented 8 months ago

I'm writing this on my phone while waiting for an appointment. I forgot to ask: why doesn't this spec require DEL and the Unicode control ranges to be escaped?

simon-greatrix commented 8 months ago

The short answer is that if we escape the characters required by JSON, and escape the lone surrogates to respect Unicode, then everything else is technically allowed.

Everything else is trying to protect humans from theoretical confusion by strangely displayed data, or protecting systems from theoretical bugs. Such things are very hard to predict. For example, who would have thought mixing Telugu and Zero-Width-Non-Joiners would brick iPhones? If we try to identify every theoretical problem, we will end up with a lot of extra work for no certain benefit. Is the cost worth it? We cannot know.

However, I think there are grounds for systems to require additional escaping.

I'd seriously consider proposing these as a potential variant to the standard as I can see specific clear risks, but I don't think they should be a required part of the standard as there is, technically, no problem with them.

If we are going to look at theoretical problems, we have a lot to consider:

They come down to two main areas of risk. The first is "humans might be confused". Why are humans looking at the JSON? Users should be looking at the user-interface and that needs to handle the weird characters. Developers, one hopes, know what they are doing.

The second risk area is "sounds like it might cause bugs". If my favourite editor strips language tags, that's not the JSON's problem. If my system uses private-use characters as control codes, is that really something we should have been predicting here? I would never have put "Telugu + ZWNJ" on my list of risks, and that just shows why this is an impossible task.

So: escape what JSON and Unicode require, allow everything else, and leave any additional escaping to agreements between specific systems.

Hope that helps.

simon-greatrix commented 8 months ago

I asked around to see if anyone knew of an example exploit using the "theoretically risky" characters. We didn't know of any, but the advice I got came from a different angle than mine: not "what should the standard say" but "what would a security professional recommend".

So, this was their response:

The precautionary approach is the way to go. Experience and history repeatedly show that something someone thinks is "overly cautious" one day becomes an exploit later.

The longer version is that removing or escaping potentially problematic characters, in any system, nearly always has the benefit of reducing the surface area for vulnerabilities, known and unknown. If you adopt this approach, you effectively lower the risk of unexpected behaviour or security issues at the expense of string veracity and processing speed, which aren't excuses not to do it. If there is a potential loss of information or change in the meaning of the string, or if processing speed is an issue now, then remember that hackers don't sit there and say "oh, these guys have servers that are too slow and code that isn't secure, so we won't target them"; quite the contrary.

With your standards approach, as best I can tell, these characters are "allowed" by both the JSON and Unicode standards. Relying solely on standards, however, assumes that all systems handling the data also strictly adhere to them, which is not always the case (now, or in the future), as history and experience again frequently show.

So my recommendation would lean towards the precautionary approach.

Your list for the super-cautious approach seems well-considered. It addresses a range of potential issues, from data corruption to security vulnerabilities, and I can't think of anything to add there.

To answer your question, I have not personally come across any instances where language tags, annotations, or variation selectors have been used in successful hacking attempts. However, the fact that I am unaware of such cases should not imply the absence of risk. Basically, you've jinxed this by discussing it, and you should now consider all of these a security risk, because if you can see a potential security risk, so might someone else.

So given the complexity of IT systems, unknown future changes to these systems, and the likelihood of encountering non-standard behaviours, I would recommend adopting the precautionary approach, which would include escaping the characters you've mentioned.

In the real world, the chosen solution generally becomes a compromise between what the standards guy (me) and the security guy (quoted) recommend.