Open seagreen opened 7 years ago
The JSON spec, RFC 7159, says that all codepoints in the ASCII control character range MUST be escaped. There will be no option to not escape them.
As for number formatting options... We're still considering what to do. My preference is to include number sub-type support for 64-bit integers, but with all arithmetic and all math builtins using doubles (IEEE754), with 64-bit integers never formatted using exponents. I'm not really interested in a number formatting option because it's not likely to make such requests go away :( If we add such an option we'll find that we also need a builtin `tostring` variant that takes this option, and we'll see people who want different formatting options at different paths in their JSON -- I don't think there's a good solution.
Hey @nicowilliams, thanks for the response. Are you sure DEL has to be escaped? I don't think that's the case.
All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
DEL is all the way up at U+007F. The ABNF supports this as well:
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
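To make the rule concrete, here is a minimal Python sketch of the `unescaped` production quoted above. The function names are hypothetical and belong to no library; the last line uses Python's standard `json` module only as an independent cross-check, not as a statement about jq.

```python
import json

def is_unescaped(ch: str) -> bool:
    """True if ch may appear literally inside a JSON string, per the ABNF
    unescaped = %x20-21 / %x23-5B / %x5D-10FFFF."""
    cp = ord(ch)
    return 0x20 <= cp <= 0x21 or 0x23 <= cp <= 0x5B or 0x5D <= cp <= 0x10FFFF

def must_escape(ch: str) -> bool:
    """True for the quotation mark, the reverse solidus, and U+0000..U+001F."""
    return not is_unescaped(ch)

# U+001F must be escaped; DEL (U+007F) falls inside %x5D-10FFFF and need not be.
print(must_escape("\x1f"), must_escape("\x7f"))

# Cross-check: with ensure_ascii=False, Python's encoder emits DEL literally.
print(json.dumps("\x7f", ensure_ascii=False) == '"\x7f"')
```

An encoder is still free to escape DEL; the grammar only says it is not required to.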
As far as number formatting goes I feel your pain. What most other libraries seem to do is just have a setting for "always scientific notation / sometimes scientific for long numbers / never scientific". While I agree that people would keep asking for more (and switching settings mid-JSON seems like a recipe for messy code) I think that would do a good job of at least providing the basics.
Hmmm, oh good point. Let me double check as to `DEL`.
Indeed, `DEL` is NOT required to be escaped. I'm not sure that not escaping it is a good idea, though I suppose very few people will be having `DEL` embedded in their strings. We wouldn't add an option for that though: it either gets escaped or it doesn't, full stop.
We wouldn't add an option for that though: it either gets escaped or it doesn't, full stop.
Sounds good, it's your library, you know what set of tradeoffs you're going for.
As you investigate whether or not to escape DEL, let me know if you find anything that particularly sways you one way or another. I'm also trying to make that same decision for my JSON subset right now.
To be clear: it's @stedolan's code. @wtlangford and I maintain it. However, I know that @stedolan wants to minimize the proliferation of command-line switches, and generally would prefer if every command-line option had an equivalent way to implement it in pure jq code (obviously there are some exceptions).
As to `DEL`, my general preference would be that non-printing characters be escaped for the simple reason that a tty may not be able to display them correctly. One option might be to escape them when we're not doing raw output and stdout is a terminal, but not escape them in any other case. (E.g., `--raw-output -j` would allow one to implement a spinner using `BS` and `CR`; there's no comparable use of `DEL`, I don't think, but there may be for other Unicode non-printing characters for all I know.)
Do you have a use case for not escaping `DEL`?
My project is for machine-machine communication only, so the fewer special cases (and I suppose the fewer bytes, though I don't care about that too much) the better. For `jq`, if you have to pick one way, I would continue escaping it.
Ahh, aha.
My plan is to introduce support for a binary pseudo-type in jq. The idea is that binary data would be efficiently represented internally as byte arrays, but in jq it would be represented as arrays of numbers in the 0..255 range. If you then put jq in raw output mode and output binary then the result should be binary data on stdout (perhaps a new CLI option would be added for that; not sure).
This would work about as you expect: `tostring` would succeed if the input binary is valid UTF-8, a `tobinary` would work on string inputs and array-of-small-integers inputs, base64 decoding would produce binary, base64 encoding would accept binary and array-of-small-integers inputs, and an `isbinary` builtin would return true if the input is either binary or an array-of-small-integers.
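The semantics described above can be modelled outside jq. The following Python sketch is only an illustration of the proposal as worded in this comment: `is_binary`, `to_binary`, and `to_string` are hypothetical stand-ins for the proposed builtins, not existing jq functions, and treating Python `bytes` as the "binary" representation is an assumption.

```python
def is_binary(value) -> bool:
    """True for raw bytes or for a list of integers in the 0..255 range."""
    if isinstance(value, (bytes, bytearray)):
        return True
    return isinstance(value, list) and all(
        isinstance(n, int) and not isinstance(n, bool) and 0 <= n <= 255
        for n in value
    )

def to_binary(value) -> bytes:
    """Accept a string (encoded as UTF-8) or anything is_binary accepts."""
    if isinstance(value, str):
        return value.encode("utf-8")
    if is_binary(value):
        return bytes(value)
    raise TypeError("expected a string, bytes, or a list of integers in 0..255")

def to_string(data: bytes) -> str:
    """Succeeds only when the bytes are valid UTF-8; otherwise raises
    UnicodeDecodeError."""
    return data.decode("utf-8")

print(to_string(to_binary([104, 105])))
```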
@seagreen Do you agree that the binary thing I described above would better suit your needs in general? If so I'd like to close this as a dup of that.
@nicowilliams: You lost me a little bit there, you may have to explain it slower for me.
Are you mentioning the binary pseudo-type as something that might alleviate the need for my "subset of JSON optimized for machine-machine communication"? Unfortunately that project has to be actual valid JSON for various reasons, so it has to be defined as a sequence of code points, not as arbitrary binary.
It sounds like you're set on not having a flag to control how characters are escaped (fyi it wouldn't just be `DEL`; there are other options like whether newline is escaped with `\n` or with `\u000a`). So we've got that resolved. I actually can still use `jq` to test my project despite this, because it's a simple search and replace to change all the DELs in my output (which since it's JSON will only be in strings) into `\u007f`.
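A minimal Python sketch of that search and replace, assuming the input is already valid JSON text (so a literal DEL can only occur inside a string). The helper name `escape_del` is hypothetical.

```python
import json

def escape_del(json_text: str) -> str:
    """Replace every literal DEL (U+007F) with the six-character escape \\u007f."""
    return json_text.replace("\x7f", "\\u007f")

doc = '{"k":"a\x7fb"}'
escaped = escape_del(doc)

# Both spellings decode to the same value; only the serialized bytes differ.
print(json.loads(escaped) == json.loads(doc))
```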
The only remaining thing is my idea for a single "always scientific notation / sometimes scientific for long numbers / never scientific" setting to control number formatting. This is a blocker for using `jq` for testing, since, unlike DEL, search and replace on exponential numbers would be a slightly tricky process. You'll have to decide whether the tradeoffs of having that are worth it though.
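For illustration, here is what a "never scientific" mode could look like in Python. This is a sketch of the idea only, not jq behaviour: `format_plain` is a hypothetical helper, and it assumes a finite float (NaN and infinities are not JSON numbers and are not handled).

```python
from decimal import Decimal

def format_plain(x: float) -> str:
    """Render a finite float without exponent notation.

    repr() gives the shortest digits that round-trip; Decimal then expands
    any exponent into plain positional notation.
    """
    return format(Decimal(repr(x)), "f")

# Python's default repr switches to an exponent for this value,
# much like the behaviour reported in this issue.
print(repr(1e16))
print(format_plain(1e16))
```

A "sometimes" mode would correspond to the default `repr`, which only switches to exponents for very large or very small magnitudes.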
@seagreen Would you mind testing PR #1327? That might address your number formatting needs.
Sure thing! Turns out that PR still prints long numbers with scientific notation.
@seagreen #1327 should only use scientific notation for numbers for which it decides to use IEEE754 doubles for the internal representation. That would be: integers which do not fit in 64 bits, non-integer reals, and any numbers which are the result of arithmetic or any math functions.
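The distinction matters because a double cannot represent every 64-bit integer exactly. The Python sketch below shows the precision loss and a hypothetical classifier for number literals; `fits_int64` is not part of jq or of the PR, and the signed 64-bit range is an assumption, since the comment does not say whether signed or unsigned is meant.

```python
import re

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
INTEGER_LITERAL = re.compile(r"-?(0|[1-9][0-9]*)")

def fits_int64(literal: str) -> bool:
    """True if a JSON number literal is a plain integer (no fraction or
    exponent) that fits in a signed 64-bit integer."""
    if not INTEGER_LITERAL.fullmatch(literal):
        return False
    return INT64_MIN <= int(literal) <= INT64_MAX

# Above 2**53, adjacent integers collapse to the same double.
print(float(2**53 + 1) == float(2**53))

print(fits_int64("10000000000000000"), fits_int64("1e16"), fits_int64("1.5"))
```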
@nicowilliams: Gotchya. Serializing all those as non-scientific numbers would still be useful to my testing, but like I said it's up to you guys what options you want to add to `jq`.
Here's a link to my project in case seeing what I'm doing helps clarify my questions here. Here's the place where I use `jq` for testing: https://github.com/seagreen/Son/blob/master/implementation/test/JQ.hs. Even though I had to turn it off when I ran into the issues above, it was still massively helpful while writing the initial code!
I would like something that can strip control characters from input or simply ignore them, because I have no control over the results I'm receiving from this API.
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 2, column 14
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 2, column 14
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 2, column 14
@klikevil you mean strip ANSI control codes from "colorized" JSON etc? maybe something like this:
$ jq -nC '{"a":"hello"}' | jq -rRs 'gsub("\u001b\\[.*?m";"")' | jq
{
  "a": "hello"
}
# or
$ jq -nC '{"a":"hello"}' | jq -rRs 'gsub("\u001b\\[.*?m";"") | fromjson'
{
  "a": "hello"
}
Noticed you maybe meant removing the byte range 0x00-0x1f? Then you could try `gsub("[\u0000-\u001f]";"")` or `gsub("[[:cntrl:]]";"")`.
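Outside jq, the same two options (strip the characters, or simply tolerate them) can be sketched in Python. This assumes the input is a single JSON document, since removing newlines between several concatenated values could merge them; `strip_control_chars` is a hypothetical helper name.

```python
import json
import re

CONTROL_CHARS = re.compile(r"[\x00-\x1f]")

def strip_control_chars(raw: str) -> str:
    """Remove U+0000..U+001F from raw JSON text before parsing.

    This also removes newlines and tabs between tokens, which is harmless
    for a single JSON document because that whitespace is insignificant.
    """
    return CONTROL_CHARS.sub("", raw)

raw = '{"a": "he\x01llo"}'  # raw U+0001 inside a string: invalid JSON

# Option 1: strip the offending characters, then parse strictly.
print(json.loads(strip_control_chars(raw)))

# Option 2: keep them, using the standard library's lenient mode.
print(json.loads(raw, strict=False))
```

Option 1 changes the string contents; option 2 preserves them but accepts input that is not valid JSON.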
(Btw I'm no longer working on this project, so at least for one user you can withdraw this request)
I'm writing code to serialize JSON in a consistent way (so that the resulting JSON can be reproducibly hashed) and am testing it against `jq --compact-output --sort-keys`. This works great overall and has already helped me catch bugs in my generator.
However, there are two issues:
1. I can't control whether characters in JSON Strings are escaped or not. Particularly, I'd like to serialize `U+007F` (DEL) without escaping, but `jq` escapes it.
2. I can't control whether or not exponential notation is used for numbers. `jq` switches to exponential notation for large numbers like 10000000000000000, and in my particular case I'd like to turn that off.
Would it improve `jq` to add options for controlling the escaping of JSON String characters and the formatting of JSON Numbers?
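As a point of comparison, a consistent serialization for hashing can be sketched in Python with the standard library alone. This is not the poster's generator and makes no claim to match jq's output byte for byte; `canonical_json` and `digest` are hypothetical names, and the choice of SHA-256 is an assumption.

```python
import hashlib
import json

def canonical_json(value) -> str:
    """Serialize with sorted keys, no insignificant whitespace, and
    non-ASCII characters (and DEL) left unescaped."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)

def digest(value) -> str:
    """Hash the UTF-8 bytes of the canonical form."""
    return hashlib.sha256(canonical_json(value).encode("utf-8")).hexdigest()

# Key order in the input does not affect the serialized form or the hash.
print(canonical_json({"b": 1, "a": [1, 2]}))
print(digest({"a": 1, "b": 2}) == digest({"b": 2, "a": 1}))
```

Number formatting is not addressed here: Python, like jq, switches to exponent notation for large floats, so a reproducible scheme still has to pin that down separately.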