jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

Control formatting of numbers and escape characters #1363

Open seagreen opened 7 years ago

seagreen commented 7 years ago

I'm writing code to serialize JSON in a consistent way (so that the resulting JSON can be reproducibly hashed) and am testing it against jq --compact-output --sort-keys. This works great overall and has already helped me catch bugs in my generator.
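For readers who want to try the same idea outside jq, here is a minimal sketch in Python using only the standard `json` and `hashlib` modules. The names `canonical_json` and `digest` are hypothetical. The sketch mirrors only the key sorting and compact separators of `jq --compact-output --sort-keys`; it makes no claim to match jq's number or escape formatting.

```python
import hashlib
import json


def canonical_json(value):
    """Serialize with sorted keys and no insignificant whitespace."""
    # ensure_ascii=False keeps non-ASCII characters literal, so the bytes
    # hashed below are simply the UTF-8 encoding of the text.
    return json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


def digest(value):
    """SHA-256 over the UTF-8 bytes of the canonical form."""
    return hashlib.sha256(canonical_json(value).encode("utf-8")).hexdigest()


# Two objects that differ only in key order.
a = {"b": 1, "a": [1, 2, {"d": None, "c": "x"}]}
b = {"a": [1, 2, {"c": "x", "d": None}], "b": 1}

print(canonical_json(a))
print(digest(a) == digest(b))
```

Key order no longer affects the hash, but number formatting and optional escapes still can, which is what the rest of this thread is about.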

However, there are two issues:

Would it improve jq to add options for controlling the escaping of JSON String characters and the formatting of JSON Numbers?

nicowilliams commented 7 years ago

The JSON spec, RFC 7159, says that all codepoints in the ASCII control character range MUST be escaped. There will be no option to not escape them.

As for number formatting options... We're still considering what to do. My preference is to include number sub-type support for 64-bit integers, but with all arithmetic and all math builtins using doubles (IEEE754), with 64-bit integers never formatted using exponents. I'm not really interested in a number formatting option because it's not likely to make such requests go away :( If we add such an option we'll find that we also need a builtin tostring variant that takes this option, and we'll see people who want different formatting options at different paths in their JSON -- I don't think there's a good solution.

seagreen commented 7 years ago

Hey @nicowilliams, thanks for the response. Are you sure DEL has to be escaped? I don't think that's the case.

All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

DEL is all the way up at U+007F. The ABNF supports this as well:

unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
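The ABNF rule can be checked mechanically. The following is a small sketch, with `must_escape` as a hypothetical helper name, that transcribes the three ranges and confirms DEL falls inside the unescaped set. It also shows that Python's standard `json` encoder escapes DEL only in its default ASCII-only mode.

```python
import json


def must_escape(ch):
    """True if the character is outside the ABNF's 'unescaped' ranges.

    unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
    """
    cp = ord(ch)
    unescaped = (0x20 <= cp <= 0x21
                 or 0x23 <= cp <= 0x5B
                 or 0x5D <= cp <= 0x10FFFF)
    return not unescaped


print(must_escape("\x1f"))   # last C0 control character
print(must_escape("\x7f"))   # DEL
print(must_escape('"'), must_escape("\\"))

# With ensure_ascii=False the encoder escapes only what JSON requires,
# so DEL is written literally; the default mode escapes all non-printable ASCII.
print(json.dumps("\x7f", ensure_ascii=False) == '"\x7f"')
print(json.dumps("\x7f"))
```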

As far as number formatting goes I feel your pain. What most other libraries seem to do is just have a setting for "always scientific notation / sometimes scientific for long numbers / never scientific". While I agree that people would keep asking for more (and switching settings mid-JSON seems like a recipe for messy code) I think that would do a good job of at least providing the basics.
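To make the three-way setting concrete, here is a rough sketch in Python. It is not a jq feature, and `format_number` and its mode names are hypothetical. It assumes finite floats only, since NaN and infinity are not valid JSON numbers.

```python
from decimal import Decimal


def format_number(x, mode="sometimes"):
    """Format a finite float under one of three notation policies."""
    if mode == "sometimes":
        # Python's repr switches to exponent form for very large or
        # very small magnitudes and uses plain digits otherwise.
        return repr(x)
    # Parse the shortest round-trip digits, not the exact binary value.
    d = Decimal(repr(x))
    if mode == "never":
        return format(d, "f")
    if mode == "always":
        return format(d, "e")
    raise ValueError(f"unknown mode: {mode}")


for value in (1e17, 1.5e-7, 0.1):
    print(format_number(value), format_number(value, "never"))
```

Even this small sketch leaves choices open, such as the exponent threshold or how integer-valued floats are printed, which fits the concern that a single option would not end the requests.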

nicowilliams commented 7 years ago

Hmmm, oh good point. Let me double check as to DEL.

nicowilliams commented 7 years ago

Indeed, DEL is NOT required to be escaped. I'm not sure that not escaping it is a good idea, though I suppose very few people will be having DEL embedded in their strings. We wouldn't add an option for that though: it either gets escaped or it doesn't, full stop.

seagreen commented 7 years ago

"We wouldn't add an option for that though: it either gets escaped or it doesn't, full stop."

Sounds good, it's your library, you know what set of tradeoffs you're going for.

As you investigate whether or not to escape DEL, let me know if you find anything that particularly sways you one way or another. I'm also trying to make that same decision for my JSON subset right now.

nicowilliams commented 7 years ago

To be clear: it's @stedolan's code. @wtlangford and I maintain it. However, I know that @stedolan wants to minimize the proliferation of command-line switches, and generally would prefer if every command-line option had an equivalent way to implement it in pure jq code (obviously there are some exceptions).

As to DEL, my general preference would be that non-printing characters be escaped, for the simple reason that a tty may not be able to display them correctly. One option might be to escape them when we're not doing raw output and the output is a terminal, but not in any other case. (E.g., --raw-output -j would allow one to implement a spinner using BS and CR. I don't think there's a comparable use of DEL, but there may be for other Unicode non-printing characters, for all I know.)
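The "escape only when writing to a terminal" idea can be sketched as follows. This is an illustration in Python, not jq's behavior. `encode_string` is a hypothetical name, and only DEL is handled, although the comment above speaks of non-printing characters in general.

```python
import json
import sys


def encode_string(s, to_terminal):
    """JSON-encode s, additionally escaping DEL for terminal output."""
    # ensure_ascii=False escapes only what JSON itself requires.
    text = json.dumps(s, ensure_ascii=False)
    if to_terminal:
        # Assumption: DEL is the only optional escape we care about here.
        text = text.replace("\x7f", "\\u007f")
    return text


# Decide at runtime whether stdout is a terminal.
print(encode_string("a\x7fb", to_terminal=sys.stdout.isatty()))
```

One consequence of this design is that the same input produces different bytes depending on where the output goes, which would matter for anyone hashing the output.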

nicowilliams commented 7 years ago

Do you have a use case for not escaping DEL?

seagreen commented 7 years ago

My project is for machine-machine communication only, so the fewer special cases (and I suppose the fewer bytes, though I don't care about that too much) the better. For jq, if you have to pick one way, I would continue escaping it.

nicowilliams commented 7 years ago

Ahh, aha.

My plan is to introduce support for a binary pseudo-type in jq. The idea is that binary data would be efficiently represented internally as byte arrays, but in jq it would be represented as arrays of numbers in the 0..255 range. If you then put jq in raw output mode and output binary then the result should be binary data on stdout (perhaps a new CLI option would be added for that; not sure).

This would work about as you expect: tostring would succeed if the input binary is valid UTF-8; a tobinary would work on string inputs and array-of-small-integers inputs; base64 decoding would produce binary; base64 encoding would accept binary and array-of-small-integers inputs; and an isbinary builtin would return true if the input is either binary or an array-of-small-integers.
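As a rough analogy for the semantics described above, here is a sketch in Python. The function names echo the proposed builtins, but these are ordinary Python functions written for illustration. They are not jq's API, and `is_small_int_array`, `to_base64` and `from_base64` are hypothetical helpers.

```python
import base64


def is_small_int_array(value):
    """True for a list whose items are all integers in 0..255."""
    return isinstance(value, list) and all(
        isinstance(n, int) and not isinstance(n, bool) and 0 <= n <= 255
        for n in value
    )


def isbinary(value):
    return isinstance(value, bytes) or is_small_int_array(value)


def tobinary(value):
    """Convert a string or an array of small integers to bytes."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode("utf-8")
    if is_small_int_array(value):
        return bytes(value)
    raise TypeError("expected a string or an array of integers in 0..255")


def tostring(binary):
    """Decode as UTF-8; raises UnicodeDecodeError if the bytes are invalid."""
    return binary.decode("utf-8")


def to_base64(value):
    return base64.b64encode(tobinary(value)).decode("ascii")


def from_base64(text):
    return base64.b64decode(text)


data = tobinary([104, 105])
print(tostring(data))
print(to_base64(data))
print(list(from_base64(to_base64("hi"))))
```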

nicowilliams commented 7 years ago

@seagreen Do you agree that the binary thing I described above would better suit your needs in general? If so I'd like to close this as a dup of that.

seagreen commented 7 years ago

@nicowilliams: You lost me a little bit there, you may have to explain it slower for me.

Are you mentioning the binary pseudo-type as something that might alleviate the need for my "subset of JSON optimized for machine-machine communication"? Unfortunately that project has to be actual valid JSON for various reasons, so it has to be defined as a sequence of code points, not as arbitrary binary.

It sounds like you're set on not having a flag to control how characters are escaped (FYI, it wouldn't just be DEL; there are other choices, like whether newline is escaped as \n or as \u000a). So we've got that resolved. I can actually still use jq to test my project despite this, because it's a simple search and replace to change all the DELs in my output (which, since it's JSON, will only be in strings) into \u007f.

The only remaining thing is my idea for a single "always scientific notation / sometimes scientific for long numbers / never scientific" setting to control number formatting. This is a blocker for using jq for testing since, unlike with DEL, search and replace on exponential numbers would be a slightly tricky process. You'll have to decide whether the tradeoffs of having that are worth it though.

nicowilliams commented 7 years ago

@seagreen Would you mind testing PR #1327? That might address your number formatting needs.

seagreen commented 7 years ago

Sure thing! Turns out that PR still prints long numbers with scientific notation.

nicowilliams commented 7 years ago

@seagreen #1327 should only use scientific notation for numbers for which it decides to use IEEE754 doubles for the internal representation. That would be: integers which do not fit in 64 bits, non-integer reals, and any numbers which are the result of arithmetic or any math functions.

seagreen commented 7 years ago

@nicowilliams: Gotcha. Serializing all those as non-scientific numbers would still be useful to my testing, but like I said, it's up to you guys what options you want to add to jq.

seagreen commented 7 years ago

Here's a link to my project in case seeing what I'm doing helps clarify my questions here. Here's the place where I use jq for testing: https://github.com/seagreen/Son/blob/master/implementation/test/JQ.hs. Even though I had to turn it off when I ran into the issues above, it was still massively helpful while writing the initial code!

klikevil commented 1 year ago

I would like something that can strip control characters from input or simply ignore them, because I have no control over the results I'm receiving from this API.

parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 2, column 14
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 2, column 14
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 2, column 14

wader commented 1 year ago

@klikevil Do you mean stripping ANSI control codes from "colorized" JSON, etc.? Maybe something like this:

$ jq -nC '{"a":"hello"}' | jq -rRs 'gsub("\u001b\\[.*?m";"")' | jq
{
  "a": "hello"
}
# or
$ jq -nC '{"a":"hello"}' | jq -rRs 'gsub("\u001b\\[.*?m";"") | fromjson'
{
  "a": "hello"
}

wader commented 1 year ago

Noticed you maybe meant removing the byte range 0x00-0x1f? Then you could try gsub("[\u0000-\u001f]";"") or gsub("[[:cntrl:]]";"")
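If the cleanup can happen outside jq, the same two approaches (strip the characters, or tolerate them) can be sketched in Python with the standard `json` and `re` modules. `strip_controls` is a hypothetical helper name and the sample input is made up for illustration.

```python
import json
import re

# U+0000 through U+001F. This also removes tabs and newlines between
# tokens, which is harmless because that whitespace is insignificant.
CONTROL_CHARS = re.compile(r"[\x00-\x1f]")


def strip_controls(raw):
    """Remove raw control characters so a strict parser accepts the text."""
    return CONTROL_CHARS.sub("", raw)


raw = '{"a": "hel\x01lo"}'

try:
    json.loads(raw)
except json.JSONDecodeError as err:
    print("strict parse failed:", err.msg)

# Option 1: strip the characters before parsing.
cleaned = strip_controls(raw)
print(json.loads(cleaned))

# Option 2: keep them; strict=False allows control characters in strings.
print(json.loads(raw, strict=False))
```

Stripping changes the string values, while the tolerant parse preserves them, so the right choice depends on whether the control characters carry meaning.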

seagreen commented 1 year ago

(Btw I'm no longer working on this project, so at least for one user you can withdraw this request)