Closed jmblog closed 1 year ago
I'm not entirely sure it's wrong to do so. What format would you prefer?
In terms of math, it is not wrong to write 1e+07 instead of 10000000.
But in terms of software it makes a big difference: Consider a Unix pipe (my usecase) like
% some program | jq 'a cool filter' | some other program B
Now jq's output is fed to program B, which can deal with int64 or even int128, but not with floats, because there are no floats in the original data.
When jq makes a conversion like above, program B bails out.
See: https://github.com/stedolan/jq/issues/143#issuecomment-28922479
To answer your question: I would prefer if jq would not change the representation of a number, if this number is just moved from jq input to jq output.
(like "sort -g", it does interpret the input numbers but still outputs the original lines.)
@tischwa +1
It'd be a lot easier to provide options for how numbers are formatted on output than to preserve input form (when not touched by arithmetic).
I've started on implementing delayed parsing of numbers so as to preserve their original form wherever possible (i.e., whenever the actual number isn't needed as a double in the jq program). It turns out that this will require a lot of work :( Even then we'll need a bignum library to support bignum math in jq.
@nicowilliams I already did this, in my fork: https://github.com/airfrog/jq I can send you a pull request if you want to incorporate it into the main branch.
Wouldn't it be possible to keep for each parsed number not only the numeric value, but also the input string? If nothing is assigned to that numeric field during filtering, the input string could be output as is. If a number is created in an arithmetic expression, the string is empty and the number is output according to some formatting option.
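A rough illustration of that idea, in Python rather than jq's C. The `Num` class, its fields, and the `.17g` fallback format are all made up for this sketch and are not taken from jq or from any fork:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Num:
    """Hypothetical number cell: a double plus the literal it was parsed from."""
    value: float
    literal: Optional[str] = None  # None for numbers produced by arithmetic

    @classmethod
    def parse(cls, text: str) -> "Num":
        # Keep the input text next to the (possibly lossy) double.
        return cls(float(text), text)

    def __add__(self, other: "Num") -> "Num":
        # Arithmetic yields a fresh number with no original literal.
        return Num(self.value + other.value)

    def dump(self) -> str:
        # Untouched numbers are echoed verbatim; computed ones are formatted
        # with an assumed output option (17 significant digits here).
        if self.literal is not None:
            return self.literal
        return format(self.value, ".17g")


big = Num.parse("111111111111111111")
print(big.dump())                     # passed through unchanged
print((big + Num.parse("0")).dump())  # computed, so formatted from the double
```

The point of the design is that the lossy double is only ever visible for numbers the program actually computed; anything merely moved from input to output keeps its original text.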
I remember this https://github.com/stedolan/jq/issues/143#issuecomment-28311876 where airfrog seemed to have something similar working. (Ahh, I just saw he sent a pull request.)
I think in terms of universal usability jq would gain a lot if it followed the typical philosophy of the classical Unix filter-like programs, which only modify the input if they have to. Examples:
% cat num.txt
111111111111111111
222222222222222222
% jq '.' num.txt
111111111111111100
222222222222222200
% awk '{print $1, 1*$1}' num.txt
111111111111111111 111111111111111104
222222222222222222 222222222222222208
% sort -n num.txt
111111111111111111
222222222222222222
% sort -g num.txt
111111111111111111
222222222222222222
So usually the input is passed through unchanged. Only when awk has to do the computation 1*$1 does it switch internally to a numeric representation; the plain $1 is printed exactly as given in the input. Likewise, sort -n/-g has to interpret the lines numerically but still outputs the original input.
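The altered digits in the awk column come from the IEEE 754 double itself, which can be checked outside of jq and awk. A small Python sketch (the helper name `survives_double` is just for illustration):

```python
# Every integer with magnitude up to 2**53 is exactly representable as a double;
# above that, only some integers are.
LIMIT = 2 ** 53


def survives_double(text: str) -> bool:
    """Return True if the integer literal round-trips through a double."""
    n = int(text)
    return int(float(n)) == n


for text in ("111111111111111111", "222222222222222222", str(LIMIT)):
    n = int(text)
    print(text, int(float(n)), survives_double(text))
```

For the two values in num.txt this prints 111111111111111104 and 222222222222222208 as the nearest doubles, the same values awk shows for 1*$1.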
+1
jq does have David M. Gay's bigint code in jv_dtoa.c. Perhaps it should use more of it. It's thread-safe, and the jv_dtoa_context stuff is really for caching reusable things -- an optimization we could remove if it made things easier. This is clearly more complete than libtomfloat for some things, namely: parsing and formatting numbers, as well as big2double and double2big conversions (which will be needed for API backwards compatibility reasons, and to be able to use libm functions). But it's also less complete for other things: fewer arithmetic operations are implemented (e.g., there's no divide, just a ratio() that returns a double). Either there's a lot of work to do on either codebase, or we find another, more complete library. Ideas?
OTOH, jq maybe doesn't need bignum operations, just a bignum representation falling back to doubles for arithmetic (and comparison?). But I'd prefer to only fall back to doubles for libm functions for which we find no better alternative.
@nicowilliams wrote:
OTOH, jq maybe doesn't need bignum operations, just a bignum representation ....
bignum operations for the medium or long term; bignum representation for the short term (or tomorrow :-)
Well, it's early days and there's still research to be done.
http://www.eskimo.com/~eresrch/float/ looks promising, though I've no idea what the license on it would be (I sent the author email about this). It's very complete, but a) it's fixed-precision (probably easy to change to be dynamic) and b) it doesn't handle normal string representation of numbers (probably also easy to fix). I haven't looked but I suspect it also doesn't do double2big and big2double conversion.
The author of Big Float (http://www.eskimo.com/~eresrch/float/) has agreed to let us use it under friendly terms. I'll take a look at it and see how suitable it is.
Would be really really nice to have some progress here. I've just been bitten by this bug and it's really hard to catch, as the numbers jq spits out look totally legit, i.e. within the same magnitude.
@kutzi (and anyone else interested in bignum support in jq) We have a PR (that I need to find time to finish) for 64-bit integer support (in addition to IEEE754 doubles). We're not likely to add any kind of bignums unless someone submits a PR. If you or anyone else wants to work on bignum support for jq, you'll need to be aware of a couple of things:
jv_free() values known to be numbers, and valgrind can't check for that anyways given that numbers today are never allocated.
I'd start by adding a compile-time option to use allocated numbers so that numeric jvs point to a malloc()ed double, that way failures to jv_free() numbers can be caught by valgrind and fixed.
Next I'd look for a suitable bignum library. There are quite a number of them, but they'd have to be a) C-coded or otherwise have a C API, b) licensed in a way that's friendly to jq's license and jq's users.
Lastly, I'd integrate such a bignum library much like Oniguruma: as a [git] submodule that is used if ./configure can't find it installed or if the user wants the submodule used.
Any updates?
Would this be the right issue to ask about having jq preserve zeros after the decimal point as well, e.g. not converting 5.0 to 5? JSON is agnostic about the semantics of numbers, but the programs using the JSON may very well care to differentiate between 500, 500.0, and 5e2 (it might be doing precision-based calculations, for instance).
As a concrete example, Java's BigDecimal class keeps track of both the unscaled value and the scale. So these could reasonably be differentiated as different values when read by a Java program:
System.out.println(new BigDecimal("500"));
System.out.println(new BigDecimal("500.0"));
System.out.println(new BigDecimal("5e2"));
System.out.println();
System.out.println(new BigDecimal(BigInteger.valueOf(500), 0));
System.out.println(new BigDecimal(BigInteger.valueOf(5000), 1));
System.out.println(new BigDecimal(BigInteger.valueOf(5), -2));
=>
500
500.0
5E+2
500
500.0
5E+2
$ echo '[500, 500.0, 5e2]' | jq -c
[500,500,500]
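For comparison, here is how the same distinction looks when a consumer decodes that array. This is a Python sketch using the standard json and decimal modules, purely as an illustration of what a downstream program could see; it says nothing about how jq handles numbers internally:

```python
import json
from decimal import Decimal

text = "[500, 500.0, 5e2]"

# Default decoding: the integer literal stays an int, the other two become doubles.
as_doubles = json.loads(text)

# Decoding into Decimal keeps the scale/exponent carried by each literal.
as_decimals = json.loads(text, parse_int=Decimal, parse_float=Decimal)

print(as_doubles)
print([str(d) for d in as_decimals])
```

With Decimal the three literals stay distinguishable ('500', '500.0', '5E+2'), which mirrors the BigDecimal output above; once the array has been rewritten as [500,500,500] that information is gone for any consumer.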
jq 1.7 was released with support for literal large numbers. Closing.
Awesome, thank you!
There are still cases where jq converts extra-large numbers to scientific (exponent) notation.