jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

Mishandling of large numbers #1959

Closed mendess closed 1 year ago

mendess commented 5 years ago

Describe the bug
jq fails to parse and output large numbers, mangling the number on output.

To Reproduce
For the input JSON:

{"key":418502930602131457}

jq . test.json produces this:

{
  "key": 418502930602131460
}

For the input:

{"key":489819608690327552}

jq . test.json produces this:

{
  "key": 489819608690327550
}

Expected behavior
The numbers should not change.

Environment:

Additional Notes
The output number seems to always end in 0, but sometimes it also seems to "round". Refer to both of the provided examples.

wtlangford commented 5 years ago

This is a known issue (see https://github.com/stedolan/jq/issues?utf8=%E2%9C%93&q=is%3Aissue+label%3Aieee754 for a history on the topic).

In short, jq uses IEEE754 doubles to store numbers (which is permitted by the JSON specification). This means that very large integers might get adjusted to the nearest representable value. (Even if you weren't using jq, it's possible some other tool you use would do this to you)
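The snapping described above is not jq-specific; it is what any IEEE 754 binary64 representation does. A minimal sketch in Python (used here only as a stand-in for any double-based parser, not as jq's actual code):

```python
# Doubles have a 53-bit significand, so integers above 2**53 cannot all be
# represented exactly; they snap to the nearest representable value.
n = 418502930602131457          # the integer from the report above
as_double = float(n)            # what a double-based parser would store

assert n > 2**53                          # beyond the exact-integer range of binary64
assert int(as_double) == 418502930602131456  # snapped to the nearest multiple of 64
```

The exact digits printed can additionally depend on output formatting (printing a double with limited significant digits is why jq's output tends to end in 0), but the underlying precision loss is the same.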

There's a PR (https://github.com/stedolan/jq/pull/1752) adding support for large numbers, but it comes with some performance penalties, and we haven't had the time to get it merged yet. Hopefully Soon™.

I generally suggest that a large number that you aren't doing math on is actually a string. If you're able to represent it as a string instead of a number, then that's your best bet until we get that big number support merged.
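As a sketch of that workaround: once the value is a JSON string, a conforming tool treats it as opaque text and its digits survive any round-trip. Illustrated here with Python's json module standing in for an arbitrary JSON tool:

```python
import json

# As a number, a double-based tool may alter the value;
# as a string, the digits pass through untouched.
doc = '{"key": "418502930602131457"}'
roundtrip = json.dumps(json.loads(doc))

assert json.loads(roundtrip)["key"] == "418502930602131457"  # preserved exactly
```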

mendess commented 5 years ago

Thanks for the quick response, and sorry for the duplicate issue; should I close it?

The numbers I am working with actually used to be strings, but we switched to a more type-safe language (Rust), and having them be actual numbers was more ergonomic. Good luck with the PR :)

cblp commented 5 years ago

jq uses IEEE754 doubles to store numbers (which is permitted by the JSON specification)

Please show where the standard permits it.

ECMA-404:

JSON is agnostic about the semantics of numbers. In any programming language, there can be a variety of number types of various capacities and complements, fixed or floating, binary or decimal. That can make interchange between different programming languages difficult. JSON instead offers only the representation of numbers that humans use: a sequence of digits. All programming languages know how to make sense of digit sequences even if they disagree on internal representations. That is enough to allow interchange.

cblp commented 5 years ago

Alternative spec, RFC 7159:

This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

I think "limits on numbers accepted" can mean rejection of some values, but not alteration of the textual representation.

pkoppstein commented 5 years ago

RFC 8259 in the Parsers section says:

A JSON parser MUST accept all texts that conform to the JSON grammar.
... An implementation may set limits on the range and precision of numbers.

jq sets limits, and accepts valid texts. If it raised an error when a limit was violated, it would violate the “must accept” requirement, wouldn’t it? So it seems to me there’s good reason to be unhappy about the mishmash that became of Crockford’s original intention.

https://tools.ietf.org/html/rfc8259#page-10

cblp commented 5 years ago

@pkoppstein, jq doesn't transform JSON into another representation, it is not a parser.

wtlangford commented 5 years ago

jq doesn't transform JSON into another representation, it is not a parser.

jq does have a parser. How else would it transform JSON text (which is "a text format for the serialization of structured data") into actual values/data that it can run your program on? jq also has a generator (as defined by the RFC), in that it produces JSON texts which strictly conform to the standard.

I think, "limits on numbers accepted" can mean rejection of some values, but not an alteration of textual representation.

The issue here is that to our parser and generator, the JSON texts for these "altered" numbers represent the same value (because to us, values are IEEE754 doubles, and those have precision issues for very large and very small numbers).
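Concretely, "represent the same value" means the two decimal texts round to the identical double, so after parsing there is no information left to distinguish them. A quick check (Python shown as a stand-in for any binary64-based implementation):

```python
# Both decimal texts round to the same IEEE 754 double, so a double-based
# parser literally cannot tell them apart once parsing is done.
assert float("418502930602131457") == float("418502930602131456")
```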

We're aware this is a pain point for people. Lots of tools output very large integer IDs, and from the perspective of a user, jq ends up mangling those IDs. As I mentioned above, we have a PR (#1752) in progress that will add some big number support to jq, at the cost of some performance. We (the maintainers) haven't had time to finalize and merge it yet, but it's high on our jq priority list.

pkoppstein commented 5 years ago

@cblp - jq includes a parser. The architecture precludes the non-parser part of jq from seeing the input as anything other than what the parser reveals. As I’ve already indicated, we’re all aware of the problem, so whether you blame jq’s architecture or the state of JSON requirements seems somewhat pointless.

cblp commented 5 years ago

Yes, jq has a parser inside, but it's an implementation detail, below the abstraction line. The user does not expect jq to behave like a parser.

Besides that, the use of double in that parser is unnecessary and, as we see, harmful. It is possible to parse JSON without modifying the values.

So, the reference to "parser" doesn't make this behavior valid.
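For what it's worth, parsers that avoid this loss do exist: Python's json module, for example, parses integer literals into Python's arbitrary-precision int type, so the value round-trips unchanged. This is offered only as an existence proof, not as a description of jq's internals:

```python
import json

# Python's json parser keeps integer literals as arbitrary-precision ints,
# so large values survive parse and serialize without loss.
value = json.loads('{"key": 418502930602131457}')["key"]

assert value == 418502930602131457                   # parsed exactly
assert json.dumps(value) == "418502930602131457"     # serialized exactly
```

The trade-off, as noted elsewhere in this thread, is that arbitrary-precision support costs performance, which is why it was gated behind PR #1752.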


nicowilliams commented 5 years ago

@cblp a user who only ever runs jq . probably doesn't expect jq to be a JSON parser. Any user who writes jq programs more complex than . will understand (on some level) that jq is indeed a JSON parser with an internal representation. At any rate, jq really does parse JSON into an internal representation.

The next release of jq will have better range and precision for numerics.

cblp commented 5 years ago

I'm sorry if that came across as blaming; I just wanted to clarify the reason for the current design.

pkoppstein commented 5 years ago

@cblp - If there is a reason, it probably is some mix of a desire to achieve efficiency in a quick and simple way, a sense that the JSON spec allows implementations to set limits, and perhaps a sense or belief that in practice, the issue is relatively unimportant. Feel free to assign whichever weights you like :-)

wjmelements commented 4 years ago

I am still seeing this for big integers. For example, 5474205234507702943235 becomes 5474205234507703000000. The difference is meaningful for me and prevents me from using jq.

pkoppstein commented 4 years ago

@wjmelements - Good news! The issue has been addressed in the "master" version of jq:

$ jqMaster --version
jq-1.6-107-g24564b2

$ jqMaster -n 5474205234507702943235
5474205234507702943235

The enhancement dates from Oct 19, 2019, which is after the release of jq 1.6.

jprupp commented 3 years ago

I have the version that comes with Fedora 34, and it still has this issue: it cannot handle large numbers. I would love it if it could.

AlaaHamoudah commented 3 years ago

This is a very serious issue; I would really appreciate it if it were fixed.

scottyob commented 3 years ago

+1 on this. Can also confirm master helps our one use case:

abc@202d6ad3f0b5:/tmp/jq$ ./jq --version
jq-1.6
abc@202d6ad3f0b5:/tmp/jq$ echo '{"a":9011153322235679}' | ./jq '.a'
9011153322235680

abc@202d6ad3f0b5:/tmp/jq/jq$ ./jq --version
jq-1.6-137-gd18b2d0-dirty
abc@202d6ad3f0b5:/tmp/jq/jq$ echo '{"a":9011153322235679}' | ./jq '.a'
9011153322235679

wpietri commented 3 years ago

If fixing this is a problem, then perhaps you could make it blow up when it mangles data? We're having an issue where somebody used jq at the beginning of a research project. It silently corrupted a bunch of IDs, so the downstream work now has to be redone or hackily fixed.

This means we can't really trust jq unless we carefully validate all the data as jq-safe. So in practice it looks like we'll just have to stop using jq. It's a lovely tool, but not so lovely that we want to end up looking like fools by coming to the wrong conclusion.

scottyob commented 3 years ago

@wpietri: Pretty sure you could just use master to solve most of your problems. My understanding is that they store these as C strings and only throw away precision when you try to do mathematical operations on them.

It's really a shame we've not had a release in ages with these fixes in it.

gdamore commented 1 year ago

Hi from the future! It's 2023, and we still don't have a release with a fix for this!

leonid-s-usov commented 1 year ago

This is handled by #1752 and should appear in the next build

emanuele6 commented 1 year ago

jq 1.7 was released with the fix. Closing.