qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

JSON: Parsing and serializing numbers, often undesired E notation #1583

Open ChristianGruen opened 1 week ago

ChristianGruen commented 1 week ago

If JSON numbers are converted to XML and serialized as JSON, it is confusing to end up with an E notation for large numbers. An example:

'100000000000000000000'
=> parse-json()
=> serialize(map { 'method': 'json' })

Obviously, lossless roundtripping is not possible (1e20 is a valid JSON number, so we cannot distinguish it from 100000000000000000000), but as the E notation is much less common than integers, maybe we could try to return more numbers in their integer representation if the result would be equivalent?

Related: #1445

ChristianGruen commented 1 week ago

I see that the Serialization spec states (https://qt4cg.org/specifications/xslt-xquery-serialization-40/Overview.html#json-output):

> Implementations may serialize the numeric value using any lexical representation of a JSON number defined in [RFC 7159].

Ideally, we could define a representation that is not implementation-dependent.

ChristianGruen commented 1 week ago

We could use 6.1.6.1.20 Number::toString as a guideline (and preserve the existing rules for NaN and Infinity).
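For reference, the Number::toString rules behave as follows (runnable in any JS engine; the last line just illustrates why NaN and Infinity, which have no JSON representation, need the existing separate rules):

```javascript
// ECMAScript Number::toString: plain notation for magnitudes in
// [1e-6, 1e21), exponential notation outside that range.
console.log((1e20).toString());      // "100000000000000000000"
console.log((1e21).toString());      // "1e+21"
console.log((0.000001).toString());  // "0.000001"
console.log((0.0000001).toString()); // "1e-7"
console.log(String(NaN), String(Infinity)); // "NaN" "Infinity"
```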

ChristianGruen commented 1 week ago

There is an asymmetry between parsing and serializing JSON.

Again, the behavior depends on the implementation. For example, Saxon, eXist and BaseX serialize the xs:decimal 1.00000000000000000000000000001 unchanged, while XMLPrime returns 1:

1.00000000000000000000000000001
=> serialize(map { 'method': 'json' })

Do we think it’s advantageous to serialize numeric types differently, or should we rather serialize all as doubles?
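A sketch of what "serialize all as doubles" would imply: the decimal 1.00000000000000000000000000001 has no double representation other than 1.0 (the deviation is far below double epsilon, about 2.2e-16), so an implementation converting xs:decimal to double before serializing must emit "1":

```javascript
// Converting the decimal literal to a double loses the trailing digit,
// which would explain an output of "1" for the example above.
const asDouble = Number("1.00000000000000000000000000001");
console.log(asDouble === 1);           // true
console.log(JSON.stringify(asDouble)); // "1"
```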

michaelhkay commented 6 days ago

Clearly, adding the number-parser option to parse-json() was an attempt to solve this problem while retaining backwards compatibility. It seems you want to do something a bit more aggressive that affects the default behaviour in a way that might not retain backwards compatibility. One option clearly is for parse-json to deliver an integer, decimal, or double depending on the lexical form of the number, in the same way that we do for numeric literals. That could be overridden (to reinstate the 3.1 behaviour) by setting number-parser=xs:double#1.

I'm a bit reluctant to change the serialization rules. If we change them to be more prescriptive, then some implementations will need to change and users may not like the change. I'm reluctant to add serialization parameters to give users more control. Saxon's rule, incidentally is (a) for xs:decimal, never use exponential notation, (b) for xs:double, use exponential notation only outside the range 1e-18 to 1e+18. That seems to be good enough for most people.
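Saxon's xs:double rule as described above can be sketched as a predicate (the function name and the exact treatment of the boundaries are assumptions, not Saxon's actual code):

```javascript
// Hypothetical sketch of Saxon's rule: exponential notation for doubles
// only outside the range 1e-18 to 1e+18 (boundary handling assumed).
function saxonUsesExponent(d) {
  const abs = Math.abs(d);
  return abs !== 0 && (abs < 1e-18 || abs >= 1e18);
}
console.log(saxonUsesExponent(1e20));  // true
console.log(saxonUsesExponent(42.5));  // false
console.log(saxonUsesExponent(1e-20)); // true
```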

ChristianGruen commented 6 days ago

> Clearly, adding the number-parser option to parse-json() was an attempt to solve this problem while retaining backwards compatibility. It seems you want to do something a bit more aggressive that affects the default behaviour in a way that might not retain backwards compatibility. One option clearly is for parse-json to deliver an integer, decimal, or double depending on the lexical form of the number, in the same way that we do for numeric literals. That could be overridden (to reinstate the 3.1 behaviour) by setting number-parser=xs:double#1.

Yes, an alternative would be to change the parsing. However, I believe this approach would be more invasive than changing serialization.

> I'm a bit reluctant to change the serialization rules. If we change them to be more prescriptive, then some implementations will need to change and users may not like the change. I'm reluctant to add serialization parameters to give users more control. Saxon's rule, incidentally is (a) for xs:decimal, never use exponential notation, (b) for xs:double, use exponential notation only outside the range 1e-18 to 1e+18. That seems to be good enough for most people.

What is the reason for choosing 1e+18, and why does it differ from what ECMA does (see the link above)?

I believe most users would appreciate it if all implementations behaved similarly. The problem is pretty similar to our fn:xml-to-json challenge.

michaelhkay commented 2 days ago

> What is the reason for choosing 1e+18, and why does it differ from what ECMA does (see the link above)?

I guess I was unaware that ECMA had chosen 1e+21 as the threshold.

My thinking was primarily to ensure that all integers that had "accidentally" been converted to floating point would be output as integers, and I think 18 is sufficient for that (in fact, 15 probably is).
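The reasoning behind "15 probably is" can be sketched numerically: every integer up to 2^53 (about 9.007e15, i.e. 16 digits) is exactly representable as a double, so any integer that was "accidentally" converted to floating point and survived exactly fits well under a 1e18 threshold:

```javascript
// Integers are exact in a double up to 2^53; beyond that, precision is lost.
console.log(Number.MAX_SAFE_INTEGER);        // 9007199254740991 (~9.007e15)
console.log(2 ** 53 === 2 ** 53 + 1);        // true: 2^53 + 1 rounds back to 2^53
console.log(Number.MAX_SAFE_INTEGER < 1e18); // true: well inside the threshold
```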