haskell / aeson

A fast Haskell JSON library

Should we make a clear value-level distinction between floating and integral numbers? #546

Open leftaroundabout opened 7 years ago

leftaroundabout commented 7 years ago

This basic subject has been variously discussed before –

https://github.com/bos/aeson/issues/227 https://github.com/bos/aeson/issues/181

The thing is that Aeson can't make a distinction between integer values and floating-point values (scientific values) which merely happen to be equal to an integer.

Prelude Data.Aeson> encode 4
"4"
Prelude Data.Aeson> encode 4.0
"4"
Prelude Data.Aeson> encode 4e+30
"4000000000000000000000000000000"
Prelude Data.Aeson> encode 4.53986e+30
"4539860000000000000000000000000"

The former two may be sensible enough. The latter are kinda correct too, at least in a dynamically-typed understanding of numbers; but they're arguably not desirable. It's most obvious when interpreting JSON as a human-readable format, as the aeson-pretty library does. There, this behaviour is under discussion right now. In science, numbers like 4.53986e+30 are very much not something unusual, and nobody would consider it acceptable to print them in expanded, exact integer form if the purpose is human readability.

Now, it's understood that Aeson itself does not really focus on human readability, but even so, such insignificant-digit-monsters are a bit of a pain. The space usage and parsing overhead can't be completely irrelevant.

Dynamic languages like Python actually make a distinction between integers and integral-floats in JSON:

Python 3.4.3 (default, Nov 17 2016, 01:08:31) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import json

In [5]: json.dumps(4)
Out[5]: '4'

In [6]: json.dumps(4.0)
Out[6]: '4.0'

In [7]: json.dumps(4e+32)
Out[7]: '4e+32'

Aeson used to have such a distinction too: before aeson-0.7, Number used Data.Attoparsec.Number, which has separate cases. Now granted, this can lead to some problems too

https://github.com/sol/aeson-qq/issues/4

but all in all I think we should follow Python here. So would it be viable to try something like

data Value = Object !Object
           | Array !Array
           | String !Text
           | Scientific !Scientific
           | Int !Int
           | Bool !Bool
           | Null
             deriving (Eq, Read, Show, Typeable, Data)

to have a clear grasp of which numbers are really integers and which ones are merely within experimental error in the range of an integer?
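To make the intent concrete, here is roughly how the basic instances could pick constructors under that hypothetical Value (a sketch only; fromFloatDigits is from Data.Scientific):

import Data.Scientific (fromFloatDigits)

instance ToJSON Int where
  toJSON = Int                           -- genuine integers stay integers

instance ToJSON Double where
  toJSON = Scientific . fromFloatDigits  -- floats stay floats, even at 4.0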

ksromanov commented 7 years ago

It is possible to maintain backward compatibility with

data Value = Object !Object
           | Array !Array
           | String !Text
           | Number !Scientific -- represents either integer, or floating point
           | IntegerNumber !Int -- represents only "small" integers
           | Bool !Bool
           | Null
             deriving (Eq, Read, Show, Typeable, Data)

If we do not touch decoding, the only compatibility issue will be an incomplete-pattern warning. This is dangerous only if a library that uses aeson is compiled with -Werror in a release build (which is arguably an error in its build scripts).
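For illustration, a formerly-total match like this made-up one would compile against the extended Value with only an incomplete-pattern warning, and crash at runtime when handed an IntegerNumber:

render :: Value -> String
render Null       = "null"
render (Bool b)   = show b
render (Number n) = show n
render (String t) = show t
render (Array a)  = show a
render (Object o) = show o
-- missing: render (IntegerNumber i) = show i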

leftaroundabout commented 7 years ago

Actually, I deliberately didn't keep the name Number because, for this proposal to work out properly, we must break backwards compatibility: right now everybody wraps their Int values in Number, but if they keep doing that while we format Scientific values in decimal-fraction-exponent form, that will break applications which expect proper int values further down the road. Can't have that. With two entirely new constructors, people must be explicit and actually choose the right constructor for their application.

Most applications probably don't handle the Value ADT manually anyway, but let some Generic-derived FromJSON instance do that for them.
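For example, a typical derived setup (a sketch; the record is made up) never mentions Value constructors at all:

{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson
import GHC.Generics (Generic)

data Point = Point { x :: Double, y :: Double } deriving (Generic, Show)

instance ToJSON Point
instance FromJSON Point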

leftaroundabout commented 7 years ago

An entirely different point to think about: having a dedicated constructor for Int values might well improve performance, certainly memory usage, in lots of applications, because very often Int is all you need really.

ksromanov commented 7 years ago

The issue ".0 is added to integers or not added to doubles" is much less important than backward compatibility. People rely on aeson, and if it breaks, they will definitely not be happy.

You should take into account that aeson not only encodes but also decodes JSON. And the JSON format is designed to treat all numbers as Scientific (I personally think that is a mistake, but we can't go 30 years back, change JavaScript, and show its designers ML). So encoding integers without a trailing ".0" and doubles with one is a hack.

So why not make a recommendation and let library authors choose a proper format?

leftaroundabout commented 7 years ago

On the decoding front it's even clearer, isn't it? We definitely want to decode small numbers without fractional part as IntegerNumber, but to handle that, people must follow the warning and actually deal with that constructor. (If they even look for the warning; I personally tend to not use -Wall and deliberately employ incomplete case matches as well-defined points of failure for “my-brain-exploded”–this-can-never-happen cases.)

As long as we permit simply leaving out the new constructor, there will be some instances of runtime crashes. It is such runtime errors that would be really bad for the people who rely on aeson. OTOH, if they simply can't compile with the new version, they'll know they must stick to an older one or else change their code to incorporate the new cases.

If that's still too much compatibility broken, there's also the (perhaps controversial) option of reinstating Number as a pattern synonym that will handle both floating and integral numbers:

{-# LANGUAGE PatternSynonyms, ViewPatterns #-}

import Data.Scientific (Scientific)

-- Bidirectional: matching accepts both constructors; construction picks
-- Int for exactly-integral values (beware: round overflows Int for huge n).
pattern Number :: Scientific -> Value
pattern Number n <- (matchNumber -> Just n)
  where Number n
          | fromIntegral n' == n  = Int n'
          | otherwise             = Scientific n
         where n' = round n :: Int

matchNumber :: Value -> Maybe Scientific
matchNumber (Scientific n) = Just n
matchNumber (Int n)        = Just (fromIntegral n)
matchNumber _              = Nothing

phadej commented 7 years ago

I have a concern: the JSON spec doesn't distinguish numbers by how they are written. /Some/ libraries in other languages do make the distinction (like Python mentioned above), but I don't think that's a good argument.

Taken to extreme, one might want to differentiate between 1.23 (Centi) and 1.230000 (Micro).

IMHO these discussions would have better ground if the equivalence of JSON documents were well-defined in the spec. To my understanding the lowest common denominator is the _.isEqual function in JavaScript, which doesn't distinguish between Int and Double; so why should aeson? (Also, what would Eq Value look like?)

We could encode numbers differently (e.g. Centi and Micro to always append digits), but doing something smart with Double will affect encoding performance.

I don't think that parsing of 4539860000000000000000000000000 or 4.53986e+30 differs significantly in performance. aeson should be safe to use on untrusted input; we cannot make one format perform better by sacrificing the performance of another, as that would be a security bug.

Prove me wrong with an actual benchmark (e.g. some array of arrays of such numbers, something the scientific data dump might look like).
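Something along these lines would do (a criterion sketch; the payload shape and size are arbitrary assumptions):

{-# LANGUAGE OverloadedStrings #-}
import Criterion.Main
import Data.Aeson (Value, decode)
import qualified Data.ByteString.Lazy.Char8 as BL

main :: IO ()
main = defaultMain
  [ bench "expanded"   $ nf (decode :: BL.ByteString -> Maybe Value) expanded
  , bench "scientific" $ nf (decode :: BL.ByteString -> Maybe Value) compact
  ]
 where
  arr xs   = BL.concat ["[", BL.intercalate "," xs, "]"]
  expanded = arr (replicate 10000 "4539860000000000000000000000000")
  compact  = arr (replicate 10000 "4.53986e30")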

ksromanov commented 7 years ago

@phadej - of course, parsing takes the same time, but the expanded version of 4.53986e+30 is roughly twice as long. And that is not even the major problem: suppose the input data of your program contains a Double with garbage (bad, but it unfortunately does happen). It is very likely that its exponent will be around 300, so the expanded version in JSON will span a couple of lines full of zeroes, i.e. the size will be ten times larger.

Why not use the floating-point format, which was specifically designed to save space and preserve precision? It is just a single line.

@leftaroundabout - we do not know whether the JSON number 10 has type integer or double. It might be a double whose encoder (not aeson) stripped the trailing .0. Or, conversely, 3.0 may be an integer with a .0 appended. Therefore, we can restore the type only if we control the source of the JSON.

So, by default decode should parse numbers as Scientific, and only when the user of aeson specifically requests it should decode split values into both Int and Scientific. That way Int stays silent: it does not propagate outside aeson unless specifically requested, hence no run-time crashes.
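For illustration, the opt-in split could sit on top of Data.Scientific (the helper is hypothetical, not aeson API):

import Data.Scientific (Scientific, toBoundedInteger)

intOrScientific :: Scientific -> Either Scientific Int
intOrScientific s = maybe (Left s) Right (toBoundedInteger s)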

On the other hand, encode can work with both Int and Scientific, leaving or stripping trailing zeroes.

If so, why should we break compatibility?

phadej commented 7 years ago

The example in the first post specifies neither the aeson version nor the type.

With aeson-1.1:

Prelude Data.Aeson Data.Scientific> encode (1e10 :: Float)
"1.0e10"
Prelude Data.Aeson Data.Scientific> encode (1e10 :: Double)
"1.0e10"
Prelude Data.Aeson Data.Scientific> encode (1e10 :: Scientific)
"10000000000"

IIRC the Scientific behaviour was deliberately made to always encode integral values as integers. 1e10 is a number in floating format, and again /some/ libraries load it as floating point. E.g. Python:

>>> type(json.loads("1e10"))
<type 'float'>
>>> type(json.loads("10000000000"))
<type 'int'>

So we cannot encode big Integers in scientific format, because it will break interoperability with Python.

phadej commented 7 years ago

Also

>>> json.dumps(10000000000000000000000000000L)
'10000000000000000000000000000'

bergmark commented 7 years ago

Thanks for the proposal @leftaroundabout,

There seem to be multiple ways this could be implemented and it's not clear to me that a majority wants the same behavior. I primarily work with ints, and in that case I actually don't want there to be any difference between 3.0 and 3.

Everyone is able to make a custom parser and encoder that can handle Scientific in the way they want, so if you have a particular preference I think this is the way to go.
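For instance, a custom encoder that deliberately collapses 3.0 to 3 could look like this (a sketch; the newtype is made up):

import Data.Aeson
import Data.Scientific (Scientific, floatingOrInteger)

newtype Collapsed = Collapsed Scientific

instance ToJSON Collapsed where
  toJSON (Collapsed s) = case floatingOrInteger s :: Either Double Integer of
    Right i -> toJSON i   -- integral: no fractional part in the output
    Left  d -> toJSON d   -- floating: encode as a double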

Value represents JSON and I'm wary of extending it to be more powerful than that.

Regarding backwards-compatibility, I do not think there is a way to make such a change non-breaking or easily migrated, I think it would break lots of code in possibly silent ways.

ksromanov commented 7 years ago

@bergmark unfortunately, Scientific loses the information about the initial type on purpose. So it is impossible to reliably decide whether to emit 3 or 3.0 (which for some people makes it easier to read).

leftaroundabout commented 7 years ago

@phadej taking it to the extreme and covering every number type you could come up with is certainly not feasible – not just with JSON, but with any fixed-specification data format. It can never be possible to distinguish between 0.13 :: Centi and 0.13 :: Double with just a single JSON-compliant number literal, and it certainly won't be possible to reliably represent it if I come up with a new type that represents exactly, say, all real numbers of the form ±37 tanh q/12 (where q ∈ ℚ ∩ ]-1,1[).

But who cares about that? If you really need reliable fixed precision, then encode it as integers and make the constant denominator part of the specification. If I really need a crazy, super-precision type that can't be adequately represented as standard number literals, I'll bite the bullet and encode it as a specialised record, which includes all necessary implementation details.
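For instance, a monetary amount with fixed precision can be specified as an integer count of the smallest unit (a sketch; the type is made up):

import Data.Aeson

newtype Cents = Cents Integer   -- spec: the JSON field is an integer number of cents

instance ToJSON Cents where
  toJSON (Cents c) = toJSON c

instance FromJSON Cents where
  parseJSON = fmap Cents . parseJSON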

Floating-point numbers are another story, though. Not only is Double the standard type for representing real numbers in science, it's also the not-quite-but-almost-official recommended type in RFC 7159 for all numbers. If I understood correctly, optimising JavaScript implementations actually store everything in Double, and just cram anything that doesn't fit into a NaN payload of such a number!

Hence one might even argue that it was a mistake for Aeson to ever assume you can safely store high-precision integral values in JSON number literals. Perhaps we should, right from the start, have used Number :: Double -> Value, and stored 1 :: Integer as "1"?

(No, I'm not really serious about this, but just saying... support for big integers, compatibility with Python, JavaScript and anything else – can't have, all of that.)

Anyway, there is a good case to be made for proper support of actual floating-point numbers in a JSON implementation. The very essence of floating-point is that precision is magnitude-dependent. It should certainly be exact for small integer values, and I suppose we agree that it should also be precise for bigger integer values, regardless of the previous paragraphs. But being exact in general is completely out of the question: it makes no sense to have precision over 30 orders of magnitude (nobody could possibly measure values at such precision in the real world), yet it's essential to have full precision within each of those orders of magnitude individually.

Basically, Double has the property of reliably offering precision that is sufficient for almost all physical quantities, while also reliably staying within adequate bounds on memory and processor usage. Put another way, floating-point as a representation of physical quantities should be agnostic of the unit system used: it shouldn't matter whether I measure a distance in metres or in light-years. But with an encoder that coaxes any floating-point value into expanded integer notation whenever it happens to equal an integer within the precision assigned to that order of magnitude, this property is lost! Memory usage fluctuates wildly under a mere change of unit system, even though that's conceptually a no-op.

Ewww... sorry for the rant. What I'm getting at:

phadej commented 7 years ago

@leftaroundabout I personally disagree on your second and, partially, your third point.

IMHO the hack isn't worthwhile; if you need to communicate data while preserving the types between aeson and not-aeson, you shouldn't rely on the text representation of the numbers (extreme example: don't transfer zip codes as numbers); IIRC either Google or Amazon encodes id numbers as strings, because a double isn't enough to represent a 64-bit integer. Not every number is a physical measurement result!
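That convention is trivial to follow from Haskell, by the way (a sketch; the type is invented):

import Data.Aeson
import Data.Word (Word64)
import qualified Data.Text as T

newtype UserId = UserId Word64

instance ToJSON UserId where
  toJSON (UserId i) = String (T.pack (show i))   -- deliberately a string, never a number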

Which leads us to the third point; if you really need to distinguish Integer and Double, don't use JSON.

And I have to repeat that Double is serialized using scientific notation when it's big or small enough. I don't see any problem here (aeson-1.1 and 1.2):

Prelude Data.Aeson> encode (4.53986e+30 :: Double)
"4.53986e30"

leftaroundabout commented 7 years ago

Aha. Well, I did in fact not get that point, thanks – thought it gave "4539860000000000000000000000000" (as indeed it did in aeson-0.11.2). The new behaviour is for sure better, but I'm not exactly satisfied because this only works in direct encoding, i.e. when bypassing the Value step! For example (aeson-1.2.0.0)

Prelude Data.Aeson> encode (3e+27 * sin 2 :: Double)
"2.727892280477045e27"

– OK, good. However, if we first compute toJSON and then render that Value...

Prelude Data.Aeson Data.Aeson.Encoding.Internal> encodingToLazyByteString . value $ toJSON (3e+27 * sin 2 :: Double)
"2727892280477045000000000000"

...ugh, just the same mess as formerly.

This is, concretely, a problem whenever Value is used as a supposedly content-preserving AST representation. Namely, aeson-pretty doesn't have much of a choice but to work on Values – the only alternatives would be

Well, the latter must work, but what a hack is that! And if that's the policy (“only the Encoding matters, toJSON/Value are just optional helpers”), then the question is, what's the point of having the Value type at all?

IOW, isn't this a testament that Value does not successfully represent JSON at the moment?

phadej commented 7 years ago

The toJSON argument reminds me of https://github.com/bos/aeson/issues/368.

Do I interpret correctly that your primary motivation is to make aeson-pretty benefit from toEncoding? One idea that could actually work is to make toEncoding produce a token stream (a deep embedding), which is then encoded into a ByteString. Then you could freely insert formatting as you wish. I experimented with that idea last summer; a very trivial (not tuned or optimised) approach slowed aeson encoding to about half speed, but I think it can be done without a performance hit. See https://github.com/well-typed/binary-serialise-cbor for an idea. An additional benefit: one could hopefully remove {-# INLINE #-} from about every instance.
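To sketch the deep embedding I mean (all names invented):

import Data.Scientific (Scientific)
import Data.Text (Text)

data Token
  = TkNull
  | TkBool !Bool
  | TkNumber !Scientific
  | TkString !Text
  | TkArrayBegin  | TkArrayEnd
  | TkObjectBegin | TkKey !Text | TkObjectEnd

-- toEncoding would emit a [Token] stream; rendering to ByteString becomes a
-- separate pass, so a pretty-printer can interleave whitespace freely.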

Unfortunately I don't have time to experiment with that idea myself, but it would be cool if someone can make it happen.

Which leads us to the

what's the point of having the Value type at all?

decoding. There isn't a toEncoding analogue for FromJSON; we always go through Value. That is unfortunate: we lose the ordering of the keys (it shouldn't matter, but sometimes it does). On that side we could also first parse to a token stream, and have instances work on the token stream. A bit like SAX and DOM XML parsers.

leftaroundabout commented 7 years ago

Making aeson-pretty benefit from toEncoding? I didn't really have that in mind; it might be a possibility.

What I care about is getting sensibly-formatted results from libraries like aeson-pretty, through whatever method is best suited. The method currently used, i.e. via Value, seems the most sensible – it gives a Haskell-native, direct representation of the semantic content of a JSON file, suited for when a library wants to work with general JSON data without having to worry about anything parsing-related.

Now my point is that (despite the fact that the JSON specification does not say so, and no matter if you like it or not) the distinction between huge integers and huge, pseudo-integral floats is a semantic distinction. toEncoding already acknowledges that now, by encoding Double and Integer/Scientific in a different manner. toJSON should thus acknowledge it too IMO, but it can't while Value has no way to represent that distinction.

About performance, I frankly don't care much. It is my opinion that whenever performance is even slightly important, one should really use binary serialisation. At least, it makes no sense to pretty-format a file if it's so big that performance matters (in which case it's automatically not useful for human reading anymore anyway). It just occurred to me that a dedicated integral option might, as a pleasant side-effect, also improve performance, in particular in some of the applications that employ JSON as a simple data-exchange format to other languages. But that's actually speculative.

phadej commented 7 years ago

I disagree:

I don't think that aeson could ever serve all use cases perfectly. Maybe making a separate library, e.g. using trifecta or megaparsec to get better parsing errors, would be an option too (IIRC better parsing errors were also a requested feature), and it could supersede aeson-pretty on the encoding side. I'd like aeson to keep the "machine-to-machine" use case as its primary one. But I'm not the one deciding on that.

leftaroundabout commented 7 years ago

Well, yes – to emphasize, what I'm saying is just that I consider the performance advantage of my proposal unimportant, but I'm well aware that most users of Aeson do care about performance. Thus my hope that, if the semantic point which I really care about isn't considered important enough, performance could be the selling point. But then again I don't know if it actually would improve performance at all; very possibly it wouldn't.


You're right of course – where do we draw the line, in terms of what specification-lacking subtleties of the JSON structure we should adapt into the Value structure and which we shouldn't?

I'd reckon that in case of ordering and duplicate-keys in objects, it should be pretty clear to everybody that if you need duplicates and/or ordering, an object is the wrong choice and an array should be used instead. Hence type Object = HashMap Text Value is the way to go, if you ask me.

Now in the case of numbers, such an argument could also be made: you might say that "JSON number literals are only safe & exact for small integers (< 2^53)". The specification is furthermore clear that scientific notation is supported. In that light, one might well argue that the correct structure would actually be

data Value = Object !Object
           | Array !Array
           | String !Text
           | Number !Double  -- or perhaps a newtype around Double that disallows NaN etc.
           | Bool !Bool
           | Null
             deriving (Eq, Read, Show, Typeable, Data)

That would certainly keep close to the original JavaScript assumptions. It would be fully understood that Number is only for small numbers or inexact large numbers and that exact, possibly-large integers should always be encoded as String! So honestly, I would be quite happy with that. The entire Scientific type always seemed ugly and hackish to me anyway.
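The newtype alluded to in the code comment above could be as simple as this (a sketch, names invented):

newtype JSONDouble = JSONDouble Double

mkJSONDouble :: Double -> Maybe JSONDouble
mkJSONDouble d
  | isNaN d || isInfinite d = Nothing   -- no JSON number literal exists for these
  | otherwise               = Just (JSONDouble d)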

But would such a proposal be popular? I doubt it. People seem to like writing exact numbers as numbers. This is not compatibility-safe, but it's certainly reasonable. If Aeson wants to support that nonstandard technique, I'm definitely fine with that too, I just don't agree that it warrants introducing bad behaviour (dozens of insignificant zero digits) into numbers that were never intended to be exact.

leftaroundabout commented 7 years ago

All that said... thinking about it once more, I wonder if the super-JSONal power of Scientific couldn't actually be sufficient to make everybody happy, after all. Should it not in principle be possible to make the distinction I want by consistently checking the base10Exponent of the Scientific?
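Something like this, as a first approximation (a sketch; formatScientific is from Data.Scientific, FPFormat from Data.Text.Lazy.Builder.RealFloat):

import Data.Scientific (Scientific, base10Exponent, formatScientific)
import Data.Text.Lazy.Builder.RealFloat (FPFormat (Exponent, Fixed))

-- convention: exponent 0 marks a genuine integer, anything else a float
formatByExponent :: Scientific -> String
formatByExponent s
  | base10Exponent s == 0 = formatScientific Fixed (Just 0) s
  | otherwise             = formatScientific Exponent Nothing s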

phadej commented 7 years ago

@leftaroundabout nice idea!

Ping @bergmark and @basvandijk

ksromanov commented 7 years ago

@leftaroundabout - unfortunately

 *Data.Scientific> base10Exponent $ fromFloatDigits (1.98e2 :: Float)
0

So the heuristic needs to be a bit more complicated. There is a floatingOrInteger heuristic in Data.Scientific, which fails on numbers like fromFloatDigits (1E225 :: Double). Maybe

import Data.Scientific (Scientific, base10Exponent, floatingOrInteger)
import Data.Text.Lazy.Builder (Builder)
import Data.Text.Lazy.Builder.RealFloat (FPFormat (Exponent, Fixed))
import Data.Text.Lazy.Builder.Scientific (formatScientificBuilder)

fromScientific :: Scientific -> Builder
fromScientific s = formatScientificBuilder format prec s
  where
    (format, prec) =
      case floatingOrInteger s :: Either Double Integer of  -- annotation avoids ambiguity
        Right _
          | base10Exponent s == 0 -> (Fixed, Just 0)        -- genuine integer
          | otherwise             -> (Exponent, Nothing)
        Left _ -> (Exponent, Nothing)

will work?

leftaroundabout commented 7 years ago

@ksromanov well, we wouldn't be able to use fromFloatDigits, which is itself a bit heuristic. But we don't need to, do we? How about something along the lines of

import Data.Scientific (scientific)  -- smart constructor; the Scientific data constructor isn't exported

instance ToJSON Double where
  toJSON n = Number $ scientific (round m10) e10
   where (_m2, e2) = decodeFloat n   -- n = _m2 * 2^e2
         -- crude nonzero decimal exponent derived from the binary one
         e10 | e2 < 0    = floor (fromIntegral e2 / 5 :: Double)
             | otherwise = floor (fromIntegral e2 / 5 :: Double) + 1
         m10 = n * 10^^negate e10    -- mantissa such that n ≈ m10 * 10^e10

That would ensure that the base10Exponent is never 0 if the number was encoded from Double.

ksromanov commented 7 years ago

Unfortunately, it is impossible to control, from inside the aeson library, how a Scientific number was created. Somebody could obtain it after a long series of calculations, for instance. :-(

So, unfortunately, we cannot rely on base10Exponent, which would otherwise be a great solution.

leftaroundabout commented 7 years ago

@ksromanov could you elaborate what use case you have in mind where somebody would carry out computations in the Scientific type and then have a problem with extra digits stored in JSON? In most applications, the computations would all be carried out in either Int or Double, and only at the end converted to Scientific by toJSON; a library like aeson-pretty could then basically read the desired precision off the exponent, which would always be zero for integers, always nonzero for Double, and whatever comes in for Scientific.

akhra commented 5 years ago

Breezing by to complain about Python interop. Nothing constructive to add at the moment, but I figure it's worth adding a real-use data point.

milesrout commented 5 years ago

Randomly saw this issue and want to say that JSON's semantics are JSON's semantics. If someone wants to encode data with a distinction between integers and non-integer numbers, they should use a better format than JSON to do it. JSON is a bad format if you care about that. Use one of the many others. If on the other hand you want to use JSON and you want JSON's semantics (for example, if you want to be compatible with literally everything else that uses JSON) then you probably want JSON's semantics and a JSON library should give you them, at least by default.

leftaroundabout commented 5 years ago

@milesrout JSON semantics are “all numbers are 64-bit floats, there's no such thing as a big integer”. So, the morally “right” thing would be to encode all Integers as strings, instead of using JSON numbers at all. This would be really awkward though, especially for a Haskell interface. In that light it's pragmatic to go Python's route and allow big integers, but if you go that route then, like Python, you should also make a distinction between them and floats because a big float, despite being equal to a big integer, is not adequately represented as such.

akhra commented 5 years ago

So, the morally “right” thing would be to encode all Integers as strings, instead of using JSON numbers at all. This would be really awkward though, especially for a Haskell interface.

I'm consuming an upstream service which uses string encoding. It prevents me from using a derived instance for my final data type, but that's about it. It does require a manual read, but as long as you can trust the upstream type safety, that's no more volatile than decode already is.
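The manual read is only a handful of lines anyway (a sketch; the type and its wire format are invented):

import Data.Aeson
import qualified Data.Text as T
import Text.Read (readMaybe)

newtype BigId = BigId Integer

instance FromJSON BigId where
  parseJSON = withText "BigId" $ \t ->
    case readMaybe (T.unpack t) of
      Just i  -> pure (BigId i)
      Nothing -> fail "expected a string-encoded integer"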

llelf commented 5 years ago

@phadej,

So we cannot encode big Integers in scientific format, because it will break interoperability with Python.

But it's already broken

>>> json.dumps(json.loads('10')), json.dumps(json.loads('1.0e1'))
('10', '10.0')
λ> encode . decode @Value <$> ["10", "1.0e1"]
["10","10"]

phadej commented 5 years ago

I'm tired of arguing.

phadej commented 5 years ago

Also note that in some frameworks like servant-lucid the rendering is pure, so those use cases won't benefit from this. But (at least) servant-client could force m ~ Identity.
