ipld / specs

Content-addressed, authenticated, immutable data structures

Representation of Infinity and NaN in DAG-JSON #342

Closed patrsc closed 3 years ago

patrsc commented 3 years ago

DAG-JSON is supposed to support the full IPLD data model. Does this mean that the IEEE 754 special values Infinity, -Infinity, and NaN should also be supported? Currently these values are mapped to null in the JS implementation of DAG-JSON. It is possible to represent these values in DAG-CBOR (the JS implementation supports this too). A possible way to encode them could be the following:

{"/": {"float": "Infinity"}}
{"/": {"float": "-Infinity"}}
{"/": {"float": "NaN"}}

prataprc commented 3 years ago

This is a recurring issue with JSON. Hopefully JSON gets extended with something like JSON5.
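
For context, the JSON grammar has no tokens for these values, so each serializer picks its own behaviour. The thread mentions that the JS implementation maps them to null. Python's standard `json` module behaves differently, as this small sketch shows: by default it emits non-standard tokens, and with `allow_nan=False` it refuses to serialize them.

```python
import json
import math

# Default behaviour: Python emits non-standard tokens that strict JSON parsers reject.
lenient = json.dumps([math.inf, -math.inf, math.nan])

# Strict behaviour: allow_nan=False refuses to serialize non-finite floats.
try:
    json.dumps(math.nan, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)
```

Here `lenient` holds the string `[Infinity, -Infinity, NaN]`, which is not valid JSON, and `strict_error` holds the message of the raised `ValueError`.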

rvagg commented 3 years ago

I don't know if we've properly addressed whether "float" in the data model maps directly to IEEE 754 and therefore whether it should even include infinities and not-a-number. Instinctively I would say that these are not ideal forms to be encoding anyway, but I don't know if I can justify that when IEEE 754 is so widely supported that these specials already have wide utility.

@ipld/core we should probably resolve that base question first, then whether the proposal above is a good idea. It seems like a reasonable approach to me if we accept the place of IEEE 754 specials in the data model.

(mostly though I'd say to just avoid floats entirely in your encoded data, they're a minefield)

patrsc commented 3 years ago

I agree and as far as I know IEEE 754 is the most widely used floating point representation: it is heavily used e.g. in scientific computing and engineering. Also most programming languages support it, so in my opinion it should be available in the IPLD data model to make it also useful in these domains.

rvagg commented 3 years ago

I'm trying to get broader engagement on this question from others (we might need to be patient due to the holiday period). So far this seems to me to be the question space to consider:

  1. Do we allow IEEE 754 to bleed into the Data Model and therefore we support these notions of Infinity, -Infinity, and NaN?
    a. we could rule them out and reject them at encode/decode time so they aren't even usable
    b. we could treat them as magic values for a specific codec + language combination that you happen to get free but don't bet on them being transferable (this is implicitly the current approach and probably needs to be explicit if we adopt this)
    c. or we could make affordances in other codecs to support them (DAG-JSON first, as per that linked issue)
  2. If yes to the above, do we treat these things as part of "float", or do they become new kinds, like "null" is its own kind?
    • this might mean either saying that, by "float", we mean IEEE 754 and they roll up into the "float" kind and we bind ourselves by IEEE 754's limitations and affordances
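
Option 1a could be sketched roughly as follows: a check that runs before encoding (or after decoding) and rejects any non-finite float. The function name `check_data_model` is hypothetical and this is only a minimal illustration, not how any IPLD codec implements it.

```python
import math


def check_data_model(value):
    """Raise ValueError if any float nested in `value` is NaN or infinite.

    Hypothetical helper: walks dicts, lists, and tuples recursively and
    returns the value unchanged when every float in it is finite.
    """
    if isinstance(value, float) and not math.isfinite(value):
        raise ValueError(f"non-finite float not allowed: {value!r}")
    if isinstance(value, dict):
        for item in value.values():
            check_data_model(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            check_data_model(item)
    return value
```

Under option 1b no such check would exist, and whether a special value survives would depend on the codec and language in use.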

patrsc commented 3 years ago

I would appreciate if we go for 1c and have these values as part of the float kind, because this is a widely used approach. This would make it possible to directly use outputs of scientific computing applications in the IPLD data model. I don’t know if it is necessary to make the "float" kind mean exactly IEEE 754, but allowing these special values as part of "float" gives at least good compatibility to IEEE 754.

mvdan commented 3 years ago

I admit I don't have a lot of experience with IEEE 754. At least from my point of view, I've never needed to use the special values, so I lean against making them their own kinds in IPLD. It seems to me like IPLD kinds should be very commonly used.

Personally, I think 1a and 1c are our only reasonable choices. 1b is "they might or might not work across codecs/languages", which in my opinion is perhaps even worse than 1a, as it doesn't seem particularly useful given IPLD's goals, and could easily confuse and mislead users into a false sense of security.

Assuming that there are valid use cases for using these special values in IPLD (e.g. scientific data), and that all of our existing codecs and languages can support the special values, I lean towards 1c. If not, then 1a.

warpfork commented 3 years ago

I appreciate @rvagg's breakdown there, and also tend to place my chips around 1a.

If I'm being highly opinionated: Attempting to build application logic around the special values in IEEE754 floats is a bad idea, period, no matter what language and what context you're operating in. Don't do it: you might not regret it; but if I have betting money, I'll bet you'll regret it. If creating sentinel values in an application, do so highly intentionally; don't use interesting corners of the IEEE754 to do it.

If I'm being highly highly opinionated: literally don't do floating point math and bother to preserve the results. Floating point math is a mistake. Floating point math is acceptable only for estimates -- and because it's only acceptable for estimates, in all situations where you have used floating point math to derive some values, you should still store the original numbers in non-floating point form, such that you're ready to re-do any math on that data in more precise ways in the future. (This is logic that I would especially apply in scientific compute, personally. Science has enough reproducibility problems before throwing floating point precision issues into the mix!)

If I'm being less opinionated: I still agree with the considerations of wariness about promoting IEEE754 corner cases into a problem that IPLD has to worry about in every one of our codecs. The difficulty of representing these values in JSON / DAG-JSON alone is cause for pause. We often expect to be able to use JSON / DAG-JSON as a human-readable format -- even in applications that do their de facto data storage and exchange in other formats, because having an isomorphic human-readable format is just so useful for debugging and development -- which means that expanding the IPLD specs in ways that increase the number of documents that are valid in some IPLD codecs, but aren't cleanly transcodable to the JSON / DAG-JSON codecs... doesn't really seem like a move in a desirable direction.

mikeal commented 3 years ago

1a is my preference as well.

rvagg commented 3 years ago

Will be resolved to 1a if https://github.com/ipld/specs/pull/344 is merged.

vmx commented 3 years ago

I've created a follow-up issue to think about best practices when encountering non-finite numbers like NaN or Infinity: https://github.com/ipld/specs/issues/346

rvagg commented 3 years ago

OK, this got enough agreement that in #344 we've added additional clarity to what "float" in the data model means, which doesn't include NaN and Infinity. There's some further rationale in there, but it should be enough to note that a very large number of bit combinations for an IEEE 754 float resolve as NaN (fewer, but still many, for Infinity and -Infinity), and this extends into CBOR too. With https://github.com/ipld/js-dag-cbor/pull/13 we'll be removing them as an option for DAG-CBOR in JS, and Go will probably follow suit. #346 has some additional thoughts on these symbols and how best to deal with them in content-addressed data. The summary: it's perfectly reasonable to want to use these symbols, but doing it with the IEEE 754 specials in content-addressed data is not a good idea. It's best to do it in a way where there's a precise 1:1 mapping between the symbol in memory and its encoded form.
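
To make the NaN point concrete, here is a small Python sketch showing two different 64-bit patterns that both decode to NaN yet hash differently. The two patterns are arbitrary examples, and SHA-256 over the raw 8 bytes is only a stand-in for hashing a full encoded block. CBOR can also carry floats at half, single, and double width, so the same value can arrive in more than one byte form there as well.

```python
import hashlib
import math
import struct

# Two 64-bit patterns with every exponent bit set and a non-zero mantissa.
# IEEE 754 treats both as NaN; only the payload bits differ.
NAN_BIT_PATTERNS = (0x7FF8000000000000, 0x7FF8000000000001)

# Big-endian byte encodings of each pattern, and the floats they decode to.
encodings = [struct.pack(">Q", bits) for bits in NAN_BIT_PATTERNS]
values = [struct.unpack(">d", raw)[0] for raw in encodings]

# Stand-in for a content address: a hash over the encoded bytes.
digests = [hashlib.sha256(raw).hexdigest() for raw in encodings]

# Number of distinct NaN bit patterns in a 64-bit float:
# 2 sign values times (2**52 - 1) non-zero mantissas.
nan_pattern_count = 2 * (2**52 - 1)
```

Both entries in `values` are NaN, but the two `digests` differ, so the "same" symbol would produce different content addresses. An explicit representation with exactly one encoded form avoids that ambiguity.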