w3c / EasierRDF

Making RDF easy enough for most developers

Address Pitfalls of Numerical Datatypes in RDF #82

Open jmkeil opened 3 years ago

jmkeil commented 3 years ago

There are a couple of issues with numerical datatypes that make the accurate use of RDF for numerical data error-prone.

The use of xsd:float and xsd:double entails a risk

In most cases, xsd:decimal would be a better choice:

Using xsd:decimal to represent values does not considerably impede the use of floating point arithmetic for calculations (e.g. for performance reasons), as the conversion is trivial. The other direction is harder: if rounding of the lexical representation must be avoided, it requires custom lexical mappings that do not conform to the standard, are probably cumbersome to implement (depending on the framework), and are not always possible (e.g. inside SPARQL queries).
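A minimal sketch of this asymmetry, using Python's `decimal` module as a stand-in for an arbitrary precision decimal implementation (the variable names are only illustrative):

```python
from decimal import Decimal

# The number as a data curator would state it with xsd:decimal.
stated = Decimal("0.1")

# Decimal -> binary floating point: a single call; the rounding is deliberate.
as_float = float(stated)

# Binary floating point -> decimal: this recovers the exact value stored in
# the double, not the number that was originally written down.
recovered = Decimal(as_float)

print(recovered)            # the exact binary64 value, slightly above 0.1
print(recovered == stated)  # False
```

Once a value has passed through a binary floating point type, the originally stated decimal number can no longer be told apart from its neighbours that round to the same double.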

However, I don't see awareness of these issues in general, and especially not in teaching material.

Further, RDF unnecessarily inherits limitations from XSD: Exponential notation is only supported for xsd:float and xsd:double, but not for xsd:decimal (and derived datatypes). It was not included in xsd:decimal as the requirement was already met by the precisionDecimal datatype, which, however, did not become a built-in datatype in RDF. This tempts users to use xsd:double even where it is not appropriate. The shorthand syntax in Turtle, TriG and SPARQL amplifies this further, as xsd:double might be used even if not intended.
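To illustrate how the shorthand syntax picks a datatype, here is a small sketch that classifies a numeric token the way the Turtle 1.1 grammar does. The regular expressions are transcribed from the `INTEGER`, `DECIMAL` and `DOUBLE` productions; the function name `shorthand_datatype` is made up for this example.

```python
import re
from typing import Optional

# Transcribed from the Turtle 1.1 productions INTEGER, DECIMAL, DOUBLE, EXPONENT.
EXPONENT = r"[eE][+-]?[0-9]+"
INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")
DOUBLE = re.compile(
    rf"[+-]?(?:[0-9]+\.[0-9]*{EXPONENT}|\.[0-9]+{EXPONENT}|[0-9]+{EXPONENT})"
)

def shorthand_datatype(token: str) -> Optional[str]:
    """Return the datatype implied by a bare numeric token, or None."""
    if DOUBLE.fullmatch(token):
        return "xsd:double"
    if DECIMAL.fullmatch(token):
        return "xsd:decimal"
    if INTEGER.fullmatch(token):
        return "xsd:integer"
    return None

for token in ["1", "1.0", "1.0e0", "6.022E23"]:
    print(token, "->", shorthand_datatype(token))
```

The only difference between `1.0` and `1.0e0` is the exponent, yet the second one silently becomes an xsd:double.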

(A more detailed discussion of the issues can be found in arXiv:2011.08077 and some reviewer comments on it.)

Possible Actions

I think the following actions would help to ease the accurate representation of numbers in RDF:

  1. Enable exponential notation for xsd:decimal (and derived datatypes) in RDF.
  2. Emphasize in teaching material the risk of numerical issues and the only partial coverage of the lexical space by the value space of xsd:float/xsd:double, which results in rounded values after the lexical mapping.
  3. Enable tools to suggest xsd:decimal in place of xsd:float and xsd:double, and to warn users when an entered xsd:float or xsd:double lexical value would require rounding during the lexical mapping.
  4. Maybe change Turtle, TriG and SPARQL syntax to use exponential notation as shorthand syntax for xsd:decimal instead of xsd:double.
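The warning from action 3 could be sketched as follows for xsd:double. This is only an illustration, not an existing tool: the function name `requires_rounding` is hypothetical, it assumes the lexical form is a finite numeral (INF and NaN are not handled), and it relies on Python's `float` being IEEE 754 binary64.

```python
from decimal import Decimal

def requires_rounding(lexical: str) -> bool:
    """Return True if an xsd:double lexical form is not exactly representable."""
    stated = Decimal(lexical)         # the number exactly as written
    stored = Decimal(float(lexical))  # exact value of the nearest binary64
    return stated != stored

print(requires_rounding("0.1"))  # True: 0.1 has no finite binary expansion
print(requires_rounding("0.5"))  # False: 0.5 is exactly 2**-1
```

An editor or validator could run such a check on every xsd:double literal and suggest xsd:decimal whenever it returns True.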

Actions 1 to 3 would not cause any backward compatibility problems. Action 4, however, would obviously cause backward compatibility problems in software, but might at the same time increase the accuracy of value representations in existing RDF documents without changing them.

Further, one could think about adding mandatory support for precisionDecimal (to have an arbitrary precision datatype with a representation of infinity), but that is a new feature and goes beyond making RDF easier.

jmkeil commented 1 year ago

To make this issue more actionable, here are some more details, some thoughts about requirements, and a solution sketch.


Problem

  1. For the datatypes xsd:float and xsd:double, multiple lexical representations get mapped to the same value by rounding. For example, "0.1"^^xsd:float gets mapped to 0.100 000 001 4.... This misleads data curators into believing they state precise numbers when they actually state slightly different values.
  2. xsd:float and xsd:double force compliant implementations either to use floating point arithmetic or to feed rounded input values into decimal arithmetic with arbitrary precision. xsd:decimal forces fully compliant implementations to use decimal arithmetic with arbitrary precision, and forces minimally conforming implementations to preserve a precision of at least 16 digits (one more than double precision floating point arithmetic guarantees). Even popular implementations (e.g. Virtuoso) fail to comply with this. The precision actually needed for calculations is a matter of the application problem, not of the data used. However, RDF requires data curators to decide on it. Currently, RDF restricts the choice of an arithmetic that is reasonable for a problem, which might make compliant implementations less efficient, harder or impossible to write (e.g. due to hardware capabilities, response time constraints and language/library support), or less precise than required.
  3. Syntactic sugar in JSON-LD, Turtle, TriG and SPARQL, as well as missing support in xsd:decimal for infinite values, NaN (see e.g. OM issue 57) and exponential notation, tempt data curators to use xsd:float and xsd:double and thereby to distort the stated values.
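The xsd:float example from point 1 can be reproduced with a short sketch. The helper `float32_value` is hypothetical; it rounds to IEEE 754 binary32 by packing the number into four bytes with `struct`. Note the assumption that the value goes through a binary64 first, which could in rare cases round differently than a direct decimal to binary32 conversion.

```python
import struct
from decimal import Decimal

def float32_value(lexical: str) -> Decimal:
    """Exact decimal expansion of the binary32 value nearest to the lexical form."""
    # Pack as a 4-byte IEEE 754 float and unpack again to force binary32 rounding.
    (rounded,) = struct.unpack("<f", struct.pack("<f", float(lexical)))
    return Decimal(rounded)

print(float32_value("0.1"))  # 0.100000001490116119384765625

# Two different lexical forms, one and the same value:
print(float32_value("0.1") == float32_value("0.100000001"))  # True
```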

For a more detailed description of the problem refer to The Problem with XSD Binary Floating Point Datatypes in RDF (talk recording).

Requirements

A couple of requirements follow from these problems:

  1. Avoid partial coverage of lexical spaces by value spaces to avoid ambiguity and to not fool data curators.
  2. Do not restrict the choice of an arithmetic with the data.
  3. Permit exponential notation for arbitrary precise numbers.
  4. Existing data can be used by new software.
  5. Existing distorted data get fixed.
  6. Enable explicit binary representation of IEEE 754 binary32 (float) or IEEE 754 binary64 (double) values that cannot be misinterpreted as decimal numbers.
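As a rough illustration of requirement 3, exponential notation and exact decimal values are not in conflict. The following sketch (the function name `to_plain_decimal_lexical` is made up) expands an exponential lexical form into the plain digit form that xsd:decimal already accepts today, without any rounding:

```python
from decimal import Decimal

def to_plain_decimal_lexical(lexical: str) -> str:
    """Expand exponential notation into plain decimal digits, exactly."""
    # Decimal parses the string exactly; the "f" format writes it out
    # without an exponent and, with no precision given, without rounding.
    return format(Decimal(lexical), "f")

print(to_plain_decimal_lexical("1.5E-7"))         # 0.00000015
print(to_plain_decimal_lexical("6.02214076E23"))  # a 24-digit integer
```

The second output shows why exponential notation matters for usability: the plain form is exact but hard to read and easy to mistype.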

Solution Draft

As a basis for discussion I would like to propose the following (challenging/maybe unrealistic) list of changes to address the problem:

  1. Add exponential notation to the lexical space of xsd:decimal.
  2. Add NaN, -Inf, Inf, and +Inf to the lexical space and value space of xsd:decimal.
  3. Relax the minimal 16 digits constraint for xsd:decimal on minimally conforming implementations.
  4. Add datatype …:HexFloat with lexical space 0x00000000 to 0xffffffff/0xFFFFFFFF (8 hex digits) and value space of IEEE 754 binary32.
  5. Add datatype …:HexDouble with lexical space 0x0000000000000000 to 0xffffffffffffffff/0xFFFFFFFFFFFFFFFF (16 hex digits) and value space of IEEE 754 binary64.
  6. Interpret non integer numbers in JSON-LD as xsd:decimal instead of xsd:double.

    • Permitted according to ECMA-404.
    • Permitted according to RFC8259:

      This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

      Summarized: Expect non IEEE 754 binary64 values to get approximated.

    • Possible due to points 1, 2 and 3.
  7. Interpret numbers in exponential notation in Turtle as xsd:decimal instead of xsd:double. Possible due to point 1.
  8. Interpret numbers in exponential notation in TriG as xsd:decimal instead of xsd:double. Possible due to point 1.
  9. Interpret numbers in exponential notation in SPARQL as xsd:decimal instead of xsd:double. Possible due to point 1.
  10. Interpret explicitly typed xsd:float and xsd:double literals as xsd:decimal. Possible due to points 1 and 2.
  11. Deprecate xsd:float and xsd:double. Possible due to points 6 to 10.
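The hexadecimal datatypes from points 4 and 5 could look roughly like the sketch below. Everything here is an assumption made for illustration: the function names are hypothetical, the byte order is taken to be most significant byte first, and the lexical form is the raw bit pattern with a `0x` prefix (a binary32 occupies 32 bits, i.e. eight hex digits, and a binary64 occupies 64 bits, i.e. sixteen).

```python
import struct

def to_hex_float(value: float) -> str:
    """Bit pattern of the nearest IEEE 754 binary32 as 8 hex digits."""
    return "0x" + struct.pack(">f", value).hex()

def to_hex_double(value: float) -> str:
    """Bit pattern of an IEEE 754 binary64 as 16 hex digits."""
    return "0x" + struct.pack(">d", value).hex()

def from_hex_double(lexical: str) -> float:
    """Inverse of to_hex_double: decode the bit pattern back to a binary64."""
    return struct.unpack(">d", bytes.fromhex(lexical[2:]))[0]

print(to_hex_float(0.1))   # 0x3dcccccd
print(to_hex_double(0.1))  # 0x3fb999999999999a
print(from_hex_double(to_hex_double(0.1)) == 0.1)  # True
```

Such a lexical form cannot be mistaken for a decimal number, and every lexical form maps to exactly one value without rounding.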

Compatibility Considerations

Old implementations with new data:

New implementations with old data:

Old implementations interacting with new/upgraded implementations:


This would of course not be the easiest change to the RDF standards, especially as it also touches the XML standards. But I think it is important to address this to make RDF a reliable framework for the representation of numeric data. What do you think about it? (e.g. @afs, @VladimirAlexiev, @gkellogg, @namedgraph)

namedgraph commented 1 year ago

@danbri might have an opinion :)