stargate / data-api

JSON document API for Apache Cassandra (formerly known as JSON API)
https://stargate.io
Apache License 2.0
14 stars 16 forks source link

Suggest DECIMAL cql type to be returned as string #1608

Open hemidactylus opened 1 month ago

hemidactylus commented 1 month ago

The rounding issues that essentially led to creating the DECIMAL cql type would dictate such columns to be returned as strings, to let callers/users/clients to do the right thing. Otherwise, unexpected results may occur.

Principle

In binary floating-point 3.3 - 2.2 - 1.1 != 0, because of rounding issues during the calculation. Decimal holds a special representation for numbers with decimals that avoids that.

In pure CQL

This table has two columns (decimal, float) populated the same, which demonstrate this difference:

CREATE TABLE default_keyspace.tmp_d (
    a int PRIMARY KEY,
    b decimal,
    c float
);
cassandra@cqlsh:default_keyspace> select * from tmp_d  ;

 a  | b    | c
----+------+------
 -1 | -1.1 | -1.1
 -2 | -2.2 | -2.2
  3 |  3.3 |  3.3

The DECIMAL column:

cassandra@cqlsh:default_keyspace> select sum(b) from tmp_d;

 system.sum(b)
---------------
           0.0

The FLOAT column:

cassandra@cqlsh:default_keyspace> select sum(c) from tmp_d;

 system.sum(c)
---------------
   -1.1921e-07

Data API

When reading this table with the Data API (dockerized 1.0.18 to be precise), this is what I see:

{
    "data": {
        "documents": [
            {
                "a": -1,
                "b": -1.1,
                "c": -1.1
            },
            {
                "a": -2,
                "b": -2.2,
                "c": -2.2
            },
            {
                "a": 3,
                "b": 3.3,
                "c": 3.3
            }
        ],
        "nextPageState": null
    },
    "status": {
        "warnings": [
            "Zero filters were provided in the ..."
        ]
    }
}

As a result, there is a lossy math when the original intent of the column type was precisely to avoid that.

tatu-at-datastax commented 1 month ago

I do not follow: since DECIMAL is specifically 10-based value it should have exact textual representation.

But even with 2-based (binary) floating-point I am not sure I get the point: yes, number representation as 10-based text may need to be rounded -- but would of JSON String make any difference? How would that help?

tatu-at-datastax commented 1 month ago

Ok, based on discussions this is concern wrt JSON handling differing between JSON Numbers and JSON Strings -- so if server-side was to use 2-based FPs (float, double) for all numbers, then use of JSON Strings would avoid the issue.

As thing are, Data API uses Java BigDecimal for all JSON numbers with fractions so we should be good -- but if testing shows problems, we def need to address.

vkarpov15 commented 4 weeks ago

My 2 cents: I think returning decimals as numbers is the right way to go. There shouldn't be any loss of precision since JSON serialization isn't doing any arithmetic. And it seems semantically cleaner for numbers to end up as numbers in JSON.

However, serializing decimals as strings wouldn't be unprecedented. For example, MongoDB's JavaScript BSON serializer only supports Decimal128.fromString(), there's no way to convert a JavaScript number (binary floating point) to a Decimal128 without going through a string first.

amorton commented 3 weeks ago

Reading through...

In pure CQL

Not sure how this exactly relates to the issue of encoding decimal, yes there is a reason for having floats and decimals.

I will note that the Decimal and Float sums above are the same, depending on the number of decimal places and this is why we have the decimal

In [5]: s = "-9.9362e-10"

In [6]: f = float(s)

In [8]: print(f"10 decimal places: {f:.10f}")
10 decimal places: -0.0000000010

In [12]: print(f"1 decimal places: {f:.1f}")
1 decimal places: -0.0

# using new rounding https://peps.python.org/pep-0682/
In [20]:  print(f"1 decimal places: {f:z.1f}")
1 decimal places: 0.0

In [11]: float("-0.0") == float("0")
Out[11]: True

In [14]: float("-0.0") == int("0")
Out[14]: True

# false if we use the full exponent 
In [15]: float(s) == int("0")
Out[15]: False

Data API

I am not following the example when it says "As a result, there is a lossy math when the original intent of the column type was precisely to avoid that." the values returned by the Data API are exactly the same as those returned by CQL.

For example, MongoDB's JavaScript BSON serializer only supports Decimal128.fromString(), there's no way to convert a JavaScript number (binary floating point) to a Decimal128 without going through a string first.

This is the mongo docs for this https://github.com/mongodb/specifications/blob/master/source/bson-decimal128/decimal128.md

and https://www.mongodb.com/docs/manual/reference/mongodb-extended-json/#mongodb-extended-json--v2-

Thoughts...

not decisions just things to be considered....

JSON view on numbers

As the JSON - Data Types page says....

Numbers in JSON are agnostic with regard to their representation within programming languages. While this allows for numbers of arbitrary precision to be serialized, it may lead to portability issues... There is no inherent precision loss in serializing a machine-level binary representation of a floating-point number (like binary64) into a human-readable decimal representation (like numbers in JSON) and back since there exist published algorithms to do this exactly and optimally.

My view on this is the JSON protocol says it has a way to encode an number that is not infinite or NaN. And then says it is up to the parsing code to alert the user if there is an issue with the decoding such as if the code cannot represent the number.

Mongo DB view on Decimal

If a decimal is declared, they are always using an EJSON value { "$numberDecimal": "<number>" } this is the same for their Canonical and Relaxed formats for their JSON formatting. The BSON formatting is "IEEE 754-2008 128-bit decimal floating point " , this is what they are sending over the wire and remember BSON as a little richer data types.

And if I read the spec correctly they are telling clients to not round trip into the native decimal types.

Telephone number problem

What happens to the classic telephone number problem, the string "012345" is a string of digits not a number.

Relaxed Vs Strict

We are building relaxed representation first, and the idea is to be as basic a JSON as we can. Any read will return return the schema, if a client wants to go beyond handling a number as JSOn number it can use the schema returned to know what is a decimal. The parsing in Python and Node.js and other libraries allows for special handling of fields and floats to handle these cases.

tatu-at-datastax commented 2 weeks ago

Related: filed #1654 wrt "other direction" for decimal type.