scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0
13.46k stars 1.27k forks source link

CAST/MIN(CAST)/MAX(CAST) to text /varchar from float/double/timestamp: the result is trimmed #3109

Closed juliayakovlev closed 6 years ago

juliayakovlev commented 6 years ago

Installation details Scylla version (or git commit hash): master OS (RHEL/CentOS/Ubuntu/AWS AMI): Fedora

CAST/MIN(CAST)/MAX(CAST) to text /varchar from float/double/timestamp: the result is trimmed. Expected result is taken from Cassandra .

Scenario 1: Timestamp

  1. Create table and pre-fill: CREATE TABLE cast_timestamp_test (k int PRIMARY KEY, timestamp_clmn timestamp INSERT INTO cast_timestamp_test (k, timestamp_clmn) VALUES (0, '2017-12-27 11:57:42+0000') INSERT INTO cast_timestamp_test (k, timestamp_clmn) VALUES (1, '2018-02-01 01:00:42+0000') INSERT INTO cast_timestamp_test (k, timestamp_clmn) VALUES (2, '1900-01-01 00:00:00+0000')
  2. Cast timestamp to text: SELECT cast(timestamp_clmn as text) FROM cast_timestamp_test WHERE k=0
  3. Expected result: '2017-12-27T11:57:42.370Z'. Actual result: '2017-12-27T11:57:42'

Scenario 2: Float

  1. Create table and pre-fill: CREATE TABLE cast_float_test (k int PRIMARY KEY, float_clmn float INSERT INTO cast_float_test (k, float_clmn) VALUES (0, -564) INSERT INTO cast_float_test (k, float_clmn) VALUES (1, 3543.45654297)
  2. Cast float to text: SELECT cast(float_clmn as text) FROM cast_float_test; Results:
    • For k=0: Expected result: '-564.0'. Actual result: '-564'
    • For k=1: Expected result: '3543.4565'. Actual result: '3543.46'
  3. MIN(CAST) to text: SELECT min(cast(float_clmn as text)) FROM cast_float_test Expected result: '-564.0'. Actual result: '-564'
avodaniel commented 6 years ago

There is a similar issue for decimal:

Scylla:

cqlsh> SELECT decimal_clmn FROM cast_decimal_test;
125.0

cqlsh> select CAST(decimal_clmn AS text) from cast_decimal_test ;
125

C*:

cqlsh> SELECT decimal_clmn FROM cast_decimal_test;
125.0

cqlsh> select CAST(decimal_clmn AS text) from cast_decimal_test ;
125.0
avodaniel commented 6 years ago

We obviously want to implement toString() method for these data types. Determining number of decimal points to be printed is a bit complicated in Java, it is not clear to me why not to simply print the number in the biggest possible precision.

From Java documentation at https://docs.oracle.com/javase/7/docs/api/java/lang/Float.html#toString(float):

How many digits must be printed for the fractional part of m or a? There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type float. That is, suppose that x is the exact mathematical value represented by the decimal representation produced by this method for a finite nonzero argument f. Then f must be the float value nearest to x; or, if two float values are equally close to x, then f must be one of them and the least significant bit of the significand of f must be 0.

slivne commented 6 years ago

@tzach / @avikivity

avodaniel commented 6 years ago

I prefer providing all the digits, but there can be a problem with numbers that cannot be converted exactly to the binary numeral system, eg. 0.1 is not stored precisely and we would display the truly stored value: 0.100000001.

avodaniel commented 6 years ago

So we need to implement printing of .0 for integers stored in floating point and we need to implement printing of more decimal digits for floating points with non-zero fraction part. We can either implement Java's algorithm or we can use C++/Boost way: print all possible decimal digits after the dot.

avikivity commented 6 years ago

We can probably use fmt:

The '#' option causes the “alternate form” to be used for the conversion. The alternate form is defined differently for different types. This option is only valid for integer and floating-point types. For integers, when binary, octal, or hexadecimal output is used, this option adds the prefix respective "0b" ("0B"), "0", or "0x" ("0X") to the output value. Whether the prefix is lower-case or upper-case is determined by the case of the type specifier, for example, the prefix "0x" is used for the type 'x' and "0X" is used for 'X'. For floating-point numbers the alternate form causes the result of the conversion to always contain a decimal-point character, even if no digits follow it. Normally, a decimal-point character appears in the result of these conversions only if a digit follows it. In addition, for 'g' and 'G' conversions, trailing zeros are not removed from the result.

elcallio commented 6 years ago

Fwiw, there is no format specifier combo nor ostream modifier flags that are really equivalent to Java's floating point formatter (FloatingDecimal.dtoa). I.e. using a %#g (alt form) gives you 3543.46 and -564.000. And in this form, setting precision for "g" forces full right-side zero padding. Using %#f sortof matches the cases here, but then you'll never have exponential formatting.

While the missing precision I can see an issue with, I'm lost as to why a missing decimal point is significant? String matching in dtests?

juliayakovlev commented 6 years ago

Because of Cassandra returns the digital points. Matching in dtest is performing according to Cassandra results.

avodaniel commented 6 years ago

@juliayakovlev And why not to use abs(x - y) < eps? Comparision of floating-points can be made in a lot of ways and comparing them by their string representation is one of the worst ways.

juliayakovlev commented 6 years ago

It's not connected to comparison. The returned Scylla value is not as it returned from Cassandra. By @slivne request the value, returned from Scylla, must be same as in Cassandra.

slivne commented 6 years ago

Now that we know why there is such a difference and that there is no simple solution for float and double I think that we can work with what we have - @tzach should approve this.

On Wed, Jun 6, 2018 at 8:48 AM, juliayakovlev notifications@github.com wrote:

It's not connected to comparison. The returned Scylla value is not as it returned from Cassandra. By @slivne https://github.com/slivne request the value, returned from Scylla, must be same as in Cassandra.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/scylladb/scylla/issues/3109#issuecomment-394947866, or mute the thread https://github.com/notifications/unsubscribe-auth/ADThCAXG0oD5hgO4tlpcatP0jFtzidAbks5t521LgaJpZM4RPyUp .

tzach commented 6 years ago

Its a compatibility issue we need to fix.

elcallio commented 6 years ago

@tzach, do you seriously want us to implement FloatingDecimal.dtoa in scylla? This seems like a highly risky approach for something that is absolutely not a bug.

slivne commented 6 years ago

@avikivity / @tzach - I agree with Calle - I think we should accept the fact we are not running in java and return the correct result based on the implementation we have - it seems extreme to try and reimplement java code for this.

slivne commented 6 years ago

@tzach ping

tzach commented 6 years ago

On a second look, the difference seems very minor, and we can live with it. A change from 3543.4565 to 3543.46 or vice versa should not impact the application, as long as the driver returns the same type to the application - as it is since it is not changed. Closing this until and if we have a real use case from a user.

juliayakovlev commented 6 years ago

This rounding may be a big problem for banks/investment companies - every company that deals with the money

tzach commented 6 years ago

@juliayakovlev why would they store currency in a float type?

juliayakovlev commented 6 years ago

I worked with OppenheimerFunds. They used Oracle for save the transactions. There were a huge amount of transactions and they never did not round sum/avg.. after calculation.

avodaniel commented 6 years ago

Then all non-trivial decimal points (without the trailing zeroes) can be printed. Scylla would print not-less decimal digits than Java.

But floating numbers are really unintuitive for humans and can give unexpected results. I am surprised that they use them for finance applications. I believe that Scylla should rather optimise decimal numbers to become more attractive for finance companies than spend time with the tuning of rounding of floating points.

---- On Mon, 06 Aug 2018 09:39:08 +0200 juliayakovlev notifications@github.com wrote ----

I worked with OppenheimerFunds. They used Oracle for save the transactions. There were a huge amount of transactions and they never did not round sum/avg.. after calculation. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.