man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Improve aggregation numerical accuracy #603

Open · qc00 opened this issue 1 year ago

qc00 commented 1 year ago

Is your feature request related to a problem? Please describe.

The test_hypothesis_mean_agg_dynamic test can fail on the float inputs below:

0                   1
1                   2
2                   1
3                   3
4                   1
5    9007199254740992
6                   1
7   -9007199254740992

The correct mean is obviously 1.125, but our code thinks it's 1.0 whereas Pandas returns 1.5, so neither is right. (BTW, Excel gets it right!)
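
For reference, the exact mean can be confirmed with rational arithmetic (a quick illustrative check, not part of the test):

```python
from fractions import Fraction

vals = [1, 2, 1, 3, 1, 9007199254740992, 1, -9007199254740992]
print(sum(Fraction(v) for v in vals) / len(vals))  # 9/8, i.e. 1.125
```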

Describe the solution you'd like

Use more accurate floating-point algorithms like the Kahan summation algorithm.
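
For illustration, here is a minimal sketch of compensated (Kahan) summation applied to the example values. `kahan_mean` is a hypothetical helper, not ArcticDB's actual implementation:

```python
def kahan_mean(values):
    """Mean via Kahan (compensated) summation, which carries rounding error forward."""
    total = 0.0
    compensation = 0.0  # low-order bits lost from previous additions
    for x in values:
        y = x - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total / len(values)

vals = [1.0, 2.0, 1.0, 3.0, 1.0, 9007199254740992.0, 1.0, -9007199254740992.0]
print(kahan_mean(vals))  # 1.125 on this input, matching the exact mean
```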

Describe alternatives you've considered

Additional context

Stack

```
1: FAILED tests/unit/arcticdb/version_store/test_aggregation_dynamic.py::test_hypothesis_mean_agg_dynamic - AssertionError: DataFrame.iloc[:, 0] (column name="a") are different
1:
1: DataFrame.iloc[:, 0] (column name="a") values are different (100.0 %)
1: [index]: [0]
1: [left]: [1.0]
1: [right]: [1.25]
1: Falsifying example: test_hypothesis_mean_agg_dynamic(
1:     lmdb_version_store_dynamic_schema=NativeVersionStore: Library: local.test_561_2023-06-30T11_34_00_053009, Primary Storage: lmdb_storage.,
1:     df=
1:       grouping_column             a
1:     0               0  1.000000e+00
1:     1               0  2.000000e+00
1:     2               0  1.000000e+00
1:     3               0  3.000000e+00
1:     4               0  1.000000e+00
1:     5               0  9.007199e+15
1:     6               0  1.000000e+00
1:     7               0 -9.007199e+15
1:     ,
1: )
1:
1: You can reproduce this example by temporarily adding @reproduce_failure('6.72.4', b'AXic42AAAkYQwQRnMYNZTBCO/P//HxggACLECBcCcgB18QY0') as a decorator on your test case
```
jamesmunro commented 12 months ago

NumPy seems to get this wrong as well:

>>> a = np.array([1, 2, 1, 3, 1, 9007199254740992, 1, -9007199254740992])
>>> a.dtype
dtype('int64')
>>> a.mean()
1.0
>>>

Looks like this is a result of coercing to float first.

>>> a.sum() / len(a)
1.125
>>> a.astype('float').mean()
1.0

Having numpy as a reference implementation is useful, and since we reproduce its behavior here, that makes this low priority.

I get 1.0 for Pandas as well.

>>> s = pd.Series([1, 2, 1, 3, 1, 9007199254740992, 1, -9007199254740992])
>>> s.mean()
1.0

It's possible the 1.5 (in an older version of pandas?) is a result of treating MAXINT as NaN; dropping those two values would leave a mean of 9/6 = 1.5.
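
A quick check of that hypothesis (illustrative only): if the two 2**53-magnitude values were treated as missing, the default skipna mean of the remaining six values is indeed 1.5.

```python
import numpy as np
import pandas as pd

# Pretend the two 2**53-magnitude values were treated as missing.
s = pd.Series([1, 2, 1, 3, 1, np.nan, 1, np.nan])
print(s.mean())  # 1.5 (sum 9 over the 6 non-NaN values)
```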

jamesmunro commented 12 months ago

@qc00 Given my statement above about being in line with numpy and pandas behavior, I think we should change the test.

qc00 commented 12 months ago

I should clarify that the test uses floating-point numbers. I listed the values as integers to avoid inexactness.

Given Pandas is no longer guaranteed to be numpy-based, I guess we should match Pandas.