grantjenks / python-runstats

Python module for computing statistics and regression in a single pass.
http://grantjenks.com/docs/runstats/
Other
96 stars 19 forks source link

Support Arbitrary Data Types (like Decimal) #18

Open timothydevries opened 7 years ago

timothydevries commented 7 years ago

We use Decimal instead of float due to a requirement for exact representation of amounts when using floating point arithmetic. We would like to use this package with Decimal amounts, and potentially other data types as well.

Supporting arbitrary types can be done by converting all hard-coded floats into a custom datatype which can be specified as an optional argument in the Statistics/Regression constructor.

Would this be considered as a useful feature? We can submit a pull request which makes the necessary changes. This pull request will not likely include a fast Cython implementation.

grantjenks commented 7 years ago

Maybe. A few concerns/thoughts:

  1. The Python code is adapted from C++ code published at https://www.johndcook.com/blog/skewness_kurtosis/ Part of the appeal of the package is that the algorithm is numerically stable and accurate. The derivation of the algorithm is actually over my head. So my concern is that supporting arbitrary data types could actually void those guarantees. In fact, the conversion of everything to floats fixed a bug that originally occurred when pushing integer values. How can we know that decimal support will remain numerically stable and accurate? It seems trivial to demonstrate that with decimal precision constrained to only 0 or 1 digits then the algorithm would quickly become inaccurate.

  2. I want to keep the fast.pyx and core.py modules as similar as possible for maintainability. If your pull request only supports core.py then the change must be small and well commented. It also presents difficulties from a documentation standpoint.

  3. Can you not work around the problem by converting the results to the desired decimal precision? It seems that if the precision of floats exceeds the precision of your decimals then this could be an option. Floats in Python have about 15-17 significant decimal digits of precision.

timothydevries commented 7 years ago
  1. That's a good point. We have only tested only with floating-point based types, which do seem to work. However, if arbitrary data types are supported but the algorithm may not work correctly in some cases for integer types, then that would have to be documented or the algorithm may need to change. I can see why this would be undesirable. As for decimal support, we are using the python Decimal class, so we don't encounter stability or accuracy issues.

  2. The change would involve only adding a single optional argument to the constructor and a thin wrapper around each declaration of a constant float, so hopefully this would be considered small enough to be maintainable.

  3. Unfortunately we can't convert float back to Decimal or other related types while retaining precision. We lose precision when initially converting from Decimal to float. We use Money-based types which use Decimal for the amount, and expect operations such as addition and subtraction between Decimals with fractional amounts to keep their precision. For example, from https://docs.python.org/2/library/decimal.html , we expect 0.1 + 0.1 + 0.1 - 0.3 to sum to zero, whereas when using float it actually sums to a value close to zero. We use Decimal over float in order to avoid this problem when carrying out a number of operations on numerical amounts, and would like to be able to keep using this data type while using several of the functions provided by this package.

Please let us know whether you consider these issues to be workable with the current version of the package, or whether we should continue to use a fork of core.py for Decimal and Money amounts.

grantjenks commented 7 years ago

I thought it would be money/currency. That's when I last used the decimal module too :)

Regarding (1) and (3), I am not sure you understand. How many digits of precision do you need the internal calculations to have? You can always quantize the result:

>>> from decimal import Decimal
>>> value = 0.1 + 0.1 + 0.1 - 0.3
>>> d = Decimal(value)
>>> d
Decimal('5.5511151231257827021181583404541015625E-17')
>>> q = d.quantize(2)
>>> q
Decimal('0')

By default, the decimal module only gives you 28 digits of precision:

>>> from decimal import getcontext
>>> getcontext()
Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999999, Emax=999999999, capitals=1, flags=[Rounded, Inexact], traps=[DivisionByZero, InvalidOperation, Overflow])
>>> Decimal('22') / Decimal('7')
Decimal('3.142857142857142857142857143')

This was also informative: http://stackoverflow.com/questions/3840793/rounding-standards-financial-calculations

And regarding (2), please do not pepper "Decimal" or "TypeName" around all the constants. That seems very error prone. Better to lift them out. Are there many?

timothydevries commented 7 years ago

Yes, in practice we can probably get away with rounding the result. We use the default 28 digits of precision for Decimal which is enough for our calculations thus far, and we could probably reduce this slightly. We mainly want to use Decimal for consistency purposes, by using one data type throughout all of our code, thereby avoiding some of the pitfalls that we would normally encounter by using floats. We would like to be able to call a statistics library such as this one with Decimal amounts directly, without converting to and from float, and without explicit rounding. Another benefit of this approach is that any metadata associated with the data type used (such as currency amounts) is carried through and doesn't need special handling during conversion.

For point 2, there are a fair few places which have float constants, which do need to be lifted out or converted in-place. It comes to around 15 lines that need changing, not including the tests, which would also need a fair amount of changes.

grantjenks commented 7 years ago

I tried prototyping some changes to see how it would look/work and I didn't like what it did to the code. I think I get what you're trying to do but it doesn't really fit my use case or how I intended runstats to be used.

That said, I am open to hearing from others. So I want to leave this open with the label "Feature Request (Needs Votes)"