intake / pandas-decimal

decimal extension for pandas
BSD 3-Clause "New" or "Revised" License

The future? #3

Open martindurant opened 1 year ago

martindurant commented 1 year ago

This repo currently is a nice little proof-of-concept.

To be a viable package in the pydata realm, it should have

Should we poke pandas directly to consider upstreaming?

martindurant commented 1 year ago

@jreback I thought you would be just the kind of person that might have an opinion on this little project.

jreback commented 1 year ago

yea i think we would highly consider this to vendor even

but starting as an optional is great

martindurant commented 1 year ago

@jrbourbeau - MattR suggested this would be of interest with your ongoing work using spark-generated data types. Maybe you have already solved everything with arrow's decimal type?

jrbourbeau commented 1 year ago

With pandas 1.5+ we can now use DataFrames with pyarrow-backed decimal128 data. Here's a small snippet that demonstrates this in action:

import decimal
import pandas as pd
import dask.dataframe as dd
import pyarrow as pa

pa_dtype = pa.decimal128(precision=7, scale=3)

data = pa.array(
    [
        decimal.Decimal("8093.012"),
        decimal.Decimal("8094.123"),
        decimal.Decimal("8095.234"),
        decimal.Decimal("8096.345"),
        decimal.Decimal("8097.456"),
        decimal.Decimal("8098.567"),
    ],
    type=pa_dtype,
)
df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))
ddf = dd.from_pandas(df, npartitions=3)

# Some pandas operation
print(f"{df.x * 3 - 1 = }")
# Equivalent Dask operation
print(f"{(ddf.x * 3 - 1).compute() = }")

Experimenting locally, that approach seems to handle decimal data well (though I've not done exhaustive testing). Does that align well with pandas-decimal?

jamescooke commented 1 year ago

On pandas>=2 and pyarrow>=12, there seems to be quite a bit of mileage possible with pd.ArrowDtype(pa.decimal128(...)) as mentioned by @jrbourbeau above.

Should this be wrapped in pandas-decimal, or has progress been made elsewhere? Or maybe some other solution has arisen?

martindurant commented 1 year ago

Since most things should be possible with the arrow extension type alone, there has been no further development or attention here. I daresay I can archive this. The only advantage I see (aside from the technical differences) is the ability to have exact decimals in pandas without requiring arrow. That ship has probably gone.

jamescooke commented 1 year ago

I daresay I can archive this.

Agree - but - the PyPI page is on the first page of search results for me for "pandas decimal", and I've not found any good write-up of how to "do Decimal in pd DataFrames" - do you have any tips / resources? Maybe they could be linked from the README? 🙏🏻

The only advantage I see (aside from the technical differences) is the ability to have exact decimals in pandas without requiring arrow. That ship has probably gone.

Agree the ship has sailed - that's a pretty niche use case. Functions like pd.read_gbq() require pandas-gbq, which installs pyarrow, so it's already in the Python environment for many.

martindurant commented 1 year ago

Agree - but - the PyPI page is on the first page of search results for me for "pandas decimal", and I've not found any good write-up of how to "do Decimal in pd DataFrames" - do you have any tips / resources? Maybe they could be linked from the README?

This is a good point. Maybe I should write up a blog demonstrating decimals in pandas with pyarrow. I can publish this on Anaconda and link in the README as you suggest, but it will take me a little time to get to.

jamescooke commented 1 year ago

A blog with some more guidance about decimals with PyArrow would be great.

Meanwhile I've been digging into this a little more at work to see if we should use decimal128 for a new part of our pipeline. I agree that decimal128 does provide vectorization and fixed-point arithmetic - but there are two features that seem to be missing, and I've not found a fix for either.

Describe has problems

In [1]: import decimal
   ...: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pa_dtype = pa.decimal128(precision=7, scale=3)

In [4]: data = pa.array(
   ...:     [
   ...:         decimal.Decimal("8093.012"),
   ...:         decimal.Decimal("8094.123"),
   ...:         decimal.Decimal("8095.234"),
   ...:         decimal.Decimal("8096.345"),
   ...:         decimal.Decimal("8097.456"),
   ...:         decimal.Decimal("8098.567"),
   ...:     ],
   ...:     type=pa_dtype,
   ...: )
   ...: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))

In [5]: df.describe()
---------------------------------------------------------------------------                              
ArrowInvalid                              Traceback (most recent call last)
...
ArrowTypeError: int or Decimal object expected, got numpy.int64

Simple ops have problems

Continuing with the df above - adding 1 to column 'x' is fine:

In [7]: df['x'] + 1                       
Out[7]: 
0    8094.012
1    8095.123
2    8096.234
3    8097.345
4    8098.456
5    8099.567
Name: x, dtype: decimal128(23, 3)[pyarrow]

However, if we change the precision and scale to use more bits things get hairy:

In [13]: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa.decimal128(precision=38, scale=3)))

In [14]: df['x'] + 1                                                                                                                                                                                               
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
...
ArrowInvalid: Decimal precision out of range [1, 38]: 39

We know from above, where df['x'] + 1 worked and returned values in decimal128(23, 3)[pyarrow], that the results of adding 1 will fit in decimal128(precision=38, scale=3) (right? correct me if I'm wrong 🙏🏻). But there appears to be a problem where the precision and scale of the result are not being managed correctly.

Current conclusion

My guess is that there are workarounds for the issues above - plus possibly even fixes upstream in pandas. However, this extra complexity may not be worth the vectorization and smaller memory footprint for us at this time. My plan is to push ahead with decimal.Decimal objects in the DataFrame and see what walls we collide with 🙏🏻.

martindurant commented 1 year ago

I must admit that I had not tried to do much with arrow decimals. Your notes here are pretty disappointing! If you use dtype="decimal[3]" from this package, describe() at least does work. Also, we are significantly faster:

In [23]: df = pd.DataFrame({"a": [0, 1] * 100000}, dtype=pd.ArrowDtype(pa.decimal128(precision=7, scale=3)))

In [26]: %timeit df.a.std()
1.28 ms ± 7.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [27]: df = pd.DataFrame({"a": [0, 1] * 100000}, dtype="decimal[3]")

In [28]: %timeit df.a.std()
340 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

As for precision, you might expect to be bounded only by integers up to 2**63 (~19 significant digits for numbers near 1). I don't know why arrow has both precision and scale...

In conclusion, decimal.Decimal may solve your problem, but it is orders of magnitude slower:

In [35]: df["a"] = df.a.map(lambda x: decimal.Decimal(str(x)))

In [39]: %timeit df.a.mean()  # this is MEAN because std fails
12.9 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

jamescooke commented 1 year ago

Regarding pyarrow being required by pandas, I found this just now: https://github.com/pandas-dev/pandas/issues/52509