martindurant opened 1 year ago
@jreback I thought you would be just the kind of person that might have an opinion on this little project.
yea i think we would highly consider even vendoring this
but starting as an optional dependency is great
@jrbourbeau - MattR suggested this would be of interest with your ongoing work using spark-generated data types. Maybe you have already solved everything with arrow's decimal type?
With pandas 1.5+ we can now use DataFrames with pyarrow-backed decimal128 data. Here's a small snippet which demonstrates that in action:
import decimal
import pandas as pd
import dask.dataframe as dd
import pyarrow as pa
pa_dtype = pa.decimal128(precision=7, scale=3)
data = pa.array(
[
decimal.Decimal("8093.012"),
decimal.Decimal("8094.123"),
decimal.Decimal("8095.234"),
decimal.Decimal("8096.345"),
decimal.Decimal("8097.456"),
decimal.Decimal("8098.567"),
],
type=pa_dtype,
)
df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))
ddf = dd.from_pandas(df, npartitions=3)
# Some pandas operation
print(f"{df.x * 3 - 1 = }")
# Equivalent Dask operation
print(f"{(ddf.x * 3 - 1).compute() = }")
Experimenting locally, that approach seems to work well for decimal data (though I've not done exhaustive testing). Does that align well with pandas-decimal?
On pandas>=2 and pyarrow>=12, there seems to be quite a bit of mileage possible with pd.ArrowDtype(pa.decimal128(...)), as mentioned by @jrbourbeau above.
Should this be wrapped in pandas-decimal, or has progress been made elsewhere? Or maybe some other solution has arisen?
Since most things should be possible with the arrow extension type alone, there has been no further development or attention here. I daresay I can archive this. The only advantage I see (aside from the technical differences) is the ability to have exact decimals in pandas without requiring arrow. That ship has probably sailed.
I daresay I can archive this.
Agree - but - the PyPI page is on the first page of search results for me for "pandas decimal", and I've not found any good write-up of how to "do Decimal in pd DataFrames" - do you have any tips / resources? Maybe they could be linked from the README? 🙏🏻
The only advantage I see (aside from the technical differences) is the ability to have exact decimals in pandas without requiring arrow. That ship has probably sailed.
Agree the ship has sailed - that's a pretty niche use case. Functions like pd.read_gbq() require pandas-gbq, which installs pyarrow, so it's already in the Python environment for many.
Agree - but - the PyPI page is on the first page of search results for me for "pandas decimal", and I've not found any good write-up of how to "do Decimal in pd DataFrames" - do you have any tips / resources? Maybe they could be linked from the README?
This is a good point. Maybe I should write up a blog demonstrating decimals in pandas with pyarrow. I can publish this on Anaconda and link in the README as you suggest, but it will take me a little time to get to.
A blog with some more guidance about decimals with PyArrow would be great.
Meanwhile I've been digging into this a little more at work to see if we should use decimal128 for a new part of our pipeline. I agree that decimal128 does provide vectorization and fixed-point arithmetic - but there are two features that seem to be missing that I've not found a fix for.
In [1]: import decimal
...: import pandas as pd
In [2]: import pyarrow as pa
In [3]: pa_dtype = pa.decimal128(precision=7, scale=3)
In [4]: data = pa.array(
...: [
...: decimal.Decimal("8093.012"),
...: decimal.Decimal("8094.123"),
...: decimal.Decimal("8095.234"),
...: decimal.Decimal("8096.345"),
...: decimal.Decimal("8097.456"),
...: decimal.Decimal("8098.567"),
...: ],
...: type=pa_dtype,
...: )
...: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))
In [5]: df.describe()
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
...
ArrowTypeError: int or Decimal object expected, got numpy.int64
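A possible workaround (just a sketch on my part, not something I've tested beyond a quick local check): cast the column to a pyarrow-backed float only for the summary statistics, accepting the loss of exactness:
import pandas as pd
import pyarrow as pa
# Using the df defined above; the cast is only for describe(), the stored column stays decimal128.
df["x"].astype(pd.ArrowDtype(pa.float64())).describe()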
Continuing with the df above - adding 1 to column 'x' is fine:
In [7]: df['x'] + 1
Out[7]:
0 8094.012
1 8095.123
2 8096.234
3 8097.345
4 8098.456
5 8099.567
Name: x, dtype: decimal128(23, 3)[pyarrow]
However, if we change the precision and scale to use more bits, things get hairy:
In [13]: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa.decimal128(precision=38, scale=3)))
In [14]: df['x'] + 1
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
...
ArrowInvalid: Decimal precision out of range [1, 38]: 39
We know from above, where df['x'] + 1 worked and returned values in decimal128(23, 3)[pyarrow], that the results of adding 1 will fit in decimal128(precision=38, scale=3) (right? correct me if I'm wrong 🙏🏻). But there appears to be a problem where the precision/scale of the result type is not getting managed correctly.
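For what it's worth, one workaround I sketched locally (only lightly tested, so treat it as an assumption) is to widen the column to decimal256 before the arithmetic, so the promoted precision of the result still fits:
import pandas as pd
import pyarrow as pa
# decimal256 allows precision up to 76, so promoting past 38 digits on the +1 still fits.
wide = df["x"].astype(pd.ArrowDtype(pa.decimal256(precision=39, scale=3)))
print(wide + 1)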
My guess is that there are workarounds for the issues above - plus possibly even fixes upstream in pandas. However, this extra complexity for vectorization and a smaller memory footprint might not be worth it at this time for us. My plan is to push ahead with decimal.Decimal in the DataFrame and see what walls we collide with 🙏🏻.
I must admit that I had not tried to do too much with arrow decimals. Your notes here are pretty disappointing! If you use dtype="decimal[3]" from this package, describe() at least does work. Also, we are significantly faster:
In [23]: df = pd.DataFrame({"a": [0, 1] * 100000}, dtype=pd.ArrowDtype(pa.decimal128(precision=7, scale=3)))
In [26]: %timeit df.a.std()
1.28 ms ± 7.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [27]: df = pd.DataFrame({"a": [0, 1] * 100000}, dtype="decimal[3]")
In [28]: %timeit df.a.std()
340 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
As for precision, you would maybe expect to be bounded by integers up to 2**63 (~19 decimal digits for numbers near 1). I don't know why arrow has both precision and scale...
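For anyone reading along, my rough mental model of those two knobs (worth double-checking against the arrow spec): precision is the total number of significant decimal digits and scale is how many of them sit after the point, so decimal128(7, 3) covers -9999.999 through 9999.999 in steps of 0.001.
import decimal
import pyarrow as pa
t = pa.decimal128(precision=7, scale=3)
pa.array([decimal.Decimal("9999.999")], type=t)    # fits: 7 digits total, 3 after the point
# pa.array([decimal.Decimal("10000.000")], type=t) # would need 8 digits and should error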
In conclusion, decimal.Decimal may solve your problem, but it is orders of magnitude slower:
In [35]: df["a"] = df.a.map(lambda x: decimal.Decimal(str(x)))
In [39]: %timeit df.a.mean() # this is MEAN because std fails
12.9 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Regarding pyarrow being required by pandas, I found this just now: https://github.com/pandas-dev/pandas/issues/52509
This repo is currently a nice little proof-of-concept.
To be a viable package in the pydata realm, it should have
Should we poke pandas directly to consider upstreaming?