hgrecco / pint-pandas

Pandas support for pint

DataFrame reduction methods broken #174

Closed by coroa 11 months ago

coroa commented 1 year ago

With the versions:

In [20]: pint_pandas.show_versions()
{'numpy': '1.24.2',
 'pandas': '1.5.3',
 'pint': '0.20.1',
 'pint_pandas': '0.4.dev32+gc58a7fc'}

I am unable to run any dataframe-wise aggregation functions along axis=0:

In [19]: pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]").sum()
/Users/coroa/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/internals/blocks.py:369: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  res_values = np.array([[result]])
---------------------------------------------------------------------------
DimensionalityError                       Traceback (most recent call last)
File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pint/facets/plain/quantity.py:702, in PlainQuantity.__int__(self)
    701     return int(self._convert_magnitude_not_inplace(UnitsContainer()))
--> 702 raise DimensionalityError(self._units, "dimensionless")

DimensionalityError: Cannot convert from 'meter' to 'dimensionless'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]").sum()

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/generic.py:11797, in NDFrame._add_numeric_operations.<locals>.sum(self, axis, skipna, level, numeric_only, min_count, **kwargs)
  11777 @doc(
  11778     _num_doc,
  11779     desc="Return the sum of the values over the requested axis.\n\n"
   (...)
  11795     **kwargs,
  11796 ):
> 11797     return NDFrame.sum(
  11798         self, axis, skipna, level, numeric_only, min_count, **kwargs
  11799     )

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/generic.py:11501, in NDFrame.sum(self, axis, skipna, level, numeric_only, min_count, **kwargs)
  11492 def sum(
  11493     self,
  11494     axis: Axis | None = None,
   (...)
  11499     **kwargs,
  11500 ):
> 11501     return self._min_count_stat_function(
  11502         "sum", nanops.nansum, axis, skipna, level, numeric_only, min_count, **kwargs
  11503     )

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/generic.py:11483, in NDFrame._min_count_stat_function(self, name, func, axis, skipna, level, numeric_only, min_count, **kwargs)
  11467     warnings.warn(
  11468         "Using the level keyword in DataFrame and Series aggregations is "
  11469         "deprecated and will be removed in a future version. Use groupby "
   (...)
  11472         stacklevel=find_stack_level(),
  11473     )
  11474     return self._agg_by_level(
  11475         name,
  11476         axis=axis,
   (...)
  11480         numeric_only=numeric_only,
  11481     )
> 11483 return self._reduce(
  11484     func,
  11485     name=name,
  11486     axis=axis,
  11487     skipna=skipna,
  11488     numeric_only=numeric_only,
  11489     min_count=min_count,
  11490 )

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/frame.py:10856, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  10852 ignore_failures = numeric_only is None
  10854 # After possibly _get_data and transposing, we are now in the
  10855 #  simple case where we can use BlockManager.reduce
> 10856 res, _ = df._mgr.reduce(blk_func, ignore_failures=ignore_failures)
  10857 out = df._constructor(res).iloc[0]
  10858 if out_dtype is not None:

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/internals/managers.py:1569, in BlockManager.reduce(self, func, ignore_failures)
   1567 res_blocks: list[Block] = []
   1568 for blk in self.blocks:
-> 1569     nbs = blk.reduce(func, ignore_failures)
   1570     res_blocks.extend(nbs)
   1572 index = Index([None])  # placeholder

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/internals/blocks.py:369, in Block.reduce(self, func, ignore_failures)
    365     raise
    367 if self.values.ndim == 1:
    368     # TODO(EA2D): special case not needed with 2D EAs
--> 369     res_values = np.array([[result]])
    370 else:
    371     res_values = result.reshape(-1, 1)

ValueError: setting an array element with a sequence.

Similarly, df.min(), df.max() and so on are failing.

This might be connected with: https://github.com/hgrecco/pint/issues/1128#issuecomment-979432968

andrewgsavage commented 1 year ago

Can this be closed now? Or we could leave it open until df.min(), df.max(), etc. are tested.

coroa commented 1 year ago

No, this is still broken under pandas~=1.5, as well as pandas~=2.0. I don't think anything short of a change in the __array__ or __array_function__ implementation of pint would help here.

andrewgsavage commented 1 year ago

Which versions was this last working with? I am unsure whether it was a pandas change or a pint change that caused this. I'm leaning towards a pint change, as I don't see anything obvious in pandas that would cause it. We could also try implementing __array_function__ on PintArray.

In Block.reduce in pandas 1.4.x there's a try/except clause (https://github.com/pandas-dev/pandas/blob/1.4.x/pandas/core/internals/blocks.py#L413), which is also in 1.5.x (https://github.com/pandas-dev/pandas/blob/1.5.x/pandas/core/internals/blocks.py#L360) but not in 2.0 (https://github.com/pandas-dev/pandas/blob/2.0.x/pandas/core/internals/blocks.py#L340).
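For reference, a minimal sketch of NumPy's __array_function__ protocol (NEP 18) mentioned above. The Tagged class and its unit handling are purely illustrative stand-ins, not pint-pandas API:

```python
import numpy as np

# Purely illustrative stand-in for a unit-carrying array; NOT the real
# PintArray. Defining __array_function__ lets np.sum dispatch to this
# class instead of coercing it to a plain ndarray (NumPy >= 1.17).
class Tagged:
    def __init__(self, values, unit):
        self.values = np.asarray(values)
        self.unit = unit

    def __array_function__(self, func, types, args, kwargs):
        if func is np.sum:
            # reduce the magnitudes, keep the unit tag
            return Tagged(np.sum(self.values, **kwargs), self.unit)
        return NotImplemented

t = Tagged([0.0, 1.0, 2.0], "m")
out = np.sum(t)  # dispatches to Tagged.__array_function__
print(float(out.values), out.unit)
```

Implementing the protocol on PintArray itself would let NumPy reductions return unit-aware results rather than stripped magnitudes, which is the direction suggested above.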

coroa commented 1 year ago

Note that it is failing outside the try-except block. result is produced successfully (as a scalar Quantity), but wrapping it in np.array([[result]]) then raises at https://github.com/pandas-dev/pandas/blob/778ab8234059059d77eb5d71442c959f8d15c1e0/pandas/core/internals/blocks.py#L369. I do not know how or if this worked previously.

andrewgsavage commented 1 year ago

I tried several versions; none of them worked.

# {'numpy': '1.24.3', 'pandas': '1.4.0', 'pint': '0.21', 'pint_pandas': '0.4'} fails
# {'numpy': '1.24.3', 'pandas': '1.3.0', 'pint': '0.21', 'pint_pandas': '0.4'} fails
# {'numpy': '1.24.3', 'pandas': '1.4.0', 'pint': '0.19', 'pint_pandas': '0.4'} fails
# {'numpy': '1.24.3', 'pandas': '1.4.0', 'pint': '0.19', 'pint_pandas': '0.3'} fails

I note that running pd.DataFrame([[0, 1, 2]]).astype("pint[m]").sum() works.

andrewgsavage commented 1 year ago

Looks like pandas is working on this already and should be in the next version! https://github.com/pandas-dev/pandas/pull/52788

MichaelTiemannOSC commented 1 year ago

On it... this has one extraneous change (preserve dtype when converting NA to an array of Quantities), but I offer these as essentially the minimal changes needed to work with the upcoming Pandas 2.1. (This won't work with any Pandas < 2.1.)

I will be bundling this with a PR that's already slated to land in Pint 0.23 (and other PRs in Pandas, Pint-Pandas, and uncertainties). But if you want to apply and test it by itself...

(itr_env) iMac-Pro:pint-pandas michael$ git diff
diff --git a/pint_pandas/pint_array.py b/pint_pandas/pint_array.py
index a445b77..4605704 100644
--- a/pint_pandas/pint_array.py
+++ b/pint_pandas/pint_array.py
@@ -208,8 +208,8 @@ dtypemap = {
     float: pd.Float64Dtype(),
     np.float64: pd.Float64Dtype(),
     np.float32: pd.Float32Dtype(),
-    np.complex128: pd.core.dtypes.dtypes.PandasDtype("complex128"),
-    np.complex64: pd.core.dtypes.dtypes.PandasDtype("complex64"),
+    np.complex128: pd.core.dtypes.dtypes.NumpyEADtype("complex128"),
+    np.complex64: pd.core.dtypes.dtypes.NumpyEADtype("complex64"),
     # np.float16: pd.Float16Dtype(),
 }
 dtypeunmap = {v: k for k, v in dtypemap.items()}
@@ -520,7 +520,10 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
                 # magnitude is in fact an array scalar, which will get rejected by pandas.
                 fill_value = fill_value[()]

-        result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            # Turn off warning that PandasArray is deprecated for ``take``
+            result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)

         return PintArray(result, dtype=self.dtype)

@@ -990,7 +993,7 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
         qtys = [
             self._Q(item, self._dtype.units)
             if item is not self.dtype.na_value.m
-            else item
+            else self.dtype.na_value
             for item in self._data
         ]
         with warnings.catch_warnings(record=True):
@@ -1048,7 +1051,7 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
             value = [item.to(self.units).magnitude for item in value]
         return arr.searchsorted(value, side=side, sorter=sorter)

-    def _reduce(self, name, **kwds):
+    def _reduce(self, name, *, skipna: bool = True, keepdims: bool = False, **kwds):
         """
         Return a scalar result of performing the reduction operation.

@@ -1092,14 +1095,18 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):

         if isinstance(self._data, ExtensionArray):
             try:
-                result = self._data._reduce(name, **kwds)
+                result = self._data._reduce(name, skipna=skipna, keepdims=keepdims, **kwds)
             except NotImplementedError:
                 result = functions[name](self.numpy_data, **kwds)

         if name in {"all", "any", "kurt", "skew"}:
             return result
         if name == "var":
+            if keepdims:
+                return PintArray(result, f"pint[({self.units})**2]")
             return self._Q(result, self.units**2)
+        if keepdims:
+            return PintArray(result, self.dtype)
         return self._Q(result, self.units)

     def _accumulate(self, name: str, *, skipna: bool = True, **kwds):
diff --git a/pint_pandas/testsuite/test_issues.py b/pint_pandas/testsuite/test_issues.py
index c288e71..384f84f 100644
--- a/pint_pandas/testsuite/test_issues.py
+++ b/pint_pandas/testsuite/test_issues.py
@@ -202,3 +202,16 @@ def test_issue_139():
     assert np.all(a_m[0:4] == a_cm[0:4])
     for x, y in zip(a_m[4:], a_cm[4:]):
         assert unp.isnan(x) == unp.isnan(y)
+
+class TestIssue174(BaseExtensionTests):
+    def test_sum(self):
+        a = pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]")
+        row_sum = a.sum(axis=0)
+        expected_1 = pd.Series([3, 5, 7], dtype="pint[m]")
+
+        self.assert_series_equal(row_sum, expected_1)
+
+        col_sum = a.sum(axis=1)
+        expected_2 = pd.Series([3, 12], dtype="pint[m]")
+
+        self.assert_series_equal(col_sum, expected_2)
diff --git a/pint_pandas/testsuite/test_pandas_extensiontests.py b/pint_pandas/testsuite/test_pandas_extensiontests.py
index 7c92751..a638c5b 100644
--- a/pint_pandas/testsuite/test_pandas_extensiontests.py
+++ b/pint_pandas/testsuite/test_pandas_extensiontests.py
@@ -523,7 +523,7 @@ class TestArithmeticOps(base.BaseArithmeticOpsTests):
                 divmod(s, other)

     def _get_exception(self, data, op_name):
-        if data.data.dtype == pd.core.dtypes.dtypes.PandasDtype("complex128"):
+        if data.data.dtype == pd.core.dtypes.dtypes.NumpyEADtype("complex128"):
             if op_name in ["__floordiv__", "__rfloordiv__", "__mod__", "__rmod__"]:
                 return op_name, TypeError
         if op_name in ["__pow__", "__rpow__"]:
@@ -627,6 +627,10 @@ class TestNumericReduce(base.BaseNumericReduceTests):
             expected = expected_m
         assert result == expected

+    @pytest.mark.skip("tests not written yet")
+    def check_reduce_frame(self, ser: pd.Series, op_name: str, skipna: bool):
+        pass
+
     @pytest.mark.parametrize("skipna", [True, False])
     def test_reduce_scaling(
         self, data, all_numeric_reductions, skipna, USE_UNCERTAINTIES
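To make the keepdims contract in the _reduce hunk above concrete, here is a toy sketch. ToyUnitArray is a hypothetical stand-in, not pint-pandas API: with keepdims=True (which pandas 2.1 passes for DataFrame-level reductions) a length-1 array comes back, so the extension dtype, here modelled as a unit string, survives; with keepdims=False a plain scalar comes back, as before.

```python
# Toy model (not the real PintArray) of the keepdims contract added in
# the diff above: pandas 2.1 calls _reduce(..., keepdims=True) for
# DataFrame reductions and expects a length-1 array, not a scalar.
class ToyUnitArray:
    def __init__(self, values, unit):
        self.values = list(values)
        self.unit = unit

    def _reduce(self, name, *, skipna=True, keepdims=False):
        if name != "sum":  # only "sum" in this sketch
            raise NotImplementedError(name)
        result = sum(self.values)
        if keepdims:
            # length-1 array result: the unit (the "dtype") is preserved
            return ToyUnitArray([result], self.unit)
        # scalar result: the pre-2.1 behaviour
        return result

arr = ToyUnitArray([0, 3], "m")
print(arr._reduce("sum"))                        # scalar
print(arr._reduce("sum", keepdims=True).values)  # length-1 list
```

This is why the diff wraps the reduced value in PintArray(result, self.dtype) when keepdims is set, instead of returning a bare Quantity.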
topper-123 commented 12 months ago

Just a small note that https://github.com/pandas-dev/pandas/pull/52788 has been merged in pandas. I think that should help fix this issue in pint-pandas?

topper-123 commented 12 months ago

I agree this should be possible in v2.1 (not released yet), but pint_pandas still needs to implement a keepdims parameter on PintArray._reduce at your end to make it happen.

I could take a stab at a PR for this if I can get someone to review it. I assume the tests should be added to pint_pandas/testsuite/test_pandas_extensiontests.py, e.g. in a new TestReduction class?

MichaelTiemannOSC commented 12 months ago

Hey, that was me :smile:

PR#140 has the keepdims changes. You could split them out and merge them independently. The TestReductions class (note the 's' at the end) does need to be incorporated. It looks like the Pandas base test case makes up its own data, which is not as trivial as the data fixture, so a TestReductions class in Pint-Pandas would take some work and would be welcome!

andrewgsavage commented 12 months ago

Yea, go for it. It would be good to split Michael's changes out into separate PRs to make them easier to review.

I was holding out for an rc (which doesn't look like it's happening) before taking a look, as it takes ages for the CI to build pandas otherwise.

MichaelTiemannOSC commented 11 months ago

I've split out the changes here: https://github.com/hgrecco/pint-pandas/pull/196

MichaelTiemannOSC commented 11 months ago

Pandas 2.1.0rc0 is now available. Could be a good time to sync up with our own release candidates.

andrewgsavage commented 11 months ago

Closed by #195:

pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]").sum()
0    3.0
1    5.0
2    7.0
dtype: pint[meter]
coroa commented 11 months ago

Amazing, thanks @topper-123 !