hgrecco / pint-pandas

Pandas support for pint
Other
166 stars 40 forks source link

Values dtype #192

Open andrewgsavage opened 1 year ago

andrewgsavage commented 1 year ago

Two PintArrays can have the same dtype, eg 'pint[m]' but have data stored with different dtypes, eg ints and floats. This used to cause issues when np.array was used as the values since you couldn't store nan in the int np.array, but has mostly been fixed since defaulting to using PandasArrays for values. I think there are still minor issues relating to converting and infering the dtype for values.

Should the dtype contain the values dtype, eg 'pint[m][int]' ?

MichaelTiemannOSC commented 1 year ago

A very timely question. I'm running into this exact problem with float vs. UFloat arrays (uncertainties), specifically in the underlying test case from pandas/tests/extension/base/constructors.py and dtype='pint[meter]':

    def test_series_constructor_no_data_with_index(self, dtype, na_value):
        result = pd.Series(index=[1, 2, 3], dtype=dtype)
        expected = pd.Series([na_value] * 3, index=[1, 2, 3], dtype=dtype)
        self.assert_series_equal(result, expected)

Pandas wants to construct the Series by filling in the data parameter with the na_value_for_dtype, calling up to PintType to get the na_value. pint[meter] could have an na_value of np.nan if the underlying ndarray is float, but it needs to be _ufloat_nan if the underlying array containsUFloat elements. I made quite a lot of progress by choosing instead pd.NA, which works for both, and well enough that it passed all my application's testsuite. However, in this particular test case, it fails due to the fact that pd.NA is not something familiar to the NumPy world (why is it building a Numpy array rather than a PandasArray?). It feels totally wrong to import pd.NA into pint/facets/numpy/quantity.py, so that means the na_value we choose has to behave in a numpy-ish way (and give the right answer to self._magnitude.ndim without special hooks).

One of the big promises of EAs is consistent pd.NA handling. I wonder whether it's appropriate to broach that with pint's NumPy facet, or if there's a better way to get Pandas to stay in the PandasArray realm and not touch the Numpy facet at all. The codepath that gets us there is the test is_list_like(data) in pandas/core/series.py/Series.init which tests the ndim attribute. That calls pint/facets/plain/quantity.py PlainQuantity.ndim, which delegates to pint/facets/numpy/quantity.py NumpyQuantity.ndim when it looks for self.magnitude.ndim. I created a special handler for ndim in NumptyQuantity...perhaps I should have done so in PlainQuantity...

MichaelTiemannOSC commented 1 year ago

So I created the special case for PlainQuantity and fixed a few erroneous np.nan vs. pd.NA problems in the test cases (following the patterns of pandas/tests/extension/test_integer.py and pandas/tests/extension/test_floating.py, which use pd.NA instead of np.nan for missing data), and lots more test cases are working. But...

pandas/core/internals/managers.py has this code in fast_xs:

        dtype = interleaved_dtype([blk.dtype for blk in self.blocks])

        n = len(self)

        # GH#46406                                                                                                                                                                                                                                                                                                      
        immutable_ea = isinstance(dtype, SparseDtype)

        if isinstance(dtype, ExtensionDtype) and not immutable_ea:
            cls = dtype.construct_array_type()
            result = cls._empty((n,), dtype=dtype)

And in this case it creates an empty array by way of PintArray._from_sequence using an empty list. With no visible master_scalar, this defaults to creating an array for floats, not UFloats. This is where a dtype with a magnitude type would come in handy...

andrewgsavage commented 1 year ago

so to set this up in pint-pandas, of the top of my head it would need:

I think a module flag like 'cache_values_dtype' could be used so those that don't need this don't see the dtype as it makes using .astype quite wordy

MichaelTiemannOSC commented 1 year ago

Just to mention I've got things back to this point:

====================================================================================================================================================================== short test summary info ======================================================================================================================================================================
FAILED test_pandas_extensiontests.py::TestMethods::test_factorize[float] - AssertionError: numpy array are different
FAILED test_pandas_extensiontests.py::TestMethods::test_factorize[int] - AssertionError: numpy array are different
FAILED test_pandas_extensiontests.py::TestMethods::test_factorize[complex128] - AssertionError: numpy array are different
FAILED test_pandas_extensiontests.py::TestMethods::test_factorize_empty[float] - AssertionError: numpy array are different
FAILED test_pandas_extensiontests.py::TestMethods::test_factorize_empty[int] - AssertionError: numpy array are different
FAILED test_pandas_extensiontests.py::TestMethods::test_factorize_empty[complex128] - AssertionError: numpy array are different
FAILED test_pandas_extensiontests.py::TestMethods::test_diff[float-1] - TypeError: unsupported operand type(s) for -: 'AffineScalarFunc' and 'NAType'
FAILED test_pandas_extensiontests.py::TestMethods::test_diff[float--2] - TypeError: unsupported operand type(s) for -: 'AffineScalarFunc' and 'NAType'
FAILED test_pandas_extensiontests.py::TestMethods::test_diff[int-1] - TypeError: unsupported operand type(s) for -: 'AffineScalarFunc' and 'NAType'
FAILED test_pandas_extensiontests.py::TestMethods::test_diff[int--2] - TypeError: unsupported operand type(s) for -: 'AffineScalarFunc' and 'NAType'
FAILED test_pandas_extensiontests.py::TestMethods::test_diff[complex128-1] - TypeError: unsupported operand type(s) for -: 'AffineScalarFunc' and 'NAType'
FAILED test_pandas_extensiontests.py::TestMethods::test_diff[complex128--2] - TypeError: unsupported operand type(s) for -: 'AffineScalarFunc' and 'NAType'
FAILED test_pandas_extensiontests.py::TestArithmeticOps::test_divmod_series_array[float] - TypeError: ufunc 'divmod' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
FAILED test_pandas_extensiontests.py::TestArithmeticOps::test_divmod_series_array[int] - TypeError: ufunc 'divmod' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
FAILED test_pandas_extensiontests.py::TestArithmeticOps::test_divmod[float] - TypeError: ufunc 'divmod' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
FAILED test_pandas_extensiontests.py::TestArithmeticOps::test_divmod[int] - TypeError: ufunc 'divmod' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
FAILED test_pandas_extensiontests.py::TestReshaping::test_merge_on_extension_array_duplicates[float] - AssertionError: DataFrame are different
FAILED test_pandas_extensiontests.py::TestReshaping::test_merge_on_extension_array_duplicates[int] - AssertionError: DataFrame are different
FAILED test_pandas_extensiontests.py::TestReshaping::test_merge_on_extension_array_duplicates[complex128] - AssertionError: DataFrame are different
FAILED test_pandas_extensiontests.py::TestSetitem::test_setitem_invalid[float] - Failed: DID NOT RAISE (<class 'ValueError'>, <class 'TypeError'>)
FAILED test_pandas_extensiontests.py::TestSetitem::test_setitem_invalid[int] - Failed: DID NOT RAISE (<class 'ValueError'>, <class 'TypeError'>)
FAILED test_pandas_extensiontests.py::TestSetitem::test_setitem_invalid[complex128] - Failed: DID NOT RAISE (<class 'ValueError'>, <class 'TypeError'>)
FAILED test_pandas_extensiontests.py::TestSetitem::test_setitem_2d_values[float] - TypeError: can't convert an affine function (<class 'uncertainties.core.AffineScalarFunc'>) to float; use x.nominal_value
FAILED test_pandas_extensiontests.py::TestSetitem::test_setitem_2d_values[int] - TypeError: can't convert an affine function (<class 'uncertainties.core.AffineScalarFunc'>) to float; use x.nominal_value
FAILED test_pandas_extensiontests.py::TestSetitem::test_setitem_2d_values[complex128] - TypeError: can't convert an affine function (<class 'uncertainties.core.AffineScalarFunc'>) to float; use x.nominal_value
FAILED test_pandas_extensiontests.py::TestSetitem::test_setitem_scalar_key_sequence_raise[float] - Failed: DID NOT RAISE <class 'ValueError'>
FAILED test_pandas_extensiontests.py::TestSetitem::test_setitem_scalar_key_sequence_raise[int] - Failed: DID NOT RAISE <class 'ValueError'>
=============================================================================================================================================== 27 failed, 1281 passed, 1 skipped, 45 xfailed, 494 warnings in 26.75s ===============================================================================================================================================

and my own test cases are passing. I'll update my PRs shortly for review.

MichaelTiemannOSC commented 1 year ago

Commits to pint and pint-pandas are pushed. For the record, the small adjustments I've also made to pandas:

(itr_env) iMac-Pro:pandas-dev michael$ git diff --minimal
diff --git a/pandas/core/arrays/masked.py b/pandas/core/arrays/masked.py
index 15c485cbb1..15f80b5501 100644
--- a/pandas/core/arrays/masked.py
+++ b/pandas/core/arrays/masked.py
@@ -870,7 +870,8 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):
         # we only fill where the indexer is null
         # not existing missing values
         # TODO(jreback) what if we have a non-na float as a fill value?
-        if allow_fill and notna(fill_value):
+        # NaN with uncertainties is scalar but does not register as `isna`, so use fact that NaN != NaN
+        if allow_fill and notna(fill_value) and fill_value==fill_value:
             fill_mask = np.asarray(indexer) == -1
             result[fill_mask] = fill_value
             mask = mask ^ fill_mask
diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index 1a17fef071..98e9c53c37 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -3080,7 +3080,10 @@ class GroupBy(BaseGroupBy[NDFrameT]):
                 """Helper function for first item that isn't NA."""
                 arr = x.array[notna(x.array)]
                 if not len(arr):
-                    return np.nan
+                    nan_arr = x.array[isna(x.array)]
+                    if not len(nan_arr):
+                        return np.nan
+                    return nan_arr[0]
                 return arr[0]

             if isinstance(obj, DataFrame):
diff --git a/pandas/core/groupby/ops.py b/pandas/core/groupby/ops.py
index f0e4484f69..8b7f8e1aee 100644
--- a/pandas/core/groupby/ops.py
+++ b/pandas/core/groupby/ops.py
@@ -587,7 +587,7 @@ class BaseGrouper:

     def get_iterator(
         self, data: NDFrameT, axis: AxisInt = 0
-    ) -> Iterator[tuple[Hashable, NDFrameT]]:
+    ) -> Iterator[tuple[Hashable, NDFrameT]]:  # Doesn't work with non-hashable EA types
         """
         Groupby iterator

diff --git a/pandas/tests/extension/test_boolean.py b/pandas/tests/extension/test_boolean.py
index c9fa28a507..18341b2323 100644
--- a/pandas/tests/extension/test_boolean.py
+++ b/pandas/tests/extension/test_boolean.py
@@ -28,7 +28,7 @@ from pandas.tests.extension import base

 def make_data():
-    return [True, False] * 4 + [np.nan] + [True, False] * 44 + [np.nan] + [True, False]
+    return [True, False] * 4 + [pd.NA] + [True, False] * 44 + [pd.NA] + [True, False]

 @pytest.fixture
@@ -48,7 +48,7 @@ def data_for_twos(dtype):

 @pytest.fixture
 def data_missing(dtype):
-    return pd.array([np.nan, True], dtype=dtype)
+    return pd.array([pd.NA, True], dtype=dtype)

 @pytest.fixture
@@ -58,7 +58,7 @@ def data_for_sorting(dtype):

 @pytest.fixture
 def data_missing_for_sorting(dtype):
-    return pd.array([True, np.nan, False], dtype=dtype)
+    return pd.array([True, pd.NA, False], dtype=dtype)

 @pytest.fixture
@@ -76,7 +76,7 @@ def na_value():
 def data_for_grouping(dtype):
     b = True
     a = False
-    na = np.nan
+    na = pd.NA
     return pd.array([b, b, na, na, a, a, b], dtype=dtype)

@@ -147,7 +147,7 @@ class TestArithmeticOps(base.BaseArithmeticOpsTests):
                 expected = expected.astype("Float64")
             if op_name == "__rpow__":
                 # for rpow, combine does not propagate NaN
-                expected[result.isna()] = np.nan
+                expected[result.isna()] = pd.NA
             self.assert_equal(result, expected)
         else:
             with pytest.raises(exc):

I'll create a Pandas PR when there's enough chain on the gears from the Pint and Pint Pandas PRs.

MichaelTiemannOSC commented 1 year ago

Pandas PR was submitted last week: https://github.com/pandas-dev/pandas/pull/53970

MichaelTiemannOSC commented 11 months ago

Pandas 2.1 is going to be much better for Pint-Pandas. All my tests now work with no mods needed to Pandas, and I've reverted much of the complexity I originally thought I needed to handle uncertainties. The simplifications were possible because I re-tested an initial assumption about compatibility/incompatibility of np.nan and uncertain NaNs (such as ufloat(np.nan, 0)). The net result is that I could use np.nan as a NaN in a UFloat array (without needing to promote to UFloat) and I didn't need to use pd.NA as my NaN for UFloats. I created test cases for pint-pandas that use both np.nan and ufloat(np.nan, 0) as NaN values in the test cases, and they all work. (Complex numbers are not compatible with the uncertainties package, so I cannot test with complex NaNs.)

Here's a comment I pulled from the factorize code that gives a good entry into the considerations of multiple NA-types (np.nan, pd.NA, None, but also {} for json):

    # factorize can now handle differentiating various types of null values.                                                                                                                                                                         
    # These can only occur when the array has object dtype.                                                                                                                                                                       
    # However, for backwards compatibility we only use the null for the                                                                                                                                                                              
    # provided dtype. This may be revisited in the future, see GH#48476.                                                                                                                                                                             

From that I learned that the safe way to think about it is "you can do what you want, but you should know that when Pandas does what it wants, it's going to canonicalize NA values to the na_value of the EA dtype". So...pick your NA dtype carefully and be ready for it to show up. And from what I have seen, Pandas is now well-behaved as it manages the interplay between various types of arrays feeding PintArray (IntegerArray, FloatingArray which can both have pd.NA values, and ndarray, which can have np.nans) and the operation of PintArray functions.

I don't think Pint-Pandas needs to implement a second type in brackets.