hgrecco / pint-pandas

Pandas support for pint

groupby misses opportunity to create PintArray(s) #142

Closed MichaelTiemannOSC closed 1 year ago

MichaelTiemannOSC commented 1 year ago

I have observed that when I do a grouping operation on a column of quantities, the resulting pd.Series is about as pint-unfriendly as it could be. The purpose of this issue is to discuss whether it's possible (and if so, how) to create Series that use PintArrays for their data when possible.

MichaelTiemannOSC commented 1 year ago

There is an infer_objects() method in Pandas which attempts to promote object types to more specific types, but in the source it is marked @final. Perhaps we could add something like infer_PintArray(), which would convert a suitably homogeneous and unitized Series into a PintArray-backed Series if possible. It could similarly attempt to do the same either column-wise or row-wise for DataFrames. Thoughts?

andrewgsavage commented 1 year ago

can this be closed now?

MichaelTiemannOSC commented 1 year ago

Good question. I wrote these test cases illustrating two potential ways to decide whether or not this is something we want to fix:

class TestIssue142(BaseExtensionTests):
    @pytest.mark.xfail(run=True, reason="groupby does not coalesce PintTypes when it groups by pint Units")
    def test_Unit_groupby(self):
        index_m = pd.Index([0, 2, 4, 7])
        index_kg = pd.Index([1, 3, 5, 6, 8, 9])
        data_m = PintArray.from_1darray_quantity(index_m.values * ureg.m)
        data_kg = PintArray.from_1darray_quantity(index_kg.values * ureg.kg)
        results_expected = {ureg.m: data_m, ureg.kg: data_kg}
        df_m = pd.DataFrame({"ureg unit": ureg.m, "qty": data_m})
        df_kg = pd.DataFrame({"ureg unit": ureg.kg, "qty": data_kg})
        df_mixed = pd.concat([df_m, df_kg])
        df_list_grouped = list(df_mixed.groupby(by="ureg unit"))
        for ureg_unit, df in df_list_grouped:
            result = df.qty
            expected = results_expected[ureg_unit]
            tm.assert_series_equal(result, expected)

    @pytest.mark.xfail(run=True, reason="groupby does not coalesce PintTypes when it sees them")
    def test_str_groupby(self):
        index_m = pd.Index([0, 2, 4, 7])
        index_kg = pd.Index([1, 3, 5, 6, 8, 9])
        data_m = PintArray.from_1darray_quantity(index_m.values * ureg.m)
        data_kg = PintArray.from_1darray_quantity(index_kg.values * ureg.kg)
        results_expected = {str(ureg.m): data_m, str(ureg.kg): data_kg}
        df_m = pd.DataFrame({"ureg unit": str(ureg.m), "qty": data_m})
        df_kg = pd.DataFrame({"ureg unit": str(ureg.kg), "qty": data_kg})
        df_mixed = pd.concat([df_m, df_kg])
        df_list_grouped = list(df_mixed.groupby(by="ureg unit"))
        for ureg_unit, df in df_list_grouped:
            result = df.qty
            expected = results_expected[ureg_unit]
            tm.assert_series_equal(result, expected)

The first imagines that if groupby were to look into its grouping_vector it might find a friendly attribute that would help it understand generally how to use EA mechanics to get the right answer.

The second simplifies the groupby indexing machinery, but then requires that it look down into object dtypes to find EA dtypes and then be smart with those. It does something similar with date/time groupers (dtype.kind in "mM") and Categorical. It certainly could peep object dtypes and see if they have EA values all of the same type underneath.

MichaelTiemannOSC commented 1 year ago

See https://github.com/pandas-dev/pandas/pull/54543 for a prototype implementation that passes test_str_groupby.

andrewgsavage commented 1 year ago

See https://github.com/pandas-dev/pandas/pull/51166. It may be that pint-pandas should implement _groupby_op to return a PintArray.

andrewgsavage commented 1 year ago

groupby now does return PintArrays:

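import pandas as pd
import pint_pandas

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': pint_pandas.PintArray([380., 370., 24., 26.], "m")})
# Aggregate per animal, then inspect the dtype of the result
df_ = df.groupby(['Animal']).mean()
df_.dtypes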

Your example falls over at df_mixed, where it becomes object dtype, so it's the dtype-inferring issue again.