`groupby` won't work in `pandas` objects with `pint` datatype

rwijtvliet commented 2 years ago

Issue

As the title says. Here is an example:

import pandas as pd
import numpy as np
import pint
import pint_pandas

# For reference: groupby on Series without units.
# -----------------------------------------------

# Input:
s_plain = pd.Series(np.random.rand(4), index=['a', 'b'] * 2)
# a    0.032244
# b    0.073113
# a    0.953453
# b    0.821208
# dtype: float64

# Output:
s_plain.groupby(level=0, axis=0).mean()
# a    0.492848
# b    0.447161
# dtype: float64

# Issue: groupby on Series with pint dtype.
# -----------------------------------------

# Input:
s_units = s_plain.astype('pint[MWh]')
# a    0.032243727749885376
# b     0.07311299213301359
# a      0.9534530925540614
# b      0.8212084539670164
# dtype: pint[megawatt_hour]

# Won't work:
s_units.groupby(level = 0, axis=0).mean() # DataError: No numeric types to aggregate

I have been looking for work-arounds which I share here, maybe it's a good starting point for a bugfix:

Workaround part 1

Using np.mean instead. Not ideal, because returns Series of quantities, i.e. Series without pint dtype.

s_units.groupby(level= 0, axis=0).apply(np.mean) # using np.mean
# a    0.4928484101519734 megawatt_hour
# b     0.447160723050015 megawatt_hour
# dtype: object

Workaround part 2

Re-apply the .astype method to force the unit. Also not ideal, as the index is now lost.

s_units.groupby(level= 0, axis=0).apply(np.mean).astype('pint[MWh]')
s_units.groupby(level= 0, axis=0).apply(np.mean).astype(f'pint[{s_units.pint.units}]') # alternative, agnostic about units
# 0    0.4928484101519734
# 1     0.447160723050015
# dtype: pint[megawatt_hour]

To summarize: the issues are:

Normal .mean method won't work on a SeriesGroupbyObject;
Reapplying .astype on a Series of Quantities seems to lose the index... though that seems a separate (and more urgent) issue.

I'm using pint version 0.18.

MichaelTiemannOSC commented 2 years ago

I don't see the data error you are reporting (running in a Jupyter Notebook based on python 3.8)...

s_units.groupby(level = 0, axis=0).mean()
# a    0.4149818817797141
# b     0.748910249975834
# dtype: pint[megawatt_hour]

My version of pint_pandas is 0.2. I'm also using pint 0.18.

MichaelTiemannOSC commented 2 years ago

Here's an example of how things are breaking for me. When the dataframe column has a proper pint type for the column, then everything's OK. When the column has units that come from calculations, even when the entire column is consistent with respect to the units it contains, the column is ignored by the groupby. I wonder if there's a renormalization idiom I just don't yet know.

import numpy as np
import pandas as pd

from pint_pandas import PintArray, PintType
from openscm_units import unit_registry
PintType.ureg = unit_registry
ureg = unit_registry
Q_ = ureg.Quantity
# PA_ = PintArray

ureg.define("CO2e = CO2 = CO2eq = CO2_eq")

s1_plain = pd.Series(np.random.rand(4), index=['Copmany A', 'Company B'] * 2)
s2_plain = pd.Series(np.random.rand(4), index=['Copmany A', 'Company B'] * 2)

s1_units = s1_plain.astype('pint[MWh]')
s2_units = s2_plain.astype('pint[t CO2]')

s1_units.groupby(level = 0, axis=0).mean()
s2_units.groupby(level = 0, axis=0).mean()

d_plain = pd.concat([s1_plain, s2_plain], axis=1).reset_index().rename(columns={'index':'Company', 0:'Generation', 1:'Emissions'})
display(d_plain)

d_units = pd.concat([s1_units, s2_units], axis=1).reset_index().rename(columns={'index':'Company', 0:'Generation', 1:'Emissions'})
display(d_units)

display(d_units['Generation'])
display(d_units['Emissions'])

d_units['Intensity'] = np.nan
d_units.loc[[0, 2], 'Intensity'] = d_units.Emissions / d_units.Generation
d_units.loc[[1, 3], 'Intensity'] = d_units.Emissions / d_units.Generation
display(d_units)

display(d_units.Intensity)

display(d_units.groupby('Company').mean())    # This will drop Intensity from the columns

d_units.Intensity = d_units.Intensity.astype('pint[t CO2/MWh]')

display(d_units.groupby('Company').mean())    # Intensity mean is calculated and displayed

andrewgsavage commented 2 years ago

Workaround part 1

What version of pandas are you using? There was a fix in a fairly recent pandas version which allows PintArrays to be recongnised as numeric. I suspect this is why it works for MichaelTiemannOSC.

Workaround part 2

Nothing about the index is stored in the PintArray so again I think this may be a pandas version thing.

andrewgsavage commented 2 years ago

@MichaelTiemannOSC, in these steps you create an Intensity column but it is not a PintArray - the dtype is object, it needs to be pint[...]. You're initialising it with np.nan, so it uses a float64 dtype, then adding pint quantities, so it changes to object dtype. d_units['Intensity'] = np.nan d_units.loc[[0, 2], 'Intensity'] = d_units.Emissions / d_units.Generation d_units.loc[[1, 3], 'Intensity'] = d_units.Emissions / d_units.Generation

You're using the renormalization idiom I'd use d_units.Intensity.astype('pint[t CO2/MWh]')

If you were to do this: d_units['Intensity'] = d_units.Emissions / d_units.Generation then d_units.Emissions / d_units.Generation returns a Series with a pint dtype, so behaves as expected.

andrewgsavage commented 2 years ago

closing this as it was fixed a few versions of pandas ago

hgrecco / pint-pandas