Closed rwijtvliet closed 2 years ago
I don't see the data error you are reporting (running in a Jupyter Notebook based on python 3.8)...
s_units.groupby(level = 0, axis=0).mean()
# a 0.4149818817797141
# b 0.748910249975834
# dtype: pint[megawatt_hour]
My version of pint_pandas is 0.2. I'm also using pint 0.18.
Here's an example of how things are breaking for me. When the dataframe column has a proper pint type for the column, then everything's OK. When the column has units that come from calculations, even when the entire column is consistent with respect to the units it contains, the column is ignored by the groupby. I wonder if there's a renormalization idiom I just don't yet know.
import numpy as np
import pandas as pd
from pint_pandas import PintArray, PintType
from openscm_units import unit_registry
PintType.ureg = unit_registry
ureg = unit_registry
Q_ = ureg.Quantity
# PA_ = PintArray
ureg.define("CO2e = CO2 = CO2eq = CO2_eq")
s1_plain = pd.Series(np.random.rand(4), index=['Copmany A', 'Company B'] * 2)
s2_plain = pd.Series(np.random.rand(4), index=['Copmany A', 'Company B'] * 2)
s1_units = s1_plain.astype('pint[MWh]')
s2_units = s2_plain.astype('pint[t CO2]')
s1_units.groupby(level = 0, axis=0).mean()
s2_units.groupby(level = 0, axis=0).mean()
d_plain = pd.concat([s1_plain, s2_plain], axis=1).reset_index().rename(columns={'index':'Company', 0:'Generation', 1:'Emissions'})
display(d_plain)
d_units = pd.concat([s1_units, s2_units], axis=1).reset_index().rename(columns={'index':'Company', 0:'Generation', 1:'Emissions'})
display(d_units)
display(d_units['Generation'])
display(d_units['Emissions'])
d_units['Intensity'] = np.nan
d_units.loc[[0, 2], 'Intensity'] = d_units.Emissions / d_units.Generation
d_units.loc[[1, 3], 'Intensity'] = d_units.Emissions / d_units.Generation
display(d_units)
display(d_units.Intensity)
display(d_units.groupby('Company').mean()) # This will drop Intensity from the columns
d_units.Intensity = d_units.Intensity.astype('pint[t CO2/MWh]')
display(d_units.groupby('Company').mean()) # Intensity mean is calculated and displayed
What version of pandas are you using? There was a fix in a fairly recent pandas version which allows PintArrays to be recongnised as numeric. I suspect this is why it works for MichaelTiemannOSC.
Nothing about the index is stored in the PintArray so again I think this may be a pandas version thing.
@MichaelTiemannOSC, in these steps you create an Intensity column but it is not a PintArray - the dtype is object, it needs to be pint[...]. You're initialising it with np.nan, so it uses a float64 dtype, then adding pint quantities, so it changes to object dtype.
d_units['Intensity'] = np.nan d_units.loc[[0, 2], 'Intensity'] = d_units.Emissions / d_units.Generation d_units.loc[[1, 3], 'Intensity'] = d_units.Emissions / d_units.Generation
You're using the renormalization idiom I'd use
d_units.Intensity.astype('pint[t CO2/MWh]')
If you were to do this:
d_units['Intensity'] = d_units.Emissions / d_units.Generation
then d_units.Emissions / d_units.Generation
returns a Series with a pint dtype, so behaves as expected.
closing this as it was fixed a few versions of pandas ago
Issue
As the title says. Here is an example:
I have been looking for work-arounds which I share here, maybe it's a good starting point for a bugfix:
Workaround part 1
Using
np.mean
instead. Not ideal, because returns Series of quantities, i.e. Series without pint dtype.Workaround part 2
Re-apply the
.astype
method to force the unit. Also not ideal, as the index is now lost.To summarize: the issues are:
.mean
method won't work on aSeriesGroupbyObject
;.astype
on a Series of Quantities seems to lose the index... though that seems a separate (and more urgent) issue.I'm using
pint
version0.18
.