hgrecco / pint-pandas

Pandas support for pint
Other
169 stars 42 forks source link

Unexpected sum behavior for compatible units #130

Closed MichaelTiemannOSC closed 1 year ago

MichaelTiemannOSC commented 1 year ago

I've been chasing a bug that I expected pint to solve for me, not create for me. Here's the test case:

import pandas as pd
from pint import UnitRegistry, set_application_registry
from pint_pandas import PintArray, PintType

ureg = UnitRegistry()
set_application_registry(ureg)

Q_ = ureg.Quantity
PA_ = PintArray

print("creating our ugly dataframe")
xx = pd.DataFrame({'a': [Q_(1, 'km'), Q_(2, 'm'), Q_(3, 'mm')]})

print("our ugly dataframe")
print(f"xx = {xx}")

print("our ugly dataframe, summed (correctly)")
print(f"xx.a.sum() = {xx.a.sum()}")

print("our ugly dataframe, weighted (wrongly)")
print(f"xx.a / xx.a.sum() = {xx.a / xx.a.sum()}")

print("our ugly dataframe, weighted (correctly)")
xx_sum = xx.a.sum()
xx_pa_type = f"pint[{xx_sum.u}]"
print(f"xx_pa_type = {xx_pa_type}")
print(f"xx.a.astype('pint[m]') / xx.a.sum() = {xx.a.astype(xx_pa_type) / xx.a.sum()}")

print("our ugly dataframe, weighted (correctly, using imperial intermediate)")
xx_sum = xx.a.sum()
xx_pa_type = 'pint[foot]'
print(f"xx_pa_type = {xx_pa_type}")
print(f"xx.a.astype(xx_pa_type) / xx.a.astype(xx_pa_type).sum() = {xx.a.astype(xx_pa_type) / xx.a.astype(xx_pa_type).sum()}")

First off, I understand that things are ugly when dataframes are full of heterogeneous types. but as can be seen, we can sum such things correctly ("our ugly dataframe, summed (correctly)"). That's a problem I expect pint to solve, and it solves it. Bravo!

Now I want to use that sum to normalize my ugly dataframe, and all hell breaks loose, because the units of the numerator are dropped when diving by the denominator. This is surprising, because pint does check for unit compatibility, it's just that after it does it's check, it drops the units!

The behavior that would make me happiest is if the unit of the denominator could be used to normalize the units of the numerator before creating a dimensionless result. But frankly, it's OK to push everything through base_units if that's how it's got to be: the correct dimensionless result is correct no matter the compatible intermediate step.

andrewgsavage commented 1 year ago

is this still an issue? it looks as expected when I run it.

MichaelTiemannOSC commented 1 year ago

It's still a problem. Here's the ugly dataframe, weighted incorrectly:

>>> xx.a / xx.a.sum()
/Users/michael/opt/miniconda3/envs/pandas_2/lib/python3.11/site-packages/pandas/core/construction.py:580: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.array(data, copy=copy)
0    0.998001
1    1.996002
2    2.994003
Name: a, dtype: float64

To pint's credit, it warns that the units are being stripped, and the wrong answer is a good proof-point as to why stripping units is dangerous. Here's the code that does the weighting correctly:

>>> xx.a.astype('pint[km]') / xx.a.sum()
0        0.99800100398901
1     0.00199600200797802
2    2.99400301196703e-06
Name: a, dtype: pint[dimensionless]

An idea solution would be that when we divide a series by a quantity, we try to push the units of the quantity into the series first, then divide, rather than throwing up our hands about the fact that we don't have a PintArray, stripping the units, and producing a wrong result. Now that I'm much more familiar with the code I could chase down whether that's remotely practical or not. But that's the crux of this issue.

MichaelTiemannOSC commented 1 year ago

Perhaps this idea could also be used when arithmetic operators encounter series of type object that contain EA elements that could actually be turned into an EA:

https://github.com/pandas-dev/pandas/pull/54543/

andrewgsavage commented 1 year ago
>>> xx.a / xx.a.sum()
/Users/michael/opt/miniconda3/envs/pandas_2/lib/python3.11/site-packages/pandas/core/construction.py:580: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.array(data, copy=copy)
0    0.998001
1    1.996002
2    2.994003
Name: a, dtype: float64

That's an series containing an object numpy array divided by a quantity, so isn't touching pint-pandas. Nothing we can do there. Can this issue be closed since it isn't involving pint-pandas?

andrewgsavage commented 1 year ago

Perhaps this idea could also be used when arithmetic operators encounter series of type object that contain EA elements that could actually be turned into an EA:

pandas-dev/pandas#54543

in the example above you don't use a PintArray which makes that difficult as pandas won;t be able to use an EA method to know what object to return. it would need a more generic infer from object method, like is an issue for the constructor https://github.com/pandas-dev/pandas/issues/27995

MichaelTiemannOSC commented 1 year ago

On second thought, this really is not a problem with Pint-Pandas, because it doesn't involve PintArray. This is a very real problem with Pandas, which aggressively choses to strip units and warn rather than trying to align units and get the right answer. I'm going to close it because it's not a Pint-Pandas issue.