Closed MichaelTiemannOSC closed 1 year ago
is this still an issue? it looks as expected when I run it.
It's still a problem. Here's the ugly dataframe, weighted incorrectly:
>>> xx.a / xx.a.sum()
/Users/michael/opt/miniconda3/envs/pandas_2/lib/python3.11/site-packages/pandas/core/construction.py:580: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
data = np.array(data, copy=copy)
0 0.998001
1 1.996002
2 2.994003
Name: a, dtype: float64
To pint's credit, it warns that the units are being stripped, and the wrong answer is a good proof-point as to why stripping units is dangerous. Here's the code that does the weighting correctly:
>>> xx.a.astype('pint[km]') / xx.a.sum()
0 0.99800100398901
1 0.00199600200797802
2 2.99400301196703e-06
Name: a, dtype: pint[dimensionless]
An idea solution would be that when we divide a series by a quantity, we try to push the units of the quantity into the series first, then divide, rather than throwing up our hands about the fact that we don't have a PintArray, stripping the units, and producing a wrong result. Now that I'm much more familiar with the code I could chase down whether that's remotely practical or not. But that's the crux of this issue.
Perhaps this idea could also be used when arithmetic operators encounter series of type object
that contain EA elements that could actually be turned into an EA:
>>> xx.a / xx.a.sum() /Users/michael/opt/miniconda3/envs/pandas_2/lib/python3.11/site-packages/pandas/core/construction.py:580: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray. data = np.array(data, copy=copy) 0 0.998001 1 1.996002 2 2.994003 Name: a, dtype: float64
That's an series containing an object numpy array divided by a quantity, so isn't touching pint-pandas. Nothing we can do there. Can this issue be closed since it isn't involving pint-pandas?
Perhaps this idea could also be used when arithmetic operators encounter series of type
object
that contain EA elements that could actually be turned into an EA:
in the example above you don't use a PintArray which makes that difficult as pandas won;t be able to use an EA method to know what object to return. it would need a more generic infer from object method, like is an issue for the constructor https://github.com/pandas-dev/pandas/issues/27995
On second thought, this really is not a problem with Pint-Pandas, because it doesn't involve PintArray. This is a very real problem with Pandas, which aggressively choses to strip units and warn rather than trying to align units and get the right answer. I'm going to close it because it's not a Pint-Pandas issue.
I've been chasing a bug that I expected pint to solve for me, not create for me. Here's the test case:
First off, I understand that things are ugly when dataframes are full of heterogeneous types. but as can be seen, we can sum such things correctly ("our ugly dataframe, summed (correctly)"). That's a problem I expect pint to solve, and it solves it. Bravo!
Now I want to use that sum to normalize my ugly dataframe, and all hell breaks loose, because the units of the numerator are dropped when diving by the denominator. This is surprising, because pint does check for unit compatibility, it's just that after it does it's check, it drops the units!
The behavior that would make me happiest is if the unit of the denominator could be used to normalize the units of the numerator before creating a dimensionless result. But frankly, it's OK to push everything through base_units if that's how it's got to be: the correct dimensionless result is correct no matter the compatible intermediate step.