hgrecco / pint-pandas

Pandas support for pint

Fix issue where quantify doesn't work on DataFrames containing string columns #217

Closed Nick-Hemenway closed 3 months ago

Nick-Hemenway commented 3 months ago

I'm hoping someone can help me a bit. I've never contributed to an open source project and am not sure what the protocol is or how to run tests, so I appreciate your patience. I'd be happy to modify my request as needed.

The simple change I made enables the quantify method on a DataFrame to work even if the DataFrame contains string columns. Currently this works, but there's an unintended side effect. I've included a minimal example below to show what I mean.

import pandas as pd
import pint_pandas

df = pd.DataFrame(
    {
        'power': pd.Series([1.0,2.0,3.0], dtype='pint[W]'),
        'torque': pd.Series([4.0,5.0,6.0], dtype='pint[N*m]'),
        'fruits': pd.Series(['apple', 'pear', 'kiwi'])
    }
)
df1 = df.pint.dequantify().pint.quantify(level=-1)
print(df1.power.pint.m)

The current code would output: [screenshot: the magnitudes print as a Series with dtype object]

After the update I made: [screenshot: the magnitudes print as a Series with dtype float64]

As can be seen, the current code converts the underlying data to an object dtype when it should remain float64. This happens because the current code converts the entire DataFrame to a 2D array before indexing (at which point the whole array is cast to object). My modification indexes into individual columns, so the data is never cast to a different underlying type.
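
For anyone less familiar with the upcasting behaviour, here is a minimal sketch (independent of pint-pandas; the column names are made up) showing why going through one 2D array loses per-column dtypes, while indexing columns individually preserves them:

import pandas as pd

mixed = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': ['apple', 'pear', 'kiwi']})

# .values builds a single 2D array for the whole frame; mixed dtypes force object
print(mixed.values[:, 0].dtype)  # object

# indexing one column at a time keeps that column's own dtype
print(mixed.iloc[:, 0].dtype)    # float64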

andrewgsavage commented 3 months ago

Nice fix! Ignore the 3.9 failure on master.

Add a test to pint_pandas/testsuite/test_issues.py. Make a class/function like https://github.com/hgrecco/pint-pandas/blob/f9d3a66e3b1f069f90dafb8416bc16c0cb411de9/pint_pandas/testsuite/test_issues.py#L55. You can use this PR number as the issue number.

The code in your post is basically the test already; just add an assert to check that the two DataFrames are the same:

import pandas as pd
import pint_pandas
import pandas._testing as tm  # assuming tm is pandas' internal testing module, as used in the test suite

df = pd.DataFrame(
    {
        'power': pd.Series([1.0, 2.0, 3.0], dtype='pint[W]'),
        'torque': pd.Series([4.0, 5.0, 6.0], dtype='pint[N*m]'),
        'fruits': pd.Series(['apple', 'pear', 'kiwi'])
    }
)
df1 = df.pint.dequantify().pint.quantify(level=-1)

tm.assert_equal(df1, df)

Nick-Hemenway commented 3 months ago

I've added a new TestIssue217 class and verified that it fails on the old code and passes on the updated code. Note that it wasn't as simple as using assert_equal(df1, df), because that passed even on the original code. It doesn't capture the issue highlighted in this PR, because the float64 and object dtypes are hidden behind the PintArray dtype when checking whether the DataFrames are equal.
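
A minimal sketch of what such a test might look like (the class and assertions here are illustrative; the test actually added in the PR may differ), checking the underlying magnitude dtype rather than relying on assert_equal alone:

import pandas as pd
import pint_pandas


class TestIssue217:
    def test_roundtrip_preserves_float64(self):
        df = pd.DataFrame(
            {
                'power': pd.Series([1.0, 2.0, 3.0], dtype='pint[W]'),
                'torque': pd.Series([4.0, 5.0, 6.0], dtype='pint[N*m]'),
                'fruits': pd.Series(['apple', 'pear', 'kiwi']),
            }
        )
        df1 = df.pint.dequantify().pint.quantify(level=-1)
        # the magnitudes must keep their original float64 dtype, not be upcast to object
        assert df1.power.pint.m.dtype == 'float64'
        assert df1.torque.pint.m.dtype == 'float64'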

andrewgsavage commented 3 months ago

Looks good. You'll need to run pre-commit run --all-files to get the linter to pass.

andrewgsavage commented 3 months ago

> I've added a new TestIssue217 class and verified that it fails on the old code and passes on the updated code. Note that it wasn't as simple as using assert_equal(df1, df), because that passed even on the original code. It doesn't capture the issue highlighted in this PR, because the float64 and object dtypes are hidden behind the PintArray dtype when checking whether the DataFrames are equal.

Out of interest, do you find it an issue that the float64/object dtype isn't visible, so that two PintArrays can look the same but actually be different?

Nick-Hemenway commented 3 months ago

Yes, I think that might be the underlying issue here (for the testing). The two DataFrames should not pass the equality test, but the underlying pint arrays report that they are equal even though they really aren't, since the data they contain is of a different type.

Example:

import numpy as np
import pandas as pd
from pint_pandas import PintArray

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['apple', 'pear', 'kiwi']})

p1 = PintArray(df.to_numpy()[:, 0], 'm')  # inner numpy array contains data of type 'object'
p2 = PintArray(np.array([1, 2, 3], dtype=np.float64), 'm')

p1.equals(p2)  # returns True when it should return False

andrewgsavage commented 3 months ago

Yeah, I think you're right; that's how the pandas ExtensionArray (EA) behaves:



>>> p1.data
<NumpyExtensionArray>
[1, 2, 3]
Length: 3, dtype: object
>>> p2.data
<NumpyExtensionArray>
[1.0, 2.0, 3.0]
Length: 3, dtype: float64

>>> p1.data==p2.data
<NumpyExtensionArray>
[True, True, True]
Length: 3, dtype: bool

>>> p1.data.equals(p2.data)
False
andrewgsavage commented 3 months ago

Thanks for fixing that!

thaeber commented 2 months ago

I just stumbled upon the same problem, but for DataFrames that also contain numeric columns without units. The current fix only avoids the unwanted conversion to dtype='object' for columns with units; the same conversion still happens for numeric columns without units.

In the following example, the "something_else" column has numerical values, but no associated units:

import pandas as pd
import pint_pandas

df = pd.DataFrame(
    {
        'power': pd.Series([1.0, 2.0, 3.0], dtype='pint[W]'),
        'torque': pd.Series([4.0, 5.0, 6.0], dtype='pint[N*m]'),
        'fruits': pd.Series(['apple', 'pear', 'kiwi']),
        'something_else': pd.Series([7.0, 8.0, 9.0]),
    }
)
df1 = df.pint.dequantify().pint.quantify(level=-1)

print(df.dtypes)

print(df1.dtypes)

1st print statement: [screenshot: the original dtypes, with something_else as float64]

2nd print statement: [screenshot: after the round trip, something_else has dtype object]

This could probably be fixed by changing the "else" branch in the quantify code from else df.values[:, i] to else df.iloc[:, i], similar to the "then" branch in the current fix.
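
A minimal sketch of the idea behind that change (illustrative only, not the actual pint-pandas source; i stands for whatever column index quantify is handling at that point):

# current else branch: goes through the whole-frame 2D array, which upcasts
# mixed-dtype frames to object
col = df.values[:, i]

# proposed else branch: index the single column, preserving its own dtype
# (float64 for 'something_else' in the example above)
col = df.iloc[:, i]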

andrewgsavage commented 2 months ago

A PR to fix that is welcome.