Closed Nick-Hemenway closed 3 months ago
Nice fix! ignore the 3.9 master failure
Add a test to pint_pandas/testsuite/test_issues.py. Make a class/function like https://github.com/hgrecco/pint-pandas/blob/f9d3a66e3b1f069f90dafb8416bc16c0cb411de9/pint_pandas/testsuite/test_issues.py#L55 You can use this PR# as the issue #.
the code in your post is basically the test already, just add an assert to check the two dfs are same
df = pd.DataFrame(
{
'power': pd.Series([1.0,2.0,3.0], dtype='pint[W]'),
'torque': pd.Series([4.0,5.0,6.0], dtype='pint[N*m]'),
'fruits': pd.Series(['apple', 'pear', 'kiwi'])
}
)
df1 = df.pint.dequantify().pint.quantify(level=-1)
tm.assert_equal(df1, df)
I've added a new TestIssue217
class and verified that it failed on the old code and works on the updated code. Note that it wasn't as simple as using assert_equal(df1, df)
because this passed on the original code. This doesn't capture the issue highlighted in this PR because the float64 and object datatypes are hidden by the PintArray datatypes when checking if the DataFrames are equal
Looks good, you'll need to run pre-commit run --all-files
to get the linter to pass
I've added a new
TestIssue217
class and verified that it failed on the old code and works on the updated code. Note that it wasn't as simple as usingassert_equal(df1, df)
because this passed on the original code. This doesn't capture the issue highlighted in this PR because the float64 and object datatypes are hidden by the PintArray datatypes when checking if the DataFrames are equal
Out of interest, do you find the lack of visibility of the float64/object dtype making the PintArrays look the same but actually be different an issue?
Yes, I think that might be the underlying issue here (for the testing). The two DataFrames should not pass the "isequal" test but the underlying pint arrays show that they are equal even though they really aren't as the data they contain is of a different type.
Example:
df = pd.DataFrame({'a':[1,2,3], 'b':['apple', 'pear', 'kiwi']})
p1 = PintArray(df.to_numpy()[:,0], 'm') #inner numpy array contains data of type 'object'
p2 = PintArray(np.array([1,2,3], dtype=np.float64), 'm')
p1.equals(p2) #returns True when it should return False
Yea I think you're right, that's how pandas EA behaves:
>>> p1.data
<NumpyExtensionArray>
[1, 2, 3]
Length: 3, dtype: object
>>> p2.data
<NumpyExtensionArray>
[1.0, 2.0, 3.0]
Length: 3, dtype: float64
>>> p1.data==p2.data
<NumpyExtensionArray>
[True, True, True]
Length: 3, dtype: bool
>>> p1.data.equals(p2.data)
False
thanks for fixing that
I just stumbled upon the same problem, but for data frames that also contain numeric columns without units. The current fix only seems to fix the unwanted type conversion to dtype='object' for columns with units, while the same happens with numeric columns without units.
In the following example, the "something_else" column has numerical values, but no associated units:
import pandas as pd
import pint_pandas
df = pd.DataFrame(
{
'power': pd.Series([1.0, 2.0, 3.0], dtype='pint[W]'),
'torque': pd.Series([4.0, 5.0, 6.0], dtype='pint[N*m]'),
'fruits': pd.Series(['apple', 'pear', 'kiwi']),
'something_else': pd.Series([7.0, 8.0, 9.0]),
}
)
df1 = df.pint.dequantify().pint.quantify(level=-1)
print(df.dtypes)
print(df1.dtypes)
1st print statement:
2nd print statement:
This could probably be fixed by changing the "else" branch at code from else df.values[:, i]
to else df.iloc[:, i]
, similar to the "then" branch in the current fix.
a PR to fix that is welcomed.
I'm hoping someone can help me a bit. I've never contributed to an open source project and am not sure what the protocol is or how to run tests so I appreciate your patience. I'd be happy to modify my request as needed.
The simple change I made enables the
quantify
method on a DataFrame to work even if the DataFrame contains string columns. Currently this works, but there's a unintended side effect. I've included a minimal example below to show what I mean.The current code would output:![2024-04-02_13-07-44](https://github.com/hgrecco/pint-pandas/assets/47535659/a48ea13c-499f-43f2-a02f-37e5bb8adf50)
After the update I made:![2024-04-02_13-12-15](https://github.com/hgrecco/pint-pandas/assets/47535659/bc5e36bb-9fe2-4aaa-aec4-528f74a57513)
As can be seen, the current code converts the underlying data to an
object
data type when it should remainfloat64
. This occurs because the current code takes the entire DataFrame and converts it to a 2D array before indexing (at which point the entire array is converted toobject
datatypes. My modification indexes into individual columns so that the data is never cast to a different underlying type.pre-commit run --all-files
with no errors