hgrecco / pint-pandas

Pandas support for pint

Add support for pandas eval #137

Closed andyr0id closed 2 months ago

andyr0id commented 1 year ago

Hello, I'm running into an issue with using pint_pandas with pandas' eval function:

import pandas as pd
import pint_pandas

df = pd.DataFrame({'a': pd.Series([1., 2., 3.], dtype='pint[meter]'), 'b': pd.Series([4., 5., 6.], dtype='pint[second]')})
# this works as expected
print(df['a'] / df['b'])
"""
0    0.25
1     0.4
2     0.5
dtype: pint[meter / second]
"""

# this is not working
print(df.eval('a / b'))
"""
TypeError: Cannot interpret 'pint[meter]' as a data type
"""

The problem arises when pandas checks whether the columns are numeric by calling isnumeric in pandas/core/computation/ops.py. The dtype is passed on to np.dtype(dtype), which raises the TypeError shown in the stack trace below.

It's important for my application to use pandas eval, and the above is just a toy example.

Full stack trace:

Traceback (most recent call last):
  File "scratch_8.py", line 8, in <module>
    print(df.eval('a / b'))
  File "pandas/core/frame.py", line 4240, in eval
    return _eval(expr, inplace=inplace, **kwargs)
  File "pandas/core/computation/eval.py", line 350, in eval
    parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
  File "pandas/core/computation/expr.py", line 811, in __init__
    self.terms = self.parse()
  File "pandas/core/computation/expr.py", line 830, in parse
    return self._visitor.visit(self.expr)
  File "pandas/core/computation/expr.py", line 415, in visit
    return visitor(node, **kwargs)
  File "pandas/core/computation/expr.py", line 421, in visit_Module
    return self.visit(expr, **kwargs)
  File "pandas/core/computation/expr.py", line 415, in visit
    return visitor(node, **kwargs)
  File "pandas/core/computation/expr.py", line 424, in visit_Expr
    return self.visit(node.value, **kwargs)
  File "pandas/core/computation/expr.py", line 415, in visit
    return visitor(node, **kwargs)
  File "pandas/core/computation/expr.py", line 538, in visit_BinOp
    return self._maybe_evaluate_binop(op, op_class, left, right)
  File "pandas/core/computation/expr.py", line 505, in _maybe_evaluate_binop
    res = op(lhs, rhs)
  File "pandas/core/computation/expr.py", line 541, in <lambda>
    return lambda lhs, rhs: Div(lhs, rhs)
  File "pandas/core/computation/ops.py", line 536, in __init__
    if not isnumeric(lhs.return_type) or not isnumeric(rhs.return_type):
  File "pandas/core/computation/ops.py", line 520, in isnumeric
    return issubclass(np.dtype(dtype).type, np.number)
TypeError: Cannot interpret 'pint[meter]' as a data type

andrewgsavage commented 1 year ago

I think isnumeric will need to return True for PintType dtypes. I had a similar issue when plotting, where PintType columns were filtered out by df.select_dtypes(is_numeric): https://github.com/pandas-dev/pandas/issues/35340

You should open an issue in pandas for this.

mutricyl commented 3 months ago

pandas.core.computation.ops.isnumeric is essentially based on numpy:

def isnumeric(dtype) -> bool:
    return issubclass(np.dtype(dtype).type, np.number)

At this stage the dtype variable is of type <class 'pint_pandas.pint_array.PintType'>. Adding a dtype property that returns an actual numeric dtype to this class lets it pass the isnumeric test. The final result is somewhat disappointing, however, because the units are lost in the process:

c:\Users\xxxxxxx\AppData\Local\miniconda3\Lib\site-packages\pandas\core\arrays\numpy_.py:127: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  result = np.asarray(scalars, dtype=dtype)  # type: ignore[arg-type]
0    0.25
1    0.40
2    0.50
dtype: float64
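
The loss of type information can be reproduced without pint at all. The following is a minimal sketch that uses pandas' nullable Int64 array as a stand-in for a PintArray (an assumption made for illustration only): once an extension array is pushed through np.asarray with a plain NumPy dtype, nothing of the extension dtype survives.

```python
import numpy as np
import pandas as pd

# Stand-in for a PintArray: any pandas ExtensionArray shows the same effect.
ext = pd.array([1, 2, 3], dtype="Int64")

# Same kind of conversion as in the warning above: the extension array is
# downcast to a plain NumPy ndarray with a NumPy dtype.
as_ndarray = np.asarray(ext, dtype="float64")

# Wrapping the ndarray back into a Series cannot restore the extension dtype.
roundtrip = pd.Series(as_ndarray)

print(ext.dtype)        # Int64
print(roundtrip.dtype)  # float64
```

For a PintArray the same conversion is what strips the units and triggers the UnitStrippedWarning.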

andrewgsavage commented 3 months ago

I faced a similar issue when getting plotting to work: a similar isnumeric function was returning False. isnumeric should also check the _is_numeric attribute of the dtype when the dtype is an ExtensionDtype.

The np.asarray(...) call shown in the warning would also need changing so that it can return an ExtensionArray.

These are things that would need changing in pandas. You'll need to open an issue and a PR there.
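
The _is_numeric check suggested above could look roughly like the sketch below. The function name isnumeric_sketch is hypothetical, and this is not necessarily what pandas ended up implementing; it only shows the idea of consulting the extension dtype before falling back to NumPy.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def isnumeric_sketch(dtype) -> bool:
    """Hypothetical variant of isnumeric that understands extension dtypes."""
    if isinstance(dtype, ExtensionDtype):
        # Extension dtypes declare for themselves whether they are numeric.
        return bool(dtype._is_numeric)
    # Plain NumPy dtypes keep the original numpy-based check.
    return issubclass(np.dtype(dtype).type, np.number)


print(isnumeric_sketch(np.dtype("float64")))     # True
print(isnumeric_sketch(pd.Int64Dtype()))         # True
print(isnumeric_sketch(pd.CategoricalDtype()))   # False
```

With a check of this shape, a dtype such as PintType would only need to report _is_numeric as True to pass this stage.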

mutricyl commented 3 months ago

I ran a test to confirm that the issue is definitely on the pandas side. I used IntegerArrays, which are another type of ExtensionArray, and eval also fails at the isnumeric stage.

df = pd.DataFrame({'a': pd.array([1, 2, 3]), 'b': pd.array([4, 5, 6])})
df.eval('a / b')

This leads to TypeError: Cannot interpret 'Int64Dtype()' as a data type

I'll open an issue there.

mutricyl commented 3 months ago

  1. The isnumeric part of the issue is being solved.
  2. There is a second issue afterwards: the pandas.core.computation.ops.Div class calls _cast_inplace to cast the input arrays to floats, losing the ExtensionArray specifics in the process. (Incidentally, this also causes an issue when performing eval on complex values.) In my opinion this cast is no longer needed (it was introduced 6 years ago), and we should ask the pandas team to remove it.
  3. There is then a third issue: pandas.core.computation.ops.Op.has_invalid_return_type is called from pandas.core.computation.expr.BaseExprVisitor._maybe_evaluate_binop, which leads to a TypeError. Bypassing this check gives the proper result. This is not an issue with pandas' built-in extension arrays, so there may be something specific to do in pint-pandas. pandas.core.common.result_type_many is also involved in the process and may be implicated.

This is more or less a note to myself, but feel free to drop a comment if any of the above rings a bell.

mutricyl commented 3 months ago

For the third item, we can resolve the issue by overriding the _get_common_dtype method of ExtensionDtype in PintType. Examples of this method can be found in pandas.core.dtypes.dtypes.BaseMaskedDtype._get_common_dtype and pandas.core.dtypes.base.ExtensionDtype._get_common_dtype. The second one is pretty trivial but does not fit our needs.

I think we can go as simple as the following example:

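from pandas.api.extensions import ExtensionDtype

class PintType(ExtensionDtype):
    # (...) rest of the class unchanged

    def _get_common_dtype(self, dtypes):
        # Report this dtype as the common dtype of the operands.
        return self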
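
To make the effect of the override concrete, here is a minimal sketch using a toy dtype. ToyUnitDtype and ToyUnitDtypeCommon are hypothetical stand-ins, not the pint-pandas PintType; they only carry a units string. It contrasts the default ExtensionDtype._get_common_dtype, which returns None when the dtypes differ, with the `return self` override proposed above.

```python
from pandas.api.extensions import ExtensionDtype


class ToyUnitDtype(ExtensionDtype):
    """Hypothetical stand-in for PintType that only stores a units label."""

    type = float
    _metadata = ("units",)  # used by ExtensionDtype for == and hash()

    def __init__(self, units):
        self.units = units

    @property
    def name(self):
        return f"toy[{self.units}]"


class ToyUnitDtypeCommon(ToyUnitDtype):
    """Same toy dtype, with the override proposed in this thread."""

    def _get_common_dtype(self, dtypes):
        return self


# Default behaviour: no common dtype when the units differ.
meter, second = ToyUnitDtype("meter"), ToyUnitDtype("second")
print(meter._get_common_dtype([meter, second]))  # None

# With the override: the dtype reports itself as the common dtype.
m2, s2 = ToyUnitDtypeCommon("meter"), ToyUnitDtypeCommon("second")
print(m2._get_common_dtype([m2, s2]))  # toy[meter]
```

Note that `return self` reports the first dtype even when the units differ. Whether that is acceptable for pint-pandas beyond getting past the return-type check is not settled in this thread.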

mutricyl commented 2 months ago

  1. ~The isnumeric part of the issue is being solved.~
  2. ~There is a second issue afterwards: the pandas.core.computation.ops.Div class calls _cast_inplace to cast the input arrays to floats, losing the ExtensionArray specifics in the process. (Incidentally, this also causes an issue when performing eval on complex values.) In my opinion this cast is no longer needed (it was introduced 6 years ago), and we should ask the pandas team to remove it.~
  3. There is then a third issue: pandas.core.computation.ops.Op.has_invalid_return_type is called from pandas.core.computation.expr.BaseExprVisitor._maybe_evaluate_binop, which leads to a TypeError. Bypassing this check gives the proper result. This is not an issue with pandas' built-in extension arrays, so there may be something specific to do in pint-pandas. pandas.core.common.result_type_many is also involved in the process and may be implicated.

The first two points have been handled in pandas core. We can now concentrate on the third item.