pandas-dev / pandas


BUG (string dtype): comparison of string column to mixed object column fails #60228

Open jorisvandenbossche opened 2 hours ago

jorisvandenbossche commented 2 hours ago

At the moment, you can freely compare a string column with a mixed object-dtype column:

>>> import pandas as pd
>>> ser_string = pd.Series(["a", "b"])
>>> ser_mixed = pd.Series([1, "b"])
>>> ser_string == ser_mixed
0    False
1     True
dtype: bool

But with the string dtype enabled (using pyarrow), the same comparison now raises an error:

>>> pd.options.future.infer_string = True
>>> ser_string = pd.Series(["a", "b"])
>>> ser_mixed = pd.Series([1, "b"])
>>> ser_string == ser_mixed
...
File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:510, in ArrowExtensionArray._box_pa_array(cls, value, pa_type, copy)
...
--> 510     pa_array = pa.array(value, from_pandas=True)
...
ArrowInvalid: Could not convert 'b' with type str: tried to convert to int64

This happens because the ArrowExtensionArray (ArrowEA) tries to convert the other operand to an Arrow array as well, and that conversion fails for mixed types.

In general, I think our rule is that an == comparison never fails, but instead just gives False where the values are not comparable.
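
To make that rule concrete, here is a minimal sketch in plain Python (no pandas or pyarrow). It is only an illustration of the intended semantics, not how pandas implements or should implement the fix. Both `to_homogeneous` and `eq_never_raises` are hypothetical names; `to_homogeneous` is a simplified stand-in for a typed conversion step that rejects mixed-type input, roughly analogous to the conversion that fails in the traceback above.

```python
def to_homogeneous(values):
    """Hypothetical stand-in for a typed conversion: reject mixed-type input."""
    kinds = {type(v) for v in values if v is not None}
    if len(kinds) > 1:
        names = sorted(k.__name__ for k in kinds)
        raise TypeError(f"cannot convert mixed types: {names}")
    return list(values)


def eq_never_raises(left, right):
    """Elementwise ==, falling back to per-element comparison for mixed input."""
    if len(left) != len(right):
        raise ValueError("operands must have the same length")
    try:
        # "Fast path": in a real array library this would feed a typed kernel.
        typed = to_homogeneous(right)
    except TypeError:
        # Conversion failed, so compare element by element instead of raising.
        # Python's == gives False for unrelated types such as 1 and "a".
        return [a == b for a, b in zip(left, right)]
    return [a == b for a, b in zip(left, typed)]


print(eq_never_raises(["a", "b"], [1, "b"]))  # [False, True]
```

The point of the sketch is only the shape of the behavior: a failed conversion of the other operand is treated as "not comparable" for the affected elements rather than as an error for the whole comparison.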

jorisvandenbossche commented 2 hours ago

It seems we actually have a comment in the code about this issue for the object-dtype case:

https://github.com/pandas-dev/pandas/blob/692ea6f9d4b05187a05f0811d3241211855d6efb/pandas/core/arrays/arrow/array.py#L728-L734