pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Series.gt (and other comparison methods) can fail with dtype=object #59418

Open warwickmm opened 3 months ago

warwickmm commented 3 months ago

Reproducible Example

>>> import pandas as pd
>>> 
>>> x = pd.Series([None], dtype=object)
>>> y = pd.Series([0])

# This raises "TypeError: '>' not supported between instances of 'NoneType' and 'int'".
>>> x.gt(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/test/venv/lib/python3.12/site-packages/pandas/core/series.py", line 6300, in gt
    return self._flex_method(
           ^^^^^^^^^^^^^^^^^^
  File "/home/test/venv/lib/python3.12/site-packages/pandas/core/series.py", line 6246, in _flex_method
    return self._binop(other, op, level=level, fill_value=fill_value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test/venv/lib/python3.12/site-packages/pandas/core/series.py", line 6195, in _binop
    result = func(this_vals, other_vals)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>' not supported between instances of 'NoneType' and 'int'

# This runs without error.
>>> x > y
0    False
dtype: bool

# When converted to DataFrames (with object dtypes), .gt runs without error:
>>> x.to_frame().gt(y.to_frame())
       0
0  False

# If the series has dtype=float, the comparison runs without error.
>>> x.astype(float).gt(y)
0    False
dtype: bool

Issue Description

When a Series has dtype=object, comparison methods (e.g., .gt) can raise "TypeError: '>' not supported between instances of 'NoneType' and 'int'". No error occurs when using the > operator, when calling DataFrame.gt, or when the Series has dtype=float.

Expected Behavior

When the Series has dtype=object, the behavior of Series.gt should be consistent with the > operator and with the DataFrame.gt method.

Installed Versions

```
INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.4.final.0
python-bits : 64
OS : Linux
OS-release : 6.10.2-arch1-1
Version : #1 SMP PREEMPT_DYNAMIC Sat, 27 Jul 2024 16:49:55 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 71.1.0
pip : 23.2.1
Cython : None
pytest : 8.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.9.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
```

Patsnoop commented 3 months ago

I would like to work on this

rhshadrach commented 3 months ago

Thanks for the report - it seems to me that comparing None to, e.g., integers should raise. My guess is that x > y succeeding is a result of assuming None is an NA value and hence behaves like np.nan (always False for comparisons). Further investigations are welcome!
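
For context, the two scalar behaviors contrasted in this comment can be seen in plain Python, without pandas. This is only a minimal sketch: try_gt is a hypothetical helper introduced for illustration, and math.nan stands in for np.nan (both are ordinary float NaN values).

```python
import math


def try_gt(left, right):
    """Return left > right, or the exception name if the comparison raises."""
    try:
        return left > right
    except TypeError as exc:
        return type(exc).__name__


# NaN is a float, so the comparison is allowed and evaluates to False.
nan_result = try_gt(math.nan, 0)

# None has no ordering against int, so Python raises TypeError.
none_result = try_gt(None, 0)

print(nan_result, none_result)
```

Which of these two behaviors pandas should apply to None in an object-dtype Series is the open question in this issue; the sketch does not show what pandas does internally.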

KevsterAmp commented 3 months ago

take

KevsterAmp commented 3 months ago

@rhshadrach - Any ideas for a fix? Do we raise an error when "<" is used between Series that contain None?

rhshadrach commented 3 months ago

That seems like the correct behavior to me - yes.

warwickmm commented 3 months ago

Should DataFrame.gt raise an error as well?

warwickmm commented 3 months ago

Also, should one expect the behavior to be consistent across all values for which pd.isna returns True (e.g., None, np.nan, pd.NA, etc.)? Or does one need to be cognizant of how missing values are represented in each instance?

rhshadrach commented 3 months ago

My above comments are only regarding Python's None when stored in an object-dtype column or Series.
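
As background to the question about values for which pd.isna returns True, the following sketch compares how None, np.nan, and pd.NA behave as bare scalars under ">". It assumes pandas 1.0 or later (where pd.NA exists); scalar_gt_zero and the markers dict are hypothetical names used only for illustration. It covers scalar comparisons only, not Series.gt or DataFrame.gt.

```python
import numpy as np
import pandas as pd

# Three common missing-value markers; pd.isna treats all of them as missing.
markers = {"None": None, "np.nan": np.nan, "pd.NA": pd.NA}


def scalar_gt_zero(value):
    """Return value > 0, or the string "TypeError" if the comparison raises."""
    try:
        return value > 0
    except TypeError:
        return "TypeError"


all_missing = all(pd.isna(v) for v in markers.values())
results = {name: scalar_gt_zero(v) for name, v in markers.items()}

print(all_missing)
for name, outcome in results.items():
    print(name, "->", outcome)
```

At the scalar level the three markers already disagree: None raises, np.nan yields False, and pd.NA propagates as pd.NA. That suggests pd.isna alone does not guarantee uniform comparison semantics.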

warwickmm commented 3 months ago

Thanks. I'll just note that the below also currently runs without error. Not sure if that's a situation that needs to be considered as well.

>>> x = pd.Series([None], dtype=object)
>>> x.gt(0)
0    False
dtype: bool

maushumee commented 3 months ago

Hi @warwickmm! Are you working on this? If not, I would like to take this up.

warwickmm commented 3 months ago

I am not.

maushumee commented 3 months ago

take