pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `pd.isnull` treats `list` and `tuple` input differently #52283

Open jrbourbeau opened 1 year ago

jrbourbeau commented 1 year ago

Reproducible Example

import pandas as pd
print(f"{pd.isnull([1, 2, 3]) = }")
print(f"{pd.isnull((1, 2, 3)) = }")

Issue Description

It looks like pd.isnull treats a list as array-like but a tuple as a scalar.

Expected Behavior

I'd expect pd.isnull to treat lists and tuples the same way, as other parts of the API such as pd.Series do.
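
For comparison, here is a minimal sketch of the pd.Series behaviour referred to above. It assumes only that pandas is installed; the variable names are illustrative.

```python
import pandas as pd

# pd.Series accepts both a list and a tuple as array-like input.
from_list = pd.Series([1, 2, 3])
from_tuple = pd.Series((1, 2, 3))

# Compare the two results element by element.
print(from_list.equals(from_tuple))
```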

Installed Versions

```
INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.9.15.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.3.0
Version          : Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.3
numpy            : 1.24.0
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 59.8.0
pip              : 22.3.1
Cython           : None
pytest           : 7.2.0
hypothesis       : None
sphinx           : 4.5.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.7.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : 2023.2.0
fsspec           : 2022.11.0
gcsfs            : None
matplotlib       : 3.6.2
numba            : None
numexpr          : 2.8.3
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 11.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : 2022.11.0
scipy            : 1.9.3
snappy           :
sqlalchemy       : 1.4.46
tables           : 3.7.0
tabulate         : None
xarray           : 2022.9.0
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None
```

DeaMariaLeon commented 1 year ago

This function returns a boolean or array-like of bool. I'll keep the "bug" label just in case, but I don't think it is one.

https://pandas.pydata.org/docs/reference/api/pandas.isnull.html

jrbourbeau commented 1 year ago

Thanks @DeaMariaLeon. What seems off to me is that pd.isnull treats lists as array-like (returning an array-like of bools) but tuples as scalars (returning a single bool):

In [1]: import pandas as pd

In [2]: pd.isnull([1, pd.NA, 3])
Out[2]: array([False,  True, False])

In [3]: pd.isnull((1, pd.NA, 3))
Out[3]: False

My expectation is that both lists and tuples should be treated as array-like, though feel free to let me know if that expectation is incorrect.
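
In the meantime, one possible workaround is to convert a tuple to a list before calling pd.isnull. The sketch below is only an illustration: isnull_arraylike is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd


def isnull_arraylike(values):
    """Hypothetical helper: treat a tuple like a list before calling pd.isnull."""
    if isinstance(values, tuple):
        # pd.isnull handles a list element-wise, so convert first.
        values = list(values)
    return pd.isnull(values)


mask = isnull_arraylike((1, pd.NA, 3))
print(mask)
```

With this wrapper the tuple input goes through the same path as the list in In [2] above, so the result is a boolean array rather than a single bool.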

DeaMariaLeon commented 1 year ago

Oh, I see! Thank you for opening an issue. :)

phofl commented 1 year ago

This is an edge case I think.

You can end up with tuples from a MultiIndex for example. In this scenario we want to treat the tuple as a single element, e.g.

df.drop(columns=(1, 2))

treats the tuple as a single element. I think the situation here is similar, although it does not really look intuitive to me either.
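
A small sketch of the drop behaviour described above, assuming a frame whose columns are a MultiIndex. The column labels and values are made up for illustration.

```python
import pandas as pd

# Illustrative frame with MultiIndex columns; labels and values are made up.
columns = pd.MultiIndex.from_tuples([(1, 2), (1, 3), (2, 2)])
df = pd.DataFrame([[10, 20, 30]], columns=columns)

# The tuple names one column label, (1, 2), rather than two separate labels.
result = df.drop(columns=(1, 2))
print(list(result.columns))
```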

rhshadrach commented 1 year ago

Interestingly, tuples inside a list are treated as array-like:

import numpy as np
import pandas as pd

obj = [(1.0, 2.0), (1.0, np.nan), (np.nan, 2.0), (np.nan, np.nan)]
print(pd.isnull(obj))
# [[False False]
#  [False  True]
#  [ True False]
#  [ True  True]]

jorisvandenbossche commented 1 year ago

> treating lists as array-like (returning an array-like of bools) and tuples as scalar (returning a bool)

As far as I remember, in the past we made this distinction (in certain places) because tuples can be labels, as Patrick mentioned.

But it's indeed a tricky situation, with easy confusion and corner cases (a quick search for "tuple list label" turns up quite a few related issues). See, for example, https://github.com/pandas-dev/pandas/issues/43978 for the drop case.

Indexing is another place where the two are distinguished and behave differently:

>>> s = pd.Series(range(6), index=pd.MultiIndex.from_product([[1, 2, 3], [1, 2]]))
>>> s.loc[(1, 2)]  # tuple is a single label
1
>>> s.loc[[1, 2]]  # list is an indexer (in this case for the first level of the MultiIndex)
1  1    0
   2    1
2  1    2
   2    3
dtype: int64