pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `isna` on pyarrow backed Series is returning Series with `bool` dtype instead of `bool[pyarrow]` #59431

Open thesword53 opened 3 months ago

thesword53 commented 3 months ago

Pandas version checks

Reproducible Example

>>> s = pd.Series([0, None, 4, 5], dtype="u1[pyarrow]")
>>> s
0       0
1    <NA>
2       4
3       5
dtype: uint8[pyarrow]

>>> s.isna()
0    False
1     True
2    False
3    False
dtype: bool

Issue Description

s.isna().dtype is NumPy's BoolDType (bool) instead of ArrowDtype(pa.bool_()) (bool[pyarrow]).

Expected Behavior

>>> s.isna()
0    False
1     True
2    False
3    False
dtype: bool[pyarrow]

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.cp1252
pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0
setuptools : 70.2.0
pip : None
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.9.0
numba : None
numexpr : 2.10.1
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

loicdiridollou commented 3 months ago

Hey @thesword53, I realized this issue also affects something I am working on, so I went down the rabbit hole. It seems that the Series gets cast to an np.ndarray, then the isna operation is applied, and when the Series object is rebuilt we lose the original (pyarrow) type. The rebuild appears to make no assumption about the dtype: since we pass an np.ndarray of bool, the dtype of the resulting Series is set to bool and not bool[pyarrow].

https://github.com/pandas-dev/pandas/blob/aa134bb9495754271f54a9b887ffcd85fca9d956/pandas/core/dtypes/missing.py#L208-L210

This also affects DataFrames: if the dtype of a column was originally uint8[pyarrow], the isna result for that column is cast to bool and not bool[pyarrow].

KevsterAmp commented 3 months ago

I'd like to work on this

KevsterAmp commented 3 months ago

take

KevsterAmp commented 3 months ago

Can't seem to assign the issue to myself, but I'll be opening a PR for this in a bit. Thanks @loicdiridollou for investigating further.

KevsterAmp commented 3 months ago

Take

rhshadrach commented 3 months ago

Ref: https://github.com/pandas-dev/pandas/pull/59436#pullrequestreview-2225761630

WillAyd commented 3 months ago

This is another good issue to track for PDEP-13 https://github.com/pandas-dev/pandas/pull/58455