pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: python crashes on filtering with `.loc` on boolean Series with `dtype_backend=pyarrow` on some dataframes. #52059

pstorozenko closed this issue 1 year ago

pstorozenko commented 1 year ago

This bug is specific to pandas 2.0. On 1.5.3 (with the numpy dtype backend) everything works.

Pandas version checks

Reproducible Example

Using the parquet file from wiki100.zip:

import pandas as pd

wiki = pd.read_parquet("wiki100.parquet", engine="pyarrow", dtype_backend="pyarrow")
wiki.loc[wiki['curr'] == "Warsaw", :]  # crashes with a floating point exception

Issue Description

After running this script I get:

➜  python 02_pandas20.py
[1]    24955 floating point exception (core dumped)  python 02_pandas20.py

Expected Behavior

I should get the result of this query:

                                                       prev    curr  type    n
13488247                                            Trumpet  Warsaw  link   10
13488399                                Bronislava_Nijinska  Warsaw  link   33
13488166  List_of_European_cities_by_population_within_c...  Warsaw  link  480
13488365                               Warsaw_pogrom_(1881)  Warsaw  link   10
13488408                                     Witold_Pilecki  Warsaw  link   19

Installed Versions

INSTALLED VERSIONS
------------------
commit : 23c3dc2c379eb325fd4f33be78338730d3f35731
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-35-generic
Version : #36~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 17 15:17:25 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+246.g23c3dc2c37
numpy : 1.25.0.dev0+918.g28bce82c8
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2022.7
qtpy : None
pyqt5 : None

Context

The file is a subset of the wiki clickstream data.

This code works fine if I don't set dtype_backend="pyarrow". Tested on both 2.0.0rc1 and nightly.

The operation wiki['curr'] == "Warsaw" executes correctly, so the issue is with filtering on the resulting boolean array. Sorry for not providing a simpler example; all the hand-crafted ones I tried worked every time.
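
To isolate the failing step, here is a minimal sketch (same file and column as above; the comparison succeeds, only the boolean indexing crashes):

import pandas as pd

wiki = pd.read_parquet("wiki100.parquet", engine="pyarrow", dtype_backend="pyarrow")
mask = wiki['curr'] == "Warsaw"  # this step works fine
print(mask.dtype)                # bool[pyarrow]
wiki.loc[mask, :]                # the crash happens here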

eloyfelix commented 1 year ago

I have found what I believe is the same bug.

Tested in 2.0.0rc1 and on the master branch (2.1.0.dev0+265.gd8d1a474c7).

Using a CSV file with a single string column, bug.csv.gz:

import pandas as pd
df = pd.read_csv("bug.csv", dtype={"data_validity_comment": "string[pyarrow]"})
outside = df[df["data_validity_comment"] == "Outside typical range"]
# python crashes

The dataframe's shape is (145000, 1). It stops crashing if I truncate it to exactly 143116 rows:

import pandas as pd
df = pd.read_csv("bug.csv", dtype={"data_validity_comment": "string[pyarrow]"})
df = df[0:143116].copy()
outside = df[df["data_validity_comment"] == "Outside typical range"]

This is also the simplest example I could come up with...
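
For what it's worth, a threshold like 143116 can be located mechanically. A hedged bisection sketch, assuming the crash is deterministic and monotone in the slice length (short slices safe, long ones crashing), and running each probe in a child process so the parent survives the core dump:

import subprocess
import sys

# Child-process snippet; braces are doubled so str.format only fills in {n}.
SNIPPET = """
import pandas as pd
df = pd.read_csv("bug.csv", dtype={{"data_validity_comment": "string[pyarrow]"}})
df = df[0:{n}].copy()
df[df["data_validity_comment"] == "Outside typical range"]
"""

def crashes(n: int) -> bool:
    # A child killed by a signal (e.g. SIGFPE) returns a nonzero code.
    proc = subprocess.run([sys.executable, "-c", SNIPPET.format(n=n)])
    return proc.returncode != 0

lo, hi = 1, 145000  # assumed known-good and known-crashing lengths
while hi - lo > 1:
    mid = (lo + hi) // 2
    if crashes(mid):
        hi = mid
    else:
        lo = mid
print("first crashing length:", hi)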

phofl commented 1 year ago

cc @mroeschke

Looks like this is coming from pc.replace_with_mask(values, pa.array(mask), replacements) in _replace_with_mask. There is a comment that this caused segfaults for earlier versions, e.g. pyarrow < 8. Looks like it is not completely solved. Thoughts?

The following reproduces for me:

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

x = pa.scalar(False, type=pa.bool_(), from_pandas=True)
arr = pa.chunked_array([np.array([True] * 5)])
mask = pa.array([False] * 5)
pc.replace_with_mask(arr, mask, pa.array([False] * 5))  # crashes

lukemanley commented 1 year ago

Calling arr.combine_chunks() before passing to pc.replace_with_mask seems to work. Probably a good idea to report upstream since this is happening with the latest version of pyarrow.
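
Applied to the reproducer above, a sketch of that workaround (combine_chunks flattens the ChunkedArray into a single contiguous chunk before the kernel runs):

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.chunked_array([np.array([True] * 5)])
mask = pa.array([False] * 5)

flat = arr.combine_chunks()  # flatten to a single chunk first
pc.replace_with_mask(flat, mask, pa.array([False] * 5))  # no crash here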

pstorozenko commented 1 year ago

Unfortunately I've never used pyarrow directly. Could you report it, since you know much better what is actually broken here?

lukemanley commented 1 year ago

I opened https://github.com/apache/arrow/issues/34634 upstream in the arrow repo.

char101 commented 1 year ago

I also experienced this; a pyarrow-backed Series contains <NA> in the comparison result:

pd.Series([1, pd.NA, 2]) > 0
Out[24]: 
0     True
1    False
2     True
dtype: bool

pd.Series([1, pd.NA, 2], dtype='int32[pyarrow]') > 0
Out[25]: 
0    True
1    <NA>
2    True
dtype: bool[pyarrow]

pstorozenko commented 1 year ago

@char101 I think that's by design and how it should work. The problem with numpy arrays is that they cannot differentiate between NA and NaN for floats, and have no NA at all for ints (int arrays with missing values get converted to float). With the arrow backend, we can finally differentiate between NA and NaN for floats, and get a 'proper' NA for other types, like ints here. You don't know whether NA is greater or smaller than 0, so you get NA in the result.
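
A small illustration (sketch): the <NA> propagates through the comparison, and you decide explicitly how to treat it when filtering:

import pandas as pd

s = pd.Series([1, pd.NA, 2], dtype="int64[pyarrow]")
mask = s > 0                  # the missing row compares to <NA>, i.e. "unknown"
print(s[mask.fillna(False)])  # treat "unknown" as "does not match" when filtering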

char101 commented 1 year ago

@pstorozenko Thanks, I understand. The reason I thought it was a bug is that the first <NA> in the result of diff() crashed Python, as I wrote in https://github.com/pandas-dev/pandas/issues/52122. The difference is that it crashed at 517 elements rather than 145000 as in this issue.

Edit: Adding values.combine_chunks(), as in phofl's pull request, fixes it. I was using the nightly pandas wheel and thought the pull request had already been merged, but when I checked again it was still open.