pstorozenko closed this issue 1 year ago.
I have found what I believe is the same bug.
Tested it in 2.0 RC1 and in master branch 2.1.0.dev0+265.gd8d1a474c7
Using bug.csv.gz, a CSV file with a single string column:
```python
import pandas as pd

df = pd.read_csv("bug.csv", dtype={"data_validity_comment": "string[pyarrow]"})
outside = df[df["data_validity_comment"] == "Outside typical range"]
# python crashes
```
The DataFrame shape is (145000, 1). It stops crashing if I cut it at exactly 143116 rows:
```python
import pandas as pd

df = pd.read_csv("bug.csv", dtype={"data_validity_comment": "string[pyarrow]"})
df = df[0:143116].copy()
outside = df[df["data_validity_comment"] == "Outside typical range"]
```
This is also the simplest example I could come up with...
cc @mroeschke
Looks like this is coming from `pc.replace_with_mask(values, pa.array(mask), replacements)` in `_replace_with_mask`. There is a comment that this caused segfaults for earlier versions, e.g. pyarrow < 8. Looks like this is not completely solved. Thoughts here?
The following reproduces for me:
```python
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

x = pa.scalar(False, type=pa.bool_(), from_pandas=True)
arr = pa.chunked_array([np.array([True] * 5)])
mask = pa.array([False] * 5)
pc.replace_with_mask(arr, mask, pa.array([False] * 5))
```
Calling `arr.combine_chunks()` before passing to `pc.replace_with_mask` seems to work. Probably a good idea to report upstream, since this is happening with the latest version of pyarrow.
Unfortunately, I have never used pyarrow directly; could you report it, since you know much better what is actually bugged here?
I opened https://github.com/apache/arrow/issues/34634 upstream in the arrow repo.
I also experienced this: a pyarrow-backed Series contains `<NA>` in the comparison result.
```python
>>> pd.Series([1, pd.NA, 2]) > 0
0     True
1    False
2     True
dtype: bool

>>> pd.Series([1, pd.NA, 2], dtype='int32[pyarrow]') > 0
0     True
1     <NA>
2     True
dtype: bool[pyarrow]
```
@char101 I think that's by design and that's how it should work. The problem with numpy arrays is that they cannot differentiate between NA and NaN for floats, and have no NA at all for ints (so int arrays with NA are converted to floats, as in your case). With the arrow backend we can finally differentiate between NA and NaN for floats, and introduce a 'proper' NA for other types, like here for ints. You don't know whether NA is greater or smaller than 0, so you get NA in the result.
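A small illustration of this NA propagation, shown here with pandas' nullable `Int32` dtype for portability; `int32[pyarrow]` behaves the same way:

```python
import pandas as pd

# NA propagates through comparisons on nullable dtypes: NA > 0 is
# unknown, so the corresponding mask entry stays NA.
s = pd.Series([1, pd.NA, 2], dtype="Int32")
mask = s > 0  # [True, <NA>, True]

# For filtering, state explicitly how NA should be treated:
filtered = s[mask.fillna(False)]
```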
@pstorozenko Thanks, I understand. The reason I thought it was a bug is that the first `<NA>` from the result of `diff()` crashed Python, as I wrote in https://github.com/pandas-dev/pandas/issues/52122. The difference is that it crashed at 517 elements rather than 145000 as in this issue.
Edit: Adding `values.combine_chunks()` as in the pull request by phofl fixes it. I was using the nightly wheel of pandas and thought that the pull request had already been merged, but when I checked again it was still open.
This bug is specific to pandas 2.0; on 1.5.3 (with numpy as the dtype backend) everything works.
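For contrast, a sketch on toy data of the same filter with the default numpy-backed dtype, which works on both 1.5.3 and 2.0:

```python
import pandas as pd

# Toy stand-in for bug.csv: same filter, default (numpy object) dtype.
df = pd.DataFrame({"data_validity_comment": ["Outside typical range", "ok"]})
outside = df[df["data_validity_comment"] == "Outside typical range"]
```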
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Using the parquet file from wiki100.zip.
Issue Description
After running the script I get:
Expected Behavior
I should get the result of this query:
Installed Versions
Context
The file is a subset of wiki clickstream data.
This code works well if I don't set `dtype_backend="pyarrow"`. Tested on both 2.0.0rc1 and nightly. The operation `wiki['curr'] == "Warsaw"` executes correctly, so it's an issue with filtering on a boolean array. Sorry for not providing a simpler example; all the handcrafted ones worked every time.