pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Arrow Binary View Types Don't Print When containing missing values #59883

Open WillAyd opened 3 weeks ago

WillAyd commented 3 weeks ago

### Reproducible Example

Note that this example produces output:

In [10]: import pandas as pd
    ...: import pyarrow as pa
    ...: 
    ...: ser = pd.Series(["foo", "longer_than_binary_prefix", ''], dtype=pd.ArrowDtype(pa.string_view()))

In [11]: ser
Out[11]: 
0                          foo
1    longer_than_binary_prefix
2                             
dtype: string_view[pyarrow]

While this does not:

In [12]: import pandas as pd
    ...: import pyarrow as pa
    ...: 
    ...: ser = pd.Series(["foo", "longer_than_binary_prefix", None], dtype=pd.ArrowDtype(pa.string_view()))

In [13]: ser
Out[13]: 

This might actually be an upstream bug with pyarrow (@jorisvandenbossche typically knows best)



### Issue Description

The Series repr prints nothing when a string_view array contains a missing value.

### Expected Behavior

The values, including the missing one, should print as they do for the example without missing values.

### Installed Versions

In [14]: pa.__version__
Out[14]: '17.0.0'

In [15]: pd.__version__
Out[15]: '2.2.3+44.g3dfa33cf2d'

jorisvandenbossche commented 3 weeks ago

Quickly checking, if I call to_string() explicitly, it does error:

In [12]: ser.to_string()
...
File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:1458, in ArrowExtensionArray.to_numpy(self, dtype, copy, na_value)
   1456     mask = data.isna()
   1457     result[mask] = na_value
-> 1458     result[~mask] = data[~mask]._pa_array.to_numpy()
   1459 return result

File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:591, in ArrowExtensionArray.__getitem__(self, item)
    589     return self.take(item)
    590 elif item.dtype.kind == "b":
--> 591     return type(self)(self._pa_array.filter(item))
    592 else:
    593     raise IndexError(
    594         "Only integers, slices and integer or "
    595         "boolean arrays are valid indices."
    596     )

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/table.pxi:959, in pyarrow.lib.ChunkedArray.filter()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/compute.py:264, in _make_generic_wrapper.<locals>.wrapper(memory_pool, options, *args, **kwargs)
    262 if args and isinstance(args[0], Expression):
    263     return Expression._call(func_name, list(args), options)
--> 264 return func.call(args, options, memory_pool)

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowNotImplementedError: Function 'array_filter' has no kernel matching input types (string_view, bool)

Essentially, the repr tries to convert the array to numpy, and that step fails because pyarrow has no 'array_filter' kernel implemented for string_view.

Some quick thoughts:

WillAyd commented 3 weeks ago

> Probably printing something like pandas.Series <exception occurred while creating the repr> would be more useful?

Makes sense for the series, but would this affect the repr when contained within a dataframe?
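For illustration only, the fallback idea quoted above could look roughly like the sketch below. Everything here is an assumption made for the example: `safe_repr` and `BrokenRepr` are hypothetical names, not pandas APIs, and the sketch says nothing about how a DataFrame containing such a column would be rendered.

```python
def safe_repr(obj, label):
    """Hypothetical helper: return repr(obj), or a placeholder if repr() raises."""
    try:
        return repr(obj)
    except Exception:
        # Placeholder wording taken from the suggestion quoted above
        return f"{label} <exception occurred while creating the repr>"


class BrokenRepr:
    """Stand-in for an object whose repr fails, like the Series in this issue."""

    def __repr__(self):
        raise NotImplementedError("no kernel matching input types")


placeholder = safe_repr(BrokenRepr(), "pandas.Series")
normal = safe_repr([1, 2, 3], "list")
```

With this approach a failing repr yields a visible placeholder rather than empty output, while objects whose repr works are unaffected.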

rhshadrach commented 2 weeks ago

@WillAyd - should the title be Arrow String View? Want to make sure I'm understanding the issue.

WillAyd commented 2 weeks ago

I don't think so - binary view is the terminology used by the Arrow specification, and it generally covers what you may be thinking of as bytes and strings:

https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

The same issue occurs with the pyarrow binary_view type as well.