rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] `pyarrow.table` does not accept `cudf.pandas.DataFrame` #14521

Open galipremsagar opened 10 months ago

galipremsagar commented 10 months ago

Background

Most of pyarrow is implemented in Cython. They have a lazy-loaded pandas shim which they use to interoperate with pandas; it is implemented as the _PandasAPIShim cdef class. There is a singleton shim object that is accessible as pyarrow.lib._pandas_api from Python (and as both pyarrow.lib._pandas_api and pyarrow.lib.pandas_api from Cython):

In [1]: import pyarrow
In [2]: pyarrow.lib.pandas_api
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 pyarrow.lib.pandas_api

AttributeError: module 'pyarrow.lib' has no attribute 'pandas_api'

In [3]: pyarrow.lib._pandas_api
Out[3]: <pyarrow.lib._PandasAPIShim at 0x7f3838413c10>

So pyarrow creates this API shim at import time, and the shim lazily initialises itself on first use.

This object saves a number of things:

  1. The pandas module it imported (by doing import pandas; self.pandas = pandas)
  2. Various type constructors (e.g. self.data_frame = pandas.DataFrame)

The first is relatively unproblematic. What we would like is for that module to be our intercepted wrapped module, which we can arrange with a little bit of rejigging of imports in cudf. It is the memoisation of the type constructors that is the problematic thing.
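
To make the problem concrete, here is a minimal sketch of such a memoising shim (illustrative only; this is not pyarrow's actual _PandasAPIShim implementation):

# Minimal sketch of a memoising pandas shim (illustrative; not pyarrow's
# actual _PandasAPIShim implementation).
class PandasAPIShim:
    def __init__(self):
        self._initialized = False

    def _check_import(self):
        if not self._initialized:
            import pandas
            self.pandas = pandas                # the module itself (fine)
            self.data_frame = pandas.DataFrame  # memoised constructor (problematic)
            self._initialized = True

    def is_data_frame(self, obj):
        self._check_import()
        # Compares against whatever pandas.DataFrame resolved to at first
        # use; later changes to the pandas module are invisible here.
        return isinstance(obj, self.data_frame)

Once is_data_frame has run once, the shim's answer is fixed for the lifetime of the process, no matter what pandas.DataFrame means at later call sites.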

cudf.pandas wrapping scheme

Recall that the way our wrapping scheme works is that we deliver wrapped modules and decide at __getattr__ time whether each attribute lookup delivers a real or a wrapped attribute. So:

%load_ext cudf.pandas

import pandas as pd # pd is _always_ a wrapped module

DataFrame = pd.DataFrame # context-dependently, either a real or a wrapped constructor

This works well, as long as someone doesn't memoise an attribute lookup. If they do, we only get to make the decision about what type of attribute to deliver once:

In [1]: %load_ext cudf.pandas

In [3]: from cudf.pandas.module_finder import disable_transparent_mode_if_enabled

In [4]: with disable_transparent_mode_if_enabled():
   ...:     from pandas import DataFrame
   ...: 

In [5]: DataFrame
Out[5]: pandas.core.frame.DataFrame

In [6]: from pandas import DataFrame

In [7]: DataFrame
Out[7]: cudf.pandas._wrappers.pandas.DataFrame

Unfortunately, pyarrow's pandas shim does exactly this. And we can't make the right decision, because sometimes (when used inside cudf) we need to deliver real objects, other times (when the user is using pyarrow) we need to deliver wrapped ones.
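
For concreteness, the decision has to be re-made on every lookup, along the lines of this simplified sketch (the caller check is a hypothetical stand-in; the real cudf.pandas machinery is more involved):

# Simplified sketch of per-lookup delivery via a module-level __getattr__
# (PEP 562). _lookup_from_cudf is a hypothetical stand-in for the real
# cudf.pandas context check.
import sys
import pandas as _real_pandas

def _lookup_from_cudf() -> bool:
    # Peek at the module doing the attribute access, two frames up
    # (above this helper and above __getattr__ itself).
    caller = sys._getframe(2).f_globals.get("__name__", "")
    return caller.startswith("cudf")

def _wrap(attr):
    # Stand-in for cudf.pandas's proxy construction.
    return attr

def __getattr__(name):
    attr = getattr(_real_pandas, name)
    if _lookup_from_cudf():
        return attr      # cudf internals get the real pandas attribute
    return _wrap(attr)   # user code gets the wrapped proxy

A memoising consumer like pyarrow's shim calls __getattr__ exactly once and caches the result, so whichever branch was taken first wins forever.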

I said it would be a miracle if @shwina's approach worked, and it kind of does, but unfortunately it's not quite miraculous enough. Here's what's going on:

pa.table uses pa.lib._pandas_api.is_data_frame to determine if the passed object is a pandas DataFrame. The leading underscore here is crucial! This is a Python object that we can replace. Similarly, pa.Table.from_pandas calls out to some Python code that uses pa.lib._pandas_api (which we can control).

However, the to_pandas method on the resulting object calls pyarrow.lib.pandas_api.data_frame (note: no leading underscore). This is a Cython-level module attribute that we can't replace from Python.

So, we have this:

%load_ext cudf.pandas

import pyarrow as pa
pa.lib._pandas_api = pa.lib._PandasAPIShim()
pa.lib.pandas_api = pa.lib._pandas_api # this doesn't do anything at the Cython level

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
tab = pa.table(df) # works
df2 = tab.to_pandas()

print(type(df))
# <class 'cudf.pandas._wrappers.pandas.DataFrame'>
print(type(df2))
# <class 'pandas.core.frame.DataFrame'>

What if we initialise the Cython-level object with wrapped attributes?

This seems like it might work: we have to be a bit careful about how we're importing pyarrow in cudf, but we can arrange things so that pa.lib.pandas_api is a shim wrapper that sees our wrapped attributes.

But the memoisation breaks things, because inside cudf we use to_arrow().to_pandas() in various places to produce an honest-to-goodness pandas object, and this would now produce a wrapped object (wrapping things in a disable_transparent_mode_if_enabled() context manager won't help, because the decision about which constructor to deliver has already been made).

Options

  1. Convince the arrow folks that the memoisation of attribute lookup in their pandas shim is not really a performance win, and that it would be convenient if they just used self.pandas.DataFrame. This would allow us to (context-dependently) provide a real or wrapped object as appropriate. We would likely do this after releasing cudf-PAM, since the motivation would then be clear.
  2. Rather than letting module attribute lookup be context dependent, always deliver wrapped types such that the constructor context-dependently decides whether to deliver a real or a wrapped object (see the sketch after this list).
  3. Something else?
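
As a rough sketch of option 2 (names are illustrative, and the context check is a hypothetical stand-in), the constructor itself would branch:

# Rough sketch of option 2: the module always hands out a single wrapped
# type, and its constructor decides which concrete object to build.
import pandas as _pd

def _use_real_pandas() -> bool:
    return False  # hypothetical stand-in for "are we inside cudf right now?"

class _FastDataFrame(_pd.DataFrame):
    """Stand-in for the GPU-backed proxy type."""

class DataFrame:
    def __new__(cls, *args, **kwargs):
        if _use_real_pandas():
            return _pd.DataFrame(*args, **kwargs)  # real pandas object
        return _FastDataFrame(*args, **kwargs)     # wrapped/fast object

Note that neither returned object is an instance of this DataFrame class, which is exactly the isinstance problem raised in the discussion below.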

Steps/Code to reproduce bug

In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'a': [1, 2, 3]})

In [4]: import pyarrow as pa

In [5]: pa.table(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 pa.table(df)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/table.pxi:5165, in pyarrow.lib.table()

TypeError: Expected pandas DataFrame, python dictionary or list of arrays
wence- commented 10 months ago

Plausibly, at the cost of making the to_arrow functions in cudf aware of things, we could do this by unwrapping the wrapped result of to_arrow().to_pandas() (replacing it with to_arrow().to_pandas()._fsproxy_slow).
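
Roughly (sketch only; to_pandas_unwrapped is a hypothetical helper, and _fsproxy_slow is the proxy attribute holding the underlying real pandas object):

# Sketch of the unwrapping idea (hypothetical helper, not existing cudf code).
def to_pandas_unwrapped(cudf_obj):
    df = cudf_obj.to_arrow().to_pandas()
    # If cudf.pandas wrapped the result, reach through the proxy to the
    # real pandas object; otherwise return the result unchanged.
    return getattr(df, "_fsproxy_slow", df)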

shwina commented 10 months ago

> Plausibly, at the cost of making the to_arrow functions in cudf aware of things

If this means making cudf aware of cudf.pandas proxy types, I'm strongly -1 to this. I'd really like to keep the separation of concerns here.

> Rather than letting module attribute lookup be context dependent, always deliver wrapped types such that the constructor context-dependently decides whether to deliver a real or a wrapped object.

This would likely break a bunch of things, but specifically it would break the following:

# somewhere inside pandas code:
isinstance(df, pd.DataFrame)  # returns False

> Something else?

Maybe https://github.com/apache/arrow/issues/38325?

wence- commented 10 months ago

>> Plausibly, at the cost of making the to_arrow functions in cudf aware of things

> If this means making cudf aware of cudf.pandas proxy types, I'm strongly -1 to this. I'd really like to keep the separation of concerns here.

Yeah, I don't want that either.

>> Rather than letting module attribute lookup be context dependent, always deliver wrapped types such that the constructor context-dependently decides whether to deliver a real or a wrapped object.

> This would likely break a bunch of things, but specifically it would break the following:
>
> # somewhere inside pandas code:
> isinstance(df, pd.DataFrame)  # returns False

You could dodge that bullet by continuing to also maintain the module getattr handling, perhaps allowlisted inside the pandas codebase, I think.

>> Something else?

> Maybe apache/arrow#38325?

Does that help? That seems to be an interface for objects consuming arrow-like things and delivering "something". I guess that if, as a result, rather than doing arrow_table.to_pandas() we instead did pandas.from_arrow(arrow_table) (like cudf.from_arrow), then we'd be in business: we could just call pandas.from_arrow(...) inside cudf, and that would also become the blessed way to get from arrow to pandas externally. Then we'd need no change, because our existing denylisting on module attribute lookup would do the right thing.
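
Hypothetically, if pandas grew such an entry point (pandas has no from_arrow today; the stand-in below just mirrors cudf.from_arrow), the flow might look like:

# Hypothetical pandas.from_arrow entry point (no such function exists in
# pandas today); sketched here to mirror cudf.from_arrow.
import pyarrow as pa

def from_arrow(table: pa.Table):
    """Stand-in for a hypothetical pandas.from_arrow."""
    return table.to_pandas()

tab = pa.table({"a": [1, 2, 3]})
df = from_arrow(tab)
# Under cudf.pandas, the lookup of pandas.from_arrow would go through the
# wrapped module's __getattr__, so the existing denylist machinery could
# deliver the right (real or wrapped) function at each call site.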

shwina commented 10 months ago

> You could dodge that bullet by continuing to also maintain the module getattr handling, perhaps allowlisted inside the pandas codebase, I think.

Yeah that might do it...

rjzamora commented 10 months ago

While I was discussing some current dask + cudf.pandas problems with @shwina, he pointed me to this issue. Thanks for pushing on this!

Unfortunately, I cannot suggest any new ideas here, but I do want to add a bit more Dask-related motivation:

Both dask.dataframe and dask_expr currently use pa.Table.to_pandas() in several critical places, most importantly read_parquet and "p2p" shuffling. This essentially means that cudf.pandas will run into problems for any Parquet-based Dask workflow. It also means that any "p2p"-based shuffle (used by default for sorting and merging when a distributed client is active) will produce "real" pd.DataFrame objects.

I know that Ashwin is already aware of these problems, but I wanted to make sure others had the same context.