pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.7k stars, 17.92k forks

Consider moving pyarrow's pandas compatibility and conversion code to the pandas project? #59780

Open jorisvandenbossche opened 1 month ago

jorisvandenbossche commented 1 month ago

This issue is to discuss the idea of moving a significant part of the pandas conversion and compatibility code that currently lives in pyarrow into the pandas project itself. We would of course keep all low-level, array-level conversions (e.g. everything that lives in pyarrow C++) in pyarrow itself, and pandas would use those, but I think that a large part of what lives in pyarrow/pandas_compat.py could live in pandas.

Some reasons to do this:

A potential downside is that it makes the dependency structure even more complex (pyarrow's to_pandas() relying on pandas relying on pyarrow), although pyarrow already has infrastructure set up to lazily import pandas today.
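To make the lazy-import point concrete, here is a minimal sketch of the general technique of deferring an import until first use. It is not pyarrow's actual infrastructure: the `LazyModule` class is a hypothetical helper, and the standard-library `json` module stands in for pandas so that the snippet runs without third-party packages.

```python
import importlib


class LazyModule:
    """Hypothetical helper: defer importing a module until it is first used."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def _load(self):
        # Import on first call only; later calls reuse the cached module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return self._module

    @property
    def loaded(self):
        return self._module is not None

    def __getattr__(self, attr):
        # Only reached for names not found on the wrapper itself,
        # i.e. members of the wrapped module.
        return getattr(self._load(), attr)


# Assumption: "json" plays the role that pandas would play in pyarrow.
lazy_json = LazyModule("json")
was_loaded_before = lazy_json.loaded
encoded = lazy_json.dumps({"a": 1})  # first attribute access triggers the import
was_loaded_after = lazy_json.loaded
```

With a wrapper like this, a circular dependency between two packages only materializes when a conversion is actually requested, not at import time.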

The idea is not to change any public pyarrow API that supports pandas (ingesting pandas objects in various pyarrow constructors, to_pandas() methods on objects, etc.), but that, at least at the DataFrame and Series level, pyarrow would rely under the hood on a method from pandas to do the conversion. For example, I think the following could live in pandas itself: most of the handling of the "pandas metadata" (which guarantees a better pandas <-> arrow roundtrip), the code that converts column labels to strings and reconstructs an Index in the other direction, the logic that determines which columns should be converted to an extension dtype vs. a numpy dtype, etc.
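As a toy illustration of the label roundtrip described above, the sketch below stringifies column labels and stores enough JSON metadata to rebuild them. It is not the real pandas_compat.py logic or pyarrow's actual metadata format: `encode_labels`, `decode_labels`, and the `"label_type"` key are hypothetical names invented for this example.

```python
import json


def encode_labels(columns):
    """Stringify column labels and record how to restore their types."""
    names = [str(label) for label in columns]
    metadata = {
        "columns": [
            {"field_name": name, "label_type": type(label).__name__}
            for name, label in zip(names, columns)
        ]
    }
    return names, json.dumps(metadata)


def decode_labels(names, metadata_json):
    """Rebuild the original labels from string names plus metadata."""
    casts = {"int": int, "float": float, "str": str}
    metadata = json.loads(metadata_json)
    by_name = {c["field_name"]: c["label_type"] for c in metadata["columns"]}
    return [casts[by_name[name]](name) for name in names]


original = [0, 1.5, "name"]
names, meta = encode_labels(original)   # names are all strings
restored = decode_labels(names, meta)   # labels with their original types
```

This sketch ignores cases a real implementation has to handle, such as labels that collide once stringified (for example `1` and `"1"`), MultiIndex columns, and non-scalar labels.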

There are of course a lot of details to figure out, but I wanted to open the issue already to get a general idea of what people think about this, and whether we want to maintain this in pandas.

Equivalent issue on the pyarrow side: https://github.com/apache/arrow/issues/44068

jorisvandenbossche commented 1 month ago

Two additional notes: