pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.57k stars 17.56k forks

ENH: Back pd.BooleanArray with nanoarrow #59115

Open WillAyd opened 4 days ago

WillAyd commented 4 days ago

Feature Type

Problem Description

The existing pd.arrays.BooleanArray serves a good purpose in allowing True/False with missing values, but the current implementation is horribly inefficient. Coming from the historical NumPy perspective, the implementation uses twice as much memory. Compared to PyArrow, the memory usage is 8x as much and computational algorithms can be up to 64x slower.
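To make the 2x and 8x memory figures concrete, here is a minimal sketch that mimics the two layouts with plain NumPy. It does not call into pandas, PyArrow, or nanoarrow; the helper names `bytemask_nbytes` and `bitmask_nbytes` are made up for illustration, and the bit-packed layout is only an approximation of an Arrow boolean array (one data bit plus one validity bit per element, ignoring any buffer padding).

```python
import numpy as np


def bytemask_nbytes(n: int) -> int:
    """Bytes used by a byte-per-value layout: one bool array of data
    plus one bool array marking missing values."""
    values = np.zeros(n, dtype=bool)
    mask = np.zeros(n, dtype=bool)
    return values.nbytes + mask.nbytes


def bitmask_nbytes(n: int) -> int:
    """Bytes used by a bit-per-value layout: data and validity are each
    packed 8 elements to a byte (assumed Arrow-like, least significant bit first)."""
    values = np.packbits(np.zeros(n, dtype=bool), bitorder="little")
    validity = np.packbits(np.ones(n, dtype=bool), bitorder="little")
    return values.nbytes + validity.nbytes


n = 1_000_000
plain_numpy = np.zeros(n, dtype=bool).nbytes  # no missing-value support
byte_layout = bytemask_nbytes(n)
bit_layout = bitmask_nbytes(n)

print(f"plain NumPy bool:    {plain_numpy:>9,} bytes")
print(f"data + bytemask:     {byte_layout:>9,} bytes")
print(f"data + bitmask:      {bit_layout:>9,} bytes")
```

Under these assumptions the byte-per-value layout needs two bytes per element and the bit-packed layout needs two bits per element, which is where the 2x and 8x ratios come from.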

Feature Description

The pd.arrays.BooleanArray could use nanoarrow behind the scenes for its implementation, rather than the existing NumPy approach.

I think the main technical challenges for this would be:

  1. Build system integration. nanoarrow is already available in the Meson WrapDB and progress is underway with nanobind; probably worth waiting for the latter, but once complete this is less of a concern.
  2. 2D support, if ever needed. You could try to simulate 2D indexing operations with a bitmask, but something like transposition (which is trivial with a bytemask) is a concept that does not translate well when moving from bytes to bits. I don't know that this is a huge issue since the existing BooleanArray does not support 2D, but @jbrockmendel probably knows best on any plans for that.
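A rough illustration of the transposition point, again using only NumPy as a stand-in rather than nanoarrow itself: with a bytemask, a transpose is just a strided view over the same buffer, whereas a bit-packed mask has no per-element stride, so the naive approach is to unpack, transpose, and repack. The helpers `pack`, `unpack`, and `transpose_packed` are hypothetical names for this sketch, and row-major, least-significant-bit-first packing is an assumption.

```python
import numpy as np


def pack(mask2d: np.ndarray) -> np.ndarray:
    """Pack a 2D bool mask into bits in row-major order."""
    return np.packbits(mask2d.ravel(), bitorder="little")


def unpack(packed: np.ndarray, shape: tuple) -> np.ndarray:
    """Recover the 2D bool mask from its packed bits."""
    n = shape[0] * shape[1]
    flat = np.unpackbits(packed, count=n, bitorder="little")
    return flat.astype(bool).reshape(shape)


def transpose_packed(packed: np.ndarray, shape: tuple):
    """Transpose a bit-packed mask the naive way: unpack, transpose,
    repack. This copies every element, unlike a bytemask transpose."""
    transposed = unpack(packed, shape).T
    return pack(transposed), transposed.shape


bytemask = np.array([[True, False, True],
                     [False, False, True]])

# Bytemask: the transpose is a view, no data is moved.
view = bytemask.T
print(np.shares_memory(bytemask, view))

# Bitmask: the bits have to be physically rearranged.
packed = pack(bytemask)
packed_t, shape_t = transpose_packed(packed, bytemask.shape)
print(np.array_equal(unpack(packed_t, shape_t), bytemask.T))
```

A real implementation could presumably rearrange the bits without a full unpack, but it would still have to rewrite the buffer rather than swap strides.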

Alternative Solutions

status quo

Additional Context

No response

jbrockmendel commented 3 days ago

> You could try to simulate 2D indexing operations with a bitmask, but something like transposition

This gets handled by setting can_fast_transpose to False.

IIRC @phofl has expressed skepticism about taking on nanoarrow.