pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.56k stars 17.56k forks

ENH: Pandas Tensor Data Type #59006

Open bionicles opened 2 weeks ago

bionicles commented 2 weeks ago

Feature Type

Problem Description

While reviewing the Arrow docs link from @WillAyd, I spotted this:

https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor

Tensor is exactly what I was describing in Additional Context [1]. It would give pandas users a column dtype for big blocks of some underlying type.

Feature Description

Support Arrow Tensor in Pandas

Python https://arrow.apache.org/docs/python/generated/pyarrow.Tensor.html#pyarrow.Tensor

Rust https://github.com/apache/arrow-rs/blob/3715d5447e468a5a4dc631ae9aafec706c57aa20/arrow/src/tensor.rs#L115

Alternative Solutions

Just make everything an "object":

>>> import numpy as np
>>> import pandas as pd
>>> x = {'hello': 'world'}
>>> y = np.ones(3)
>>> df = pd.DataFrame({'X': [x], 'Y': [y]})
>>> df
                    X                Y
0  {'hello': 'world'}  [1.0, 1.0, 1.0]
>>> df.dtypes
X    object
Y    object
dtype: object

Additional Context

[1] https://github.com/pandas-dev/pandas/pull/58455#issuecomment-2161603939 onward

mroeschke commented 2 weeks ago

cc @jbrockmendel: if pyarrow plans to support its compute functions for pyarrow.Tensor, this may be the appropriate 2D EA block backing for ArrowExtensionArray instead of pyarrow.Table.

WillAyd commented 1 week ago

I think the nullability bitmap for the extension array only applies to the entire datum itself, not to individual records within each struct