pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Support for Arrow Extension types #9112

Open kylebarron opened 1 year ago

kylebarron commented 1 year ago

Problem description

(Creating this as a stub after closing https://github.com/pola-rs/polars/issues/4014)

Now that the Array data type has been implemented, there may be interest in supporting Arrow extension types, which allow storing other pieces of metadata on the column. Some examples:

spenczar commented 1 year ago

I am very interested in FixedShapeTensor support. In case it helps guide implementation, I'll explain more about how I use them, which is to store covariance matrices.

These covariance matrices are typically 6x6 in my case, but sometimes 3x3. I'm interested in getting their diagonal elements, and in selecting specific off-diagonal elements.

Nullability of individual elements is very important to me as well.
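
A minimal sketch of the access pattern described above, in plain Python so it runs without Arrow or Polars. It assumes each matrix is stored as one flat, row-major fixed-size list plus a shape, which is how the fixed shape tensor layout appears in the error message further down this thread; the helper names `element` and `diagonal` are hypothetical and not part of any library.

```python
from typing import List, Optional, Sequence, Tuple

Shape = Tuple[int, int]


def element(flat: Sequence[Optional[float]], shape: Shape, row: int, col: int) -> Optional[float]:
    """Return element (row, col) of a row-major matrix stored as a flat list."""
    n_rows, n_cols = shape
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise IndexError(f"({row}, {col}) is outside shape {shape}")
    return flat[row * n_cols + col]


def diagonal(flat: Sequence[Optional[float]], shape: Shape) -> List[Optional[float]]:
    """Return the diagonal elements; a null (None) element stays None."""
    n_rows, n_cols = shape
    return [flat[i * n_cols + i] for i in range(min(n_rows, n_cols))]


# A 3x3 covariance matrix with one unknown (null) off-diagonal pair.
cov = [1.0, 0.1, None,
       0.1, 2.0, 0.3,
       None, 0.3, 3.0]

print(diagonal(cov, (3, 3)))
print(element(cov, (3, 3), 0, 2))
```

Keeping `None` per element, rather than per matrix, is what per-element nullability amounts to in this layout.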

ritchie46 commented 1 year ago

We can store a metadata flag on the Array data-type. I'm not keen on separate code paths for extension-typed arrays; that would add a lot of complexity and code bloat.

I have to check, but I also don't think that's needed. Keeping the meta-data around should be enough.
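
One way to picture the "keep the metadata around" idea is the sketch below. It is a plain-Python illustration only: `ArrayDType` and `take_every_other` are hypothetical names, not Polars internals, and the real engine is written in Rust. The point it shows is that a generic operation can run one code path for all arrays and simply hand the dtype, metadata included, back unchanged.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ArrayDType:
    """Hypothetical fixed-size Array dtype carrying opaque extension metadata."""
    inner: str
    width: int
    metadata: Dict[str, str] = field(default_factory=dict)


def take_every_other(rows: List[List[int]], dtype: ArrayDType) -> Tuple[List[List[int]], ArrayDType]:
    """A generic row operation: it never inspects the metadata, only forwards it."""
    return rows[::2], dtype


tensor_dtype = ArrayDType(
    inner="Int64",
    width=3,
    metadata={"extension": "arrow.fixed_shape_tensor", "shape": "[3]"},
)
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

out_rows, out_dtype = take_every_other(rows, tensor_dtype)
print(out_rows)
print(out_dtype.metadata)
```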

spenczar commented 1 year ago

As of 6fbc142a0185ecdb846e7186bf077aa86b41049d, polars.from_arrow raises an exception when passed an extension array:

>>> import polars
>>> import pyarrow as pa
>>> import numpy as np
>>> tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
>>> polars.from_arrow(tensor_array)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/swnelson/code/3p/polars/py-polars/polars/convert.py", line 605, in from_arrow
    data=pl.Series._from_arrow(name, data, rechunk=rechunk),
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/swnelson/code/3p/polars/py-polars/polars/series/series.py", line 325, in _from_arrow
    return cls._from_pyseries(arrow_to_pyseries(name, values, rechunk))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/swnelson/code/3p/polars/py-polars/polars/utils/_construction.py", line 178, in arrow_to_pyseries
    pys = PySeries.from_arrow(name, array)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
exceptions.ComputeError: cannot create series from Extension("arrow.fixed_shape_tensor", FixedSizeList(Field { name: "item", data_type: Int64, is_nullable: true, metadata: {} }, 3), Some("{\"shape\":[3]}"))

The same error occurs when passing a pyarrow.Table that contains an extension array:

>>> table = pa.table([tensor_array], names=["vals"])
>>> polars.from_arrow(table)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/swnelson/code/3p/polars/py-polars/polars/convert.py", line 599, in from_arrow
    return pl.DataFrame._from_arrow(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/swnelson/code/3p/polars/py-polars/polars/dataframe/frame.py", line 612, in _from_arrow
    arrow_to_pydf(
  File "/Users/swnelson/code/3p/polars/py-polars/polars/utils/_construction.py", line 1338, in arrow_to_pydf
    pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
exceptions.ComputeError: cannot create series from Extension("arrow.fixed_shape_tensor", FixedSizeList(Field { name: "item", data_type: Int64, is_nullable: true, metadata: {} }, 3), Some("{\"shape\":[3]}"))
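
A possible stopgap, sketched under assumptions: pyarrow extension arrays expose their underlying array through the `storage` attribute, so the extension wrapper can be removed before conversion. The helper `to_storage` and the stand-in class `_FakeExtensionArray` are hypothetical; the stand-in exists only so the sketch runs without pyarrow installed. Whether Polars at the commit above accepts the resulting FixedSizeList has not been verified here.

```python
from typing import Any


def to_storage(array: Any) -> Any:
    """Return the storage array behind an extension array, else the input unchanged."""
    return getattr(array, "storage", array)


class _FakeExtensionArray:
    """Hypothetical stand-in for a pyarrow extension array (has a .storage attribute)."""

    def __init__(self, storage: Any) -> None:
        self.storage = storage


wrapped = _FakeExtensionArray([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
plain = [7, 8, 9]

print(to_storage(wrapped))  # the underlying storage
print(to_storage(plain))    # non-extension input passes through
```

With pyarrow available, the call would look like `polars.from_arrow(to_storage(tensor_array))`. Note that this discards the extension name and the serialized shape metadata, so it does not round-trip.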

kylebarron commented 1 year ago

Just a note that the geospatial types would need to have metadata on both fixed-size and variable-size arrays, as a series of Polygons or LineStrings would only have the Array type as its innermost type.

mschrader15 commented 11 months ago

I second the use case from https://github.com/pola-rs/polars/issues/9112#issuecomment-1567561391

kylebarron commented 7 months ago

Are there any thoughts about where the extension metadata should be stored in Polars? In Arrow, extension metadata is stored on the field (though arrow2, and presumably also polars-arrow, have a DataType::Extension() variant which can also store extension metadata). In that case the metadata would be repeated for each chunk of a ChunkedArray/Series (though potentially under an Arc?).

Would it be preferred to store extension metadata on the ChunkedArray or on the Series? If it's stored on ChunkedArray, it would presumably have to be added in more places in the code. I don't know whether adding it onto the Series itself would make sense, because you couldn't use it from the typed representation.

- pub struct Series(pub Arc<dyn SeriesTrait>);
+ pub struct Series(pub Arc<dyn SeriesTrait>, SeriesMetadata);

My motivation for extension types is to support extended logical types in pyo3-polars. It looks like those APIs always accept and return a Series, and so either approach would work for that.
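
To make the per-chunk versus per-Series trade-off above concrete, here is a rough Python sketch of the Series-level option. `Chunk` and `SeriesSketch` are hypothetical illustration types, not Polars structures. The two metadata keys are the ones the Arrow format uses on a field to mark an extension type; everything else is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Field-level custom metadata keys defined by the Arrow columnar format.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


@dataclass
class Chunk:
    """One contiguous block of values; carries no extension metadata itself."""
    values: List[List[int]]


@dataclass
class SeriesSketch:
    """Hypothetical Series wrapper holding the extension metadata exactly once."""
    name: str
    chunks: List[Chunk]
    field_metadata: Dict[str, str] = field(default_factory=dict)

    def append(self, other: "SeriesSketch") -> "SeriesSketch":
        """Concatenate chunks; refuse to mix different extension metadata."""
        if other.field_metadata != self.field_metadata:
            raise ValueError("cannot append series with different extension metadata")
        return SeriesSketch(self.name, self.chunks + other.chunks, self.field_metadata)


meta = {EXT_NAME_KEY: "arrow.fixed_shape_tensor", EXT_META_KEY: '{"shape":[3]}'}
a = SeriesSketch("vals", [Chunk([[1, 2, 3]])], meta)
b = SeriesSketch("vals", [Chunk([[4, 5, 6]])], meta)

combined = a.append(b)
print(len(combined.chunks))
print(combined.field_metadata[EXT_NAME_KEY])
```

The cost of this layout is the one raised in the comment: code that works on the typed chunks alone never sees `field_metadata`.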

rok commented 4 months ago

Arrow's growing canonical extension list (fixed shape tensor, variable shape tensor, UUID, JSON) would be another argument for adding extension types.

adienes commented 2 months ago

The lack of support (in particular, unknown extensions are rejected rather than ignored) is making multi-language workflows more difficult. See, for example, https://github.com/apache/arrow-julia/issues/508
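
The "ignore unknown extensions" behaviour asked for here could look roughly like the sketch below. It is an assumption about one possible policy, not a description of what Polars or arrow-julia do: `plan_import` and its return shape are hypothetical. An unrecognised extension falls back to its storage type, and the original field metadata is kept so it can be written back out unchanged.

```python
from typing import Dict, FrozenSet

EXT_NAME_KEY = "ARROW:extension:name"  # field metadata key from the Arrow format


def plan_import(storage_type: str, field_metadata: Dict[str, str], known: FrozenSet[str]) -> Dict[str, object]:
    """Decide how to load a column: as a known extension, or as plain storage."""
    name = field_metadata.get(EXT_NAME_KEY)
    if name in known:
        return {"dtype": name, "storage": storage_type, "passthrough": {}}
    # Unknown or absent extension: load the storage type, keep metadata for round-tripping.
    return {"dtype": storage_type, "storage": storage_type, "passthrough": dict(field_metadata)}


known_extensions = frozenset({"arrow.fixed_shape_tensor"})

tensor_plan = plan_import(
    "FixedSizeList(Int64, 3)",
    {EXT_NAME_KEY: "arrow.fixed_shape_tensor"},
    known_extensions,
)
unknown_plan = plan_import(
    "Utf8",
    {EXT_NAME_KEY: "example.unknown_type"},  # hypothetical extension name
    known_extensions,
)

print(tensor_plan["dtype"])
print(unknown_plan["dtype"], unknown_plan["passthrough"])
```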

AlexanderNenninger commented 1 month ago

I recently wrote an issue to support such use cases: https://github.com/pola-rs/polars/issues/15689. The idea is for Polars to provide very basic infrastructure and let the rest be handled by extensions and plugins.