kylebarron / arro3

A minimal Python library for Apache Arrow, connecting to the Rust arrow crate
https://kylebarron.dev/arro3
Apache License 2.0

Feature requests #195

Open martindurant opened 3 hours ago

martindurant commented 3 hours ago

It would be great if:

  • you could select columns for reading from parquet, or, even better, select from the schema hierarchy in general for deeper structured datasets
  • you allow reading row-group X from a parquet dataset; this would allow for distributing the work to threads or even a cluster. Of course, the reader would need to reveal how many row-groups it contains
  • some to_buffers kind of method exists to expose the internal buffers of an arrow structure, in the order defined in the arrow docs; also the corresponding from_buffers

Doing all of this would essentially answer what is envisaged in https://github.com/dask/fastparquet/pull/931: getting what we really need out of arrow without the cruft. It would interoperate nicely with awkward, for example.

Other nice to haves (and I realise you wish to keep the scope as small as possible):

  • str and dt compute functions
  • parquet filter

kylebarron commented 2 hours ago
  • you could select columns for reading from parquet, or, even better, select from the schema hierarchy in general for deeper structured datasets

This is relatively easy to do; you just have to map input into a ProjectionMask.
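
Roughly, the Rust side could look like this (a sketch against the parquet crate, assuming the selection is a list of top-level column names; the file path and column names are placeholders, not arro3 API):

```rust
use std::fs::File;

use parquet::arrow::{arrow_reader::ParquetRecordBatchReaderBuilder, ProjectionMask};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Map the requested top-level column names to indices in the Arrow schema,
    // then build a ProjectionMask over the corresponding parquet root columns.
    let wanted = ["a", "b"];
    let indices: Vec<usize> = builder
        .schema()
        .fields()
        .iter()
        .enumerate()
        .filter(|(_, field)| wanted.contains(&field.name().as_str()))
        .map(|(i, _)| i)
        .collect();
    let mask = ProjectionMask::roots(builder.parquet_schema(), indices);

    // Only the selected columns are decoded.
    let reader = builder.with_projection(mask).build()?;
    for batch in reader {
        println!("{:?}", batch?.schema());
    }
    Ok(())
}
```

For selecting deeper into nested datasets, ProjectionMask::leaves takes indices of individual leaf columns rather than top-level roots.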

  • you allow reading row-group X from a parquet dataset; this would allow for distributing the work to threads or even a cluster. Of course, the reader would need to reveal how many row-groups it contains

I've done this for other bindings (e.g. https://geoarrow.org/geoarrow-rs/python/latest/api/io/functions/#geoarrow.rust.io.ParquetFile and https://kylebarron.dev/parquet-wasm/classes/bundler_parquet_wasm.ParquetFile.html). I'd say it's in scope but I'm not sure of a good way to expose both sync and async methods from a single class.
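
The parquet crate already exposes the pieces needed for this; a minimal sketch (the file path and row-group index are placeholders):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The file metadata reveals how many row groups there are, so a caller can
    // fan the work out, e.g. one row group per thread or per cluster task.
    let num_row_groups = builder.metadata().num_row_groups();
    println!("{num_row_groups} row groups");

    // Read only row group 0.
    let reader = builder.with_row_groups(vec![0]).build()?;
    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}
```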

  • some to_buffers kind of method exists to expose the internal buffers of an arrow structure, in the order defined in the arrow docs; also the corresponding from_buffers

This should be relatively easy. I already started prototyping this in https://github.com/kylebarron/arro3/pull/156, which adds a Buffer class that implements the Python buffer protocol. So to_buffers would export a list of arrow buffers according to the spec. I think ArrayData::buffers should export the buffers in the order expected by the C Data Interface, so we should be able to reuse that.
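
As a rough sketch of what the arrow crate already provides on the Rust side (not the arro3 API), ArrayData gives access to the underlying buffers:

```rust
use arrow::array::{Array, Int32Array};

fn main() {
    // A small array with a validity bitmap.
    let array = Int32Array::from(vec![Some(1), None, Some(3)]);

    // ArrayData is the layout-level view of an Arrow array.
    let data = array.to_data();

    // The validity bitmap (if any) is held separately from the value buffers.
    if let Some(nulls) = data.nulls() {
        println!("validity bitmap covers {} values", nulls.len());
    }

    // The remaining buffers (offsets/values per the columnar spec); a Python-facing
    // to_buffers could wrap each of these in an object implementing the buffer protocol.
    for (i, buf) in data.buffers().iter().enumerate() {
        println!("buffer {i}: {} bytes", buf.len());
    }
}
```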

Other nice to haves (and I realise you wish to keep the scope as small as possible)

The goal isn't to have the scope strictly as small as possible, but rather to have minimal new code here. E.g. if the functionality already exists as a compute function in arrow::compute, then it should be easy to bind for export here, and thus is in scope.

The core module (arro3.core) should stay small, but e.g. the arro3.compute module can bring in more of the underlying Rust compute functions.

  • str and dt compute functions

dt compute functions are pretty easy to implement; see date_part. substring and regexp kernels also exist for the str side.
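
For example, calling those kernels directly from the arrow crate (a sketch with hard-coded inputs):

```rust
use arrow::array::{Date32Array, StringArray};
use arrow::compute::kernels::substring::substring;
use arrow::compute::kernels::temporal::{date_part, DatePart};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Extract the year from a Date32 column (days since the UNIX epoch).
    let dates = Date32Array::from(vec![Some(0), Some(19723), None]);
    let years = date_part(&dates, DatePart::Year)?;
    println!("{years:?}");

    // Take the first three characters of each string value.
    let strings = StringArray::from(vec![Some("arrow"), None, Some("arro3")]);
    let prefixes = substring(&strings, 0, Some(3))?;
    println!("{prefixes:?}");

    Ok(())
}
```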

  • parquet filter

This is a bit more complex because we'd need more work on the binding side to be able to express the parquet filter to pass to Rust. Let's work on the other topics first and come back to this.
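
For reference, the pure Rust machinery already exists in the parquet crate as RowFilter / ArrowPredicateFn; the open question is how to express the predicate from Python. A sketch under placeholder assumptions (file path, leaf column index, and an Int32 "greater than 10" predicate):

```rust
use std::fs::File;

use arrow::array::Int32Array;
use arrow::compute::kernels::cmp::gt;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // A predicate only sees the columns named in its ProjectionMask; here,
    // keep rows where leaf column 0 (assumed Int32) is greater than 10.
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        gt(batch.column(0), &Int32Array::new_scalar(10))
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;
    for batch in reader {
        println!("{} rows passed the filter", batch?.num_rows());
    }
    Ok(())
}
```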


PRs for these are welcome if you're interested, preferably one at a time.