Open martindurant opened 3 hours ago
- you could select columns for reading from parquet, or, even better, select from the schema hierarchy in general for deeper structured datasets
This is relatively easy to do; you just have to map input into a `ProjectionMask`.
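As a rough illustration of that mapping: the Rust `parquet` crate can build a `ProjectionMask` from leaf column indices (`ProjectionMask::leaves`), so the binding mostly needs to turn user-supplied paths into those indices. The sketch below is pure Python and entirely hypothetical: the `SCHEMA` dict, `leaf_paths`, and `select_leaves` are made-up names for illustration, not arro3 API.

```python
# Hypothetical nested schema: name -> child dict (struct) or None (leaf column).
SCHEMA = {
    "id": None,
    "pos": {"x": None, "y": None},
    "meta": {"tags": None, "source": {"name": None, "url": None}},
}


def leaf_paths(schema, prefix=()):
    """Yield dotted paths of leaf columns in depth-first schema order."""
    for name, child in schema.items():
        path = prefix + (name,)
        if child is None:
            yield ".".join(path)
        else:
            yield from leaf_paths(child, path)


def select_leaves(schema, selection):
    """Map selected paths (leaves or whole structs) to sorted leaf indices."""
    leaves = list(leaf_paths(schema))
    indices = set()
    for sel in selection:
        matched = [
            i for i, leaf in enumerate(leaves)
            if leaf == sel or leaf.startswith(sel + ".")
        ]
        if not matched:
            raise KeyError(f"no column matches {sel!r}")
        indices.update(matched)
    return sorted(indices)


# Selecting a leaf and a nested struct; the result would feed the mask.
picked = select_leaves(SCHEMA, ["id", "meta.source"])
```

Selecting a struct path pulls in all of its leaves, which is one way to cover the "select from the schema hierarchy" case without a separate API.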
- you allow reading row-group X from a parquet dataset; this would allow for distributing the work to threads or even a cluster. Of course, the reader would need to reveal how many row-groups it contains
I've done this for other bindings (e.g. https://geoarrow.org/geoarrow-rs/python/latest/api/io/functions/#geoarrow.rust.io.ParquetFile and https://kylebarron.dev/parquet-wasm/classes/bundler_parquet_wasm.ParquetFile.html). I'd say it's in scope but I'm not sure of a good way to expose both sync and async methods from a single class.
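To make the row-group idea concrete, here is a minimal sketch of how a caller might fan work out across threads once the reader reveals its row-group count. `FakeParquetReader`, `num_row_groups`, and `read_row_group` are stand-ins assumed for illustration; they are not the arro3 API, and the in-memory lists stand in for real row groups.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeParquetReader:
    """Hypothetical reader: each inner list plays the role of one row group."""

    def __init__(self, row_groups):
        self._row_groups = row_groups

    @property
    def num_row_groups(self):
        # The reader must expose this so callers can split the work.
        return len(self._row_groups)

    def read_row_group(self, i):
        # A real reader would decode row group i from the file here.
        return list(self._row_groups[i])


def read_all_parallel(reader, max_workers=4):
    """Read every row group on a thread pool, preserving row-group order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(reader.read_row_group, range(reader.num_row_groups)))


reader = FakeParquetReader([[1, 2], [3], [4, 5, 6]])
chunks = read_all_parallel(reader)
```

The same per-row-group call could be dispatched to cluster workers instead of threads, since each task only needs the file location and an index.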
- some to_buffers kind of method exists to expose the internal buffers of an arrow structure, in the order defined in the arrow docs; also the corresponding from_buffers
This should be relatively easy. I already started prototyping this in https://github.com/kylebarron/arro3/pull/156, which adds a `Buffer` class that implements the Python buffer protocol. So `to_buffers` would export a list of arrow buffers according to the spec. I think `ArrayData::buffers` should export the buffers in the order expected by the C Data Interface, so we should be able to reuse that.
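For reference, a sketch of what the round trip could look like for a nullable int32 array, whose Arrow layout is a validity bitmap followed by a data buffer. The `to_buffers` and `from_buffers` functions here are pure-Python illustrations under that assumption, not the proposed arro3 methods, and they skip the padding that the Arrow spec recommends for real buffers.

```python
import struct


def to_buffers(values):
    """Encode a list of Optional[int] as [validity bitmap, int32 data]."""
    n = len(values)
    validity = bytearray((n + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            # Arrow bitmaps use least-significant-bit numbering.
            validity[i // 8] |= 1 << (i % 8)
    # Null slots still occupy space in the data buffer; 0 is a placeholder.
    data = struct.pack(f"<{n}i", *(0 if v is None else v for v in values))
    return [bytes(validity), data]


def from_buffers(length, buffers):
    """Rebuild the list of Optional[int] from [validity bitmap, int32 data]."""
    validity, data = buffers
    ints = struct.unpack(f"<{length}i", data)
    return [
        ints[i] if (validity[i // 8] >> (i % 8)) & 1 else None
        for i in range(length)
    ]


buffers = to_buffers([1, None, 3])
roundtrip = from_buffers(3, buffers)
```

Because the buffers are plain `bytes`, they already support the buffer protocol, so a consumer can wrap them with `memoryview` without copying.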
Other nice to haves (and I realise you wish to keep the scope as small as possible)
The goal isn't to have the scope strictly as small as possible, but rather have minimal new code here. E.g. if the functionality already exists as a compute function in `arrow::compute`, then it should be easy to bind for export here, and thus is in scope.
The core module (`arro3.core`) should stay small, but e.g. the `arro3.compute` module can bring in more of the underlying Rust compute functions.
- str and dt compute functions
dt compute functions are pretty easy to implement. See `date_part`. `substring` and `regexp` also exist.
- parquet filter
This is a bit more complex because we'd need more work on the binding side to be able to express the parquet filter to pass to Rust. Let's work on the other topics first and come back to this.
PRs for these are welcome if you're interested, preferably one at a time.
It would be great:
Doing all of this would essentially answer what is envisaged in https://github.com/dask/fastparquet/pull/931 : getting what we really need out of arrow without the cruft. It would interoperate nicely with awkward, for example.