spiraldb / vortex

A toolkit for working with compressed Arrow in-memory, on-disk, and over-the-wire
Apache License 2.0

Rename flatten -> canonicalize + bugfix + a secret third thing #402

Closed: a10y closed this 1 week ago

a10y commented 2 weeks ago

I have a couple of threads in progress, so I figured I'd break this out into its own PR to make reviews easier:

  1. Rename *Flatten* -> Canonical / canonicalize
  2. Fix a bug with struct_array_to_arrow where it wasn't doing the conversion deeply (test added)
  3. Bit of a driveby, but implemented the ArrayAccessor for BoolArray (test added)
  4. Add an extension to DataFusion SessionContext to load a Vortex array into a dataframe

a10y commented 2 weeks ago

I eliminated the flatten_XYZ methods on Array, and instead made the into_canonical() (formerly flatten()) step explicit to make it clear to users of the API what's going on. I realize this might be controversial.

So e.g. before you'd do

array.flatten_primitive()?

and now you'd do

array.into_canonical()?.into_primitive()?
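
To make the two-step shape concrete, here is a rough toy model in Rust. It is only a sketch: the `Array` and `Canonical` enums, their variants, the `String` error type, and the `decode_primitive` helper are hypothetical stand-ins invented for illustration, not the actual Vortex definitions. Only the method names `into_canonical` and `into_primitive` come from the snippet above.

```rust
// Toy model only: hypothetical stand-ins, NOT Vortex's real types or signatures.

/// The small, fixed set of "canonical" forms every encoding decodes into.
#[derive(Debug, Clone, PartialEq)]
pub enum Canonical {
    Primitive(Vec<i64>),
    Bool(Vec<bool>),
}

/// A hypothetical array that may or may not be stored in an encoded form.
#[derive(Debug, Clone)]
pub enum Array {
    Plain(Vec<i64>),
    RunLength(Vec<(i64, usize)>), // (value, run length) pairs
    Bools(Vec<bool>),
}

impl Array {
    /// Step 1: decode whatever encoding we hold into its canonical form.
    pub fn into_canonical(self) -> Result<Canonical, String> {
        Ok(match self {
            Array::Plain(values) => Canonical::Primitive(values),
            Array::RunLength(runs) => Canonical::Primitive(
                runs.into_iter()
                    .flat_map(|(value, len)| std::iter::repeat(value).take(len))
                    .collect(),
            ),
            Array::Bools(bits) => Canonical::Bool(bits),
        })
    }
}

impl Canonical {
    /// Step 2: state which canonical variant the caller expects.
    pub fn into_primitive(self) -> Result<Vec<i64>, String> {
        match self {
            Canonical::Primitive(values) => Ok(values),
            other => Err(format!("expected a primitive array, got {other:?}")),
        }
    }
}

/// Hypothetical helper showing the explicit chain end to end.
pub fn decode_primitive(array: Array) -> Result<Vec<i64>, String> {
    array.into_canonical()?.into_primitive()
}
```

In this sketch the decode cost lives visibly in `into_canonical()`, while `into_primitive()` is just a cheap variant check, which is the distinction the explicit chain is meant to surface.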

In general, I feel like users of the API are going to want clean onramps into the Vortex world, and then a clean offramp back to Arrow so data can be exported to other tools. Canonical is the closest thing we have to an interchange format for things that want to exit the Vortex world. It's a little bit weird that Canonical still needs to be deeply flattened to get back to Arrow; I don't know if there should be another type that deeply flattens but keeps everything in Vortex land.
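
The shallow-versus-deep distinction can be sketched with another toy model. Everything here is an assumption made for illustration: the `Array` enum, the `canonicalize`, `canonicalize_deep`, and `is_fully_canonical` names, and the run-length encoding are hypothetical, and none of it reflects how Vortex or Arrow actually represent structs.

```rust
// Toy model only: hypothetical types, NOT Vortex's or Arrow's real ones.

#[derive(Debug, Clone, PartialEq)]
pub enum Array {
    Primitive(Vec<i64>),
    RunLength(Vec<(i64, usize)>), // (value, run length) pairs
    Struct(Vec<(String, Array)>), // named child arrays, possibly still encoded
}

impl Array {
    /// Shallow: decode only the outermost encoding; struct fields are left as-is.
    pub fn canonicalize(self) -> Array {
        match self {
            Array::RunLength(runs) => Array::Primitive(
                runs.into_iter()
                    .flat_map(|(value, len)| std::iter::repeat(value).take(len))
                    .collect(),
            ),
            other => other,
        }
    }

    /// Deep: canonicalize this array, then recurse into every struct field.
    pub fn canonicalize_deep(self) -> Array {
        match self.canonicalize() {
            Array::Struct(fields) => Array::Struct(
                fields
                    .into_iter()
                    .map(|(name, child)| (name, child.canonicalize_deep()))
                    .collect(),
            ),
            other => other,
        }
    }

    /// True only if no encoded node remains anywhere in the tree.
    pub fn is_fully_canonical(&self) -> bool {
        match self {
            Array::Primitive(_) => true,
            Array::RunLength(_) => false,
            Array::Struct(fields) => fields.iter().all(|(_, child)| child.is_fully_canonical()),
        }
    }
}
```

With a struct whose field is run-length encoded, `canonicalize()` returns the struct unchanged (its top level is already canonical), while `canonicalize_deep()` also decodes the field. A converter that only does the former would hand an encoded child to the export step, which is roughly the class of bug described in item 2 of the PR description.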