spiraldb / vortex

A toolkit for working with compressed Arrow in-memory, on-disk, and over-the-wire
Apache License 2.0

v0 Datafusion with late materialization #414

Closed a10y closed 1 week ago

a10y commented 1 week ago

Unfinished, just opening this as I continue to get things working.

This PR augments the original Vortex connection for DataFusion with an implementation of filter pushdown, which lets us perform late materialization on as many columns as possible.
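For readers unfamiliar with the term, here is a minimal sketch of late materialization in plain Rust. It is not the code in this PR: `Batch`, `selection`, `take`, and `filter_late` are hypothetical names, and plain `Vec`s stand in for Vortex arrays. The idea is to evaluate the predicate on the filter column alone, then decode the remaining columns only at the rows that passed.

```rust
/// A toy two-column batch. Hypothetical type for illustration only;
/// it is not a Vortex or DataFusion type.
pub struct Batch {
    pub key: Vec<i64>,
    pub payload: Vec<String>,
}

/// Step 1: evaluate the predicate on the filter column only and
/// return the indices of the rows that pass.
pub fn selection(key: &[i64], pred: impl Fn(i64) -> bool) -> Vec<usize> {
    key.iter()
        .enumerate()
        .filter_map(|(i, &k)| pred(k).then_some(i))
        .collect()
}

/// Step 2: materialize a column only at the selected row indices.
pub fn take<T: Clone>(column: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| column[i].clone()).collect()
}

/// Filter a batch, touching `payload` only for rows that survive the filter.
pub fn filter_late(batch: &Batch, pred: impl Fn(i64) -> bool) -> Batch {
    let idx = selection(&batch.key, pred);
    Batch {
        key: take(&batch.key, &idx),
        payload: take(&batch.payload, &idx),
    }
}
```

Calling `filter_late(&batch, |k| k > 2)` clones a payload value only for rows whose key exceeds 2, so a selective filter avoids most of the work on the wide column.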

Pushdown support can be toggled on/off with a flag so we can run benchmarks comparing different strategies.
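A rough sketch of what such a toggle could look like, assuming a hypothetical `Pushdown` enum and `scan` function (neither is the actual option or API in this PR). Both modes return the same rows; they differ only in how many payload values get decoded, which is the kind of difference a benchmark would measure.

```rust
/// Hypothetical knob; not the actual option name used in the PR.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Pushdown {
    Disabled,
    Enabled,
}

pub struct ScanResult {
    pub rows: Vec<(i64, String)>,
    /// Number of payload values decoded (cloned here) to produce `rows`.
    pub payload_decoded: usize,
}

/// Return rows with `key >= min_key`, with or without pushing the filter
/// into the scan.
pub fn scan(keys: &[i64], payload: &[String], min_key: i64, mode: Pushdown) -> ScanResult {
    match mode {
        Pushdown::Disabled => {
            // Decode every row first, then filter afterwards.
            let all: Vec<(i64, String)> =
                keys.iter().copied().zip(payload.iter().cloned()).collect();
            let payload_decoded = all.len();
            let rows = all.into_iter().filter(|(k, _)| *k >= min_key).collect();
            ScanResult { rows, payload_decoded }
        }
        Pushdown::Enabled => {
            // Filter on the key column first, then decode payload for survivors only.
            let rows: Vec<(i64, String)> = keys
                .iter()
                .enumerate()
                .filter(|(_, k)| **k >= min_key)
                .map(|(i, k)| (*k, payload[i].clone()))
                .collect();
            let payload_decoded = rows.len();
            ScanResult { rows, payload_decoded }
        }
    }
}
```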

I'm hoping to have an initial version of this with a benchmark harness tonight.

a10y commented 1 week ago

Output of the datafusion_benchmark on my MBP.

Note that the series labeled vortex-nopushdown-uncompressed should actually be vortex-nopushdown-compressed, and the one labeled vortex-nopushdown-uncompressed #2 is the actual vortex-nopushdown-uncompressed.

[image: datafusion_benchmark results]

Even though this is synthetic data, it still illustrates that decoding overhead is the driving factor in execution time.

There's also a latency gap between uncompressed Vortex with no pushdown and Arrow with no pushdown, but that gap is roughly the 130µs it takes to do the Vortex -> Arrow conversion (benchmarked that separately, not in the repo).

robert3005 commented 1 week ago

Right now we don't run the filters on compressed data, which would probably be the thing to fix. Anyway, this seems fixable.
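To illustrate what running a filter on compressed data can mean, here is a generic sketch using dictionary encoding. This is an assumption chosen for illustration, not a description of how Vortex encodings or this PR work; `DictColumn` and `filter_dict` are hypothetical names. The predicate is evaluated once per distinct dictionary value rather than once per row, and the column is never decoded.

```rust
/// A dictionary-encoded string column: each row stores a code that
/// indexes into `dict`. Hypothetical type for illustration only.
pub struct DictColumn {
    pub dict: Vec<String>,
    pub codes: Vec<u32>,
}

/// Evaluate `pred` directly on the encoded column and return a row mask.
pub fn filter_dict(col: &DictColumn, pred: impl Fn(&str) -> bool) -> Vec<bool> {
    // Run the predicate once per distinct value.
    let dict_hits: Vec<bool> = col.dict.iter().map(|v| pred(v.as_str())).collect();
    // Map each row's code to its precomputed result; no strings are decoded.
    col.codes.iter().map(|&c| dict_hits[c as usize]).collect()
}
```

With few distinct values and many rows, the predicate cost scales with the dictionary size instead of the row count.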

a10y commented 1 week ago

I agree. I'm going to address the last few comments of your original review and then convert this to "Ready for review".