spiraldb / vortex

An extensible, state-of-the-art columnar file format
https://vortex.dev
Apache License 2.0

Reading a Vortex array in Polars is slower than Parquet #1071

Open danking opened 1 month ago

danking commented 1 month ago

Polars reads the Parquet-encoded dataset in 800ms, whereas Vortex takes 4s.

Reproduction

Assuming you have the PBI parquet dataset downloaded, write a Vortex file:

import vortex
import pyarrow as pa
import pandas as pd
import polars as pl

# Read the Parquet file into an Arrow table and convert it to a Vortex array.
vtx = vortex.array(
    pa.Table.from_pandas(
        pd.read_parquet('bench-vortex/data/PBI/CMSprovider/parquet/CMSprovider_1.parquet')))
# Compress the array and write it out as a Vortex file.
compressed = vortex.encoding.compress(vtx)
vortex.io.write(compressed, 'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex')

Parquet takes 800-900ms:

%%time
import pyarrow.dataset  # required: `import pyarrow as pa` alone does not load pa.dataset

ds = pa.dataset.dataset(
    'bench-vortex/data/PBI/CMSprovider/parquet/CMSprovider_1.parquet'
)
lf = pl.scan_pyarrow_dataset(ds)
lf.collect()

Once the VortexDataset PR lands, you'll see that Vortex takes around 3s:

%%time
import vortex
import polars as pl

ds = vortex.arrow.VortexDataset(
    'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex'
)
lf = pl.scan_pyarrow_dataset(ds)
lf.collect()
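
To see where the ~3s goes, one rough check is to time the Vortex-to-Arrow conversion separately from the Polars scan. This is only a sketch: it assumes VortexDataset implements the standard pyarrow dataset interface (in particular to_table()), which pl.scan_pyarrow_dataset relies on, and then wraps the already-materialized table in Polars with rechunking disabled:

import time

import polars as pl
import vortex

ds = vortex.arrow.VortexDataset(
    'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex'
)

# Time the Vortex -> Arrow materialization on its own (assumes a
# pyarrow-Dataset-style to_table() method).
start = time.perf_counter()
tbl = ds.to_table()
print(f"to_table: {time.perf_counter() - start:.2f}s")

# Wrap the already-materialized Arrow table in Polars without rechunking,
# to see how much of the remaining time is Polars-side cost.
start = time.perf_counter()
df = pl.from_arrow(tbl, rechunk=False)
print(f"from_arrow(rechunk=False): {time.perf_counter() - start:.2f}s")
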
danking commented 1 month ago

It seems like most of the added time over converting to Arrow is just Polars cost, but Polars does perform a rechunk, which accounts for around 12% of total time. I'm not sure why Polars is rechunking our array. Maybe it's coming from here or here?
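
Polars generally rechunks when an incoming Arrow table's columns are split across many chunks, so one way to test that hypothesis is to inspect the chunk counts of the table the dataset hands back. A minimal sketch, again assuming VortexDataset exposes a pyarrow-Dataset-style to_table():

import vortex

ds = vortex.arrow.VortexDataset(
    'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex'
)
tbl = ds.to_table()

# If columns arrive in many chunks, Polars copies them into contiguous
# buffers on rechunk, which could account for the ~12% observed above.
for name in tbl.schema.names:
    print(name, tbl.column(name).num_chunks)
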