Open danking opened 1 month ago
Polars reads the Parquet encoded dataset in 800ms whereas Vortex takes 4s.
Assuming you have the PBI parquet dataset downloaded, write a Vortex file:
import vortex import pyarrow as pa import pandas as pd import polars as pl vtx = vortex.array( pa.Table.from_pandas( pd.read_parquet('bench-vortex/data/PBI/CMSprovider/parquet/CMSprovider_1.parquet'))) compressed = vortex.encoding.compress(vtx) vortex.io.write(compressed, 'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex')
Parquet takes 800-900ms:
%%time ds = pa.dataset.dataset( 'bench-vortex/data/PBI/CMSprovider/parquet/CMSprovider_1.parquet' ) lf = pl.scan_pyarrow_dataset(ds) lf.collect()
Once the VortexDataset PR lands, you'll see that Vortex takes around 3s:
%%time import vortex import polars as pl ds = vortex.arrow.VortexDataset( 'bench-vortex/data/PBI/CMSprovider/vortex/CMSprovider_1.vortex' ) lf = pl.scan_pyarrow_dataset(ds) lf.collect()
It seems like most of the added time over converting to Arrow is just Polars cost, but polars does do a rechunk is around 12% of total time. I'm not sure why Polars is rechunking our array. Maybe coming from here or here?
Polars reads the Parquet encoded dataset in 800ms whereas Vortex takes 4s.
reproduction
Assuming you have the PBI parquet dataset downloaded, write a Vortex file:
Parquet takes 800-900ms:
Once the VortexDataset PR lands, you'll see that Vortex takes around 3s: